elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Investigate Kibana Alerting amount of configuration pulls #161382

Open philippkahr opened 1 year ago

philippkahr commented 1 year ago

Kibana: 8.8.1

When looking at the Node.js instrumentation of Kibana, I can see what happens behind the scenes when a simple Kibana ES Query alert runs.

I observe a total of 5 calls to Elasticsearch for GET /.kibana_8.8.1/_doc/strava%3Aconfig%3A8.8.1, which doesn't make much sense, especially as each of them seems to be wrapped with has_privileges calls. Shouldn't that be handled by a single call? It's somewhat related to this discussion: https://github.com/elastic/kibana/issues/161229
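
For reference, decoding the document ID from that request shows it is the per-space config saved object rather than anything alerting-specific (a minimal Node.js sketch; the `{space}:{type}:{id}` reading is my interpretation of the raw ID format):

```ts
// The observed request: GET /.kibana_8.8.1/_doc/strava%3Aconfig%3A8.8.1
const rawId = 'strava%3Aconfig%3A8.8.1';
console.log(decodeURIComponent(rawId)); // => "strava:config:8.8.1"
// i.e. space "strava", saved-object type "config" (Advanced Settings), id "8.8.1"
```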

Looking at all the spans that are part of this trace, the 5x config pull might still seem insignificant, but that is only because this is an idle cluster with nothing else to do. Imagine a cluster under heavy load that has to serve the same data 5 times, each time accompanied by the has_privileges calls.

[screenshot: APM transactions view for the Kibana service (kb instance, GCP europe-west3, last 24h) showing the repeated config GET and has_privileges spans]

elasticmachine commented 1 year ago

Pinging @elastic/response-ops (Team:ResponseOps)

philippkahr commented 1 year ago

Instead of doing 5 GETs, could we merge those into a single MGET?
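
Roughly what that could look like with the Elasticsearch JS client - purely an illustrative sketch, since the real calls go through Kibana's saved objects / UI settings services rather than a raw client, and the index and ID here are just the ones from the trace above:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Observed today (conceptually): the same document fetched 5 times, e.g.
//   GET /.kibana_8.8.1/_doc/strava:config:8.8.1   (x5, each wrapped with has_privileges)

// Suggested shape: collect the IDs, deduplicate, and resolve them in one _mget round trip.
async function fetchConfigDocs(ids: string[]) {
  const response = await client.mget({
    index: '.kibana_8.8.1',
    ids: Array.from(new Set(ids)), // 5 identical lookups collapse to 1
  });
  return response.docs;
}
```

Whether that is feasible depends on where the 5 callers actually live, which is what the investigation discussed below would need to establish.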

pmuellr commented 1 year ago

It's not clear to me who's making these calls. I believe the [{space}:]config document is part of "Advanced Settings", which rules don't use directly. The ES Query rule type can use data views, which I'm guessing is how this leaked in.

I think we're going to have to step through the code in alerting, with some instrumentation on those ES GET calls (I guess somewhere in UI settings) to see if we can figure out where these are coming from.
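
One throwaway way to get that instrumentation, sketched below (a hypothetical debug shim, not an existing Kibana utility): wrap the client's get() so every fetch of a config document logs its call stack, which should point at whichever alerting / UI settings code path is making the request.

```ts
import type { Client } from '@elastic/elasticsearch';

// Hypothetical debug shim, not Kibana code: log a stack trace for every
// GET of a config saved object so the calling code path shows up in the logs.
export function traceConfigGets(client: Client): void {
  const originalGet = client.get.bind(client);
  (client as any).get = async (params: any, options?: any) => {
    if (typeof params?.id === 'string' && params.id.includes(':config:')) {
      console.log(`config GET ${params.id}\n${new Error('call site').stack}`);
    }
    return originalGet(params, options);
  };
}
```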

It's not clear to me that we had realized this was a UI Settings thing - which seems very odd to me - my first thought was that this was related to all the other seemingly duplicate calls we make. Whether these are the same ones? Dunno.

So, I'm going to put this back into the triage bucket for our next session. I think someone should do a time-boxed investigation to try to find out why these calls are being made - we can then figure out how to fix this as the next work item.