elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Do not allow APM UI to query frozen tier #190559

Closed smith closed 1 month ago

smith commented 2 months ago

In a Cross Cluster Search (CCS) environment, it's possible for different clusters to serve different data tiers in responses.

If one of the requested clusters responds slowly with data from the frozen tier, this can cause a timeout at the proxy after 320s, and 502 responses presented as failure toast messages in the UI with no data loaded.

Proposed solution

Don't allow APM to query the frozen tier.

We can add a must_not clause, { must_not: { term: { _tier: 'data_frozen' } } }, to the bool query of all of our requests (in the APMEventClient).
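As a rough illustration of the clause above, here is a minimal TypeScript sketch. The `QueryClause` and `BoolQuery` types and the `excludeFrozenTier` helper are hypothetical names invented for this example; they are not the actual APMEventClient code, which may wire the filter in differently.

```typescript
// Hypothetical, simplified shapes; the real Elasticsearch query types are richer.
type QueryClause = Record<string, unknown>;

interface BoolQuery {
  bool: {
    must?: QueryClause[];
    must_not?: QueryClause[];
  };
}

// The exclusion proposed in the issue.
const FROZEN_TIER_EXCLUSION: QueryClause = { term: { _tier: 'data_frozen' } };

// Returns a new bool query that keeps the caller's query (if any) under `must`
// and adds the frozen-tier exclusion under `must_not`. The input is not mutated.
function excludeFrozenTier(query?: QueryClause): BoolQuery {
  if (query === undefined) {
    return { bool: { must_not: [FROZEN_TIER_EXCLUSION] } };
  }
  return { bool: { must: [query], must_not: [FROZEN_TIER_EXCLUSION] } };
}

// Example: wrap a time-range query before sending it.
const wrappedQuery = excludeFrozenTier({ range: { '@timestamp': { gte: 'now-15m' } } });
console.log(JSON.stringify(wrappedQuery, null, 2));
```

Wrapping the original query rather than editing it in place keeps the helper independent of how each caller builds its query.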

Advanced setting

We need users to be able to exclude specified data tiers from APM requests.

I asked @elastic/kibana-data-discovery about reusing the deprecated search:includeFrozen but it might be a better idea to create a new advanced setting that behaves the same as data_views:fields_excluded_data_tiers and securitySolution:excludedDataTiersForRuleExecution:

Exclude fields from specified tiers (such as data_frozen) for faster performance. Comma delimit to exclude multiple tiers - data_warm,data_cold
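To make the comma-delimited format concrete, the sketch below parses such a setting value into a list of tiers and turns it into a `terms` exclusion on `_tier`. It assumes the setting arrives as a plain string; `parseExcludedTiers` and `buildTierExclusion` are hypothetical helpers, not existing Kibana functions, and the eventual setting could just as well be stored as an array.

```typescript
// Data tier names as used by Elasticsearch node roles.
const KNOWN_TIERS = ['data_content', 'data_hot', 'data_warm', 'data_cold', 'data_frozen'] as const;
type DataTier = (typeof KNOWN_TIERS)[number];

function isDataTier(value: string): value is DataTier {
  return (KNOWN_TIERS as readonly string[]).includes(value);
}

// Splits "data_warm, data_cold" into tier names, trimming whitespace,
// dropping unknown or empty entries, and removing duplicates.
function parseExcludedTiers(setting: string): DataTier[] {
  const tiers = setting
    .split(',')
    .map((entry) => entry.trim())
    .filter(isDataTier);
  return Array.from(new Set(tiers));
}

// Builds the must_not clauses for the parsed tiers; empty when nothing is excluded.
function buildTierExclusion(tiers: DataTier[]): Array<Record<string, unknown>> {
  return tiers.length === 0 ? [] : [{ terms: { _tier: tiers } }];
}

const excluded = parseExcludedTiers('data_warm,data_cold');
console.log(JSON.stringify(buildTierExclusion(excluded)));
```

Silently dropping unknown entries is one possible design choice; a real setting would more likely validate the value when it is saved.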

Not sure if this should be a Kibana-wide setting under Search or Observability-specific. So search:search_excluded_data_tiers or observability:search_excluded_data_tiers.

In the case of APM, all requests use APMEventClient. I assume most Observability solution plugins have a centralized place where all _search queries can be modified with one code change. It would be ok to call the setting observability and not immediately update all the non-APM plugins, but if we don't fix them all we should make follow up issues for the respective teams.
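The "one code change" idea can be sketched as a thin wrapper around whatever function actually issues the `_search` request. Everything here is an assumption for illustration: `SearchRequest`, `SearchFn` and `createTierAwareSearch` are made-up names, the index pattern is only an example, and the real APMEventClient has a different and much larger interface.

```typescript
// Hypothetical, simplified request and client shapes.
interface SearchRequest {
  index: string;
  query?: Record<string, unknown>;
}
type SearchFn = (request: SearchRequest) => Promise<unknown>;

// Wraps a search function so every request excludes the given tiers.
// With no excluded tiers the request is passed through unchanged.
function createTierAwareSearch(search: SearchFn, excludedTiers: string[]): SearchFn {
  return (request) => {
    if (excludedTiers.length === 0) {
      return search(request);
    }
    const must = request.query ? [request.query] : [];
    return search({
      ...request,
      query: {
        bool: {
          must,
          must_not: [{ terms: { _tier: excludedTiers } }],
        },
      },
    });
  };
}

// Example with a stub that only logs the final query instead of calling Elasticsearch.
const loggingSearch: SearchFn = async (request) => {
  console.log(JSON.stringify(request.query));
  return {};
};
const tierAwareSearch = createTierAwareSearch(loggingSearch, ['data_frozen']);
void tierAwareSearch({ index: 'traces-apm*', query: { term: { 'service.name': 'checkout' } } });
```

Because callers only ever see the wrapped function, plugins that already route all queries through one client could adopt the setting without touching individual queries.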

### My tasks
- [ ] https://github.com/elastic/kibana/pull/192276
- [ ] https://github.com/elastic/kibana/pull/192570
- [ ] https://github.com/elastic/kibana/pull/192373

Acceptance criteria

elasticmachine commented 2 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

lucabelluccini commented 2 months ago

It is worth mentioning it might be interesting to discuss this with Security Solutions. Maybe there's some convergence we can put in place?

Also, we need to think about alerting. We need to make sure the UI and Alerting are coherent with the setting.

crespocarlos commented 1 month ago

@lucabelluccini, Security Solution already has something for this: securitySolution:excludedDataTiersForRuleExecution. They use it in some situations. Do you know if they have experienced the same problems we have on APM?

> Also, we need to think about alerting. We need to make sure the UI and Alerting are coherent with the setting.

The plan is to use the setting everywhere we run queries in APM and Infra, and to make it available to the rest of the Observability solutions.

lucabelluccini commented 1 month ago

Hello @crespocarlos, feel free to reach out to me privately for details, but you'll see a private real-world case linked to this public issue.

crespocarlos commented 1 month ago

@smith @lucabelluccini I've spoken with Security Solution folks and we agreed to create a Kibana-wide setting.

They've also experienced the same problem (CPU spikes and cold/frozen being hit when not desired) in the past: https://github.com/elastic/kibana/pull/186908. This setting could also benefit @elastic/stack-monitoring, as it has recently had SDH issues caused by queries hitting the frozen tier.

consulthys commented 1 month ago

> This setting could also benefit @elastic/stack-monitoring, as it has recently had SDH issues caused by queries hitting the frozen tier.

Absolutely, this is coming up in a few different places around Stack Monitoring, most notably this one where we are thinking of doing something similar, but only for a specific set of shard queries that don't have time range constraints.

crespocarlos commented 1 month ago

I had a chat with @elastic/kibana-data-discovery, and they advised against implementing a Kibana-wide setting. One concern is the potential for confusion. For example:

A general exclusion could result in data not appearing in Discover or Dashboard without any clear explanation, especially when users expand the time range to find historical data.

Basically, exclusions should be analyzed case by case. @elastic/kibana-data-discovery will continue discussing a unified approach, but for now, we'll proceed with an O11y-specific setting.

lucabelluccini commented 3 days ago

A pair of questions if I may:

crespocarlos commented 3 days ago

Hi @lucabelluccini

> Will this tier filter setting be used also by APM, Alerting, SLOs and Synthetics?

The new setting could be used by other O11y apps.

> If yes, do we need follow up work on each application within O11y to address it?

Unfortunately, there isn't a centralized Elasticsearch client usage that would allow a setting like this to be applied across O11y without additional effort, so yes, we'd need follow-up work. Preferably, we'd aim to have O11y solutions consume a single Elasticsearch client wrapper instead of each application implementing it in its own way.

Besides, even within applications that may use this setting, it won't affect queries performed by platform components that use bsearch out of the box, such as Lens, because the platform also has its own way of consuming the Elasticsearch client.

lucabelluccini commented 3 days ago

Makes sense - thanks for clarifying