Closed smith closed 1 month ago
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
It is worth mentioning it might be interesting to discuss this with Security Solutions. Maybe there's some convergence we can put in place?
Also, we need to think about alerting. We need to make sure the UI and Alerting are coherent with the setting.
@lucabelluccini, Security Solution already has something for this: securitySolution:excludedDataTiersForRuleExecution. They use it in some situations. Do you know if they have experienced the same problems we have on APM?
Also, we need to think about alerting. We need to make sure the UI and Alerting are coherent with the setting.
The plan is to use the setting everywhere we run queries in APM and Infra, and make it available to the rest of the Observability solutions.
Hello @crespocarlos, feel free to reach out to me privately for details; you'll also see a private real-world case linked to this public issue.
@smith @lucabelluccini I've spoken with Security Solution folks and we agreed to create a Kibana-wide setting.
They've also experienced the same problem (CPU spikes and cold/frozen being hit when not desired) in the past: https://github.com/elastic/kibana/pull/186908. This setting could also benefit @elastic/stack-monitoring, as it has recently had SDH issues caused by queries hitting the frozen tier.
This setting could also benefit @elastic/stack-monitoring, as it has recently had SDH issues caused by queries hitting the frozen tier.
Absolutely, this is coming up in a few different places around Stack Monitoring, most notably this one, where we are thinking of doing something similar, but only for a specific set of shard queries that don't have time range constraints.
I had a chat with @elastic/kibana-data-discovery, and they advised against implementing a Kibana-wide setting. One concern is the potential for confusion. For example:
A general exclusion could result in data not appearing in Discover or Dashboard without any clear explanation, especially when users expand the time range to find historical data.
Basically, exclusions should be analyzed case by case. @elastic/kibana-data-discovery will continue discussing a unified approach, but for now, we'll proceed with an O11y-specific setting.
A pair of questions if I may:
Hi @lucabelluccini
Will this tier filter setting also be used by APM, Alerting, SLOs and Synthetics?
The new setting could be used by other O11y apps.
If yes, do we need follow-up work on each application within O11y to address it?
Unfortunately, there isn't a centralized Elasticsearch client usage that would allow a setting like this to be applied across O11y without additional effort, so yes, we'd need follow-up work. Preferably, we'd aim to have O11y solutions consume a single Elasticsearch client wrapper instead of each application implementing it in its own way.
Besides, even within applications that may use this setting, it won't affect queries performed by platform components that use bsearch out of the box, such as Lens, because the platform also has its own way of consuming the Elasticsearch client.
Makes sense - thanks for clarifying
In a Cross Cluster Search (CCS) environment, it's possible for different clusters to serve different data tiers in responses.
If one of the requested clusters responds slowly with data from the frozen tier, this can cause a timeout at the proxy after 320s, and 502 responses presented as failure toast messages in the UI with no data loaded.
Proposed solution
Don't allow APM to query the frozen tier.
We can add a
{must_not: { term: { _tier: 'data_frozen' } } }
query to all of our requests (in the APMEventClient).
Advanced setting
We need users to be able to exclude specified data tiers from APM requests.
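As a rough illustration of the idea, the sketch below shows how a configured list of excluded tiers could be turned into a must_not clause and merged into an existing query. The names `DataTier`, `QueryContainer`, `excludeTiersClause` and `withExcludedTiers` are hypothetical and are not part of APMEventClient or any Kibana API; it only assumes that Elasticsearch accepts a terms query on the `_tier` metadata field.

```typescript
// Hypothetical helper, not the actual APMEventClient implementation.
type DataTier = 'data_hot' | 'data_warm' | 'data_cold' | 'data_frozen';

// Loose stand-in for an Elasticsearch query DSL object.
interface QueryContainer {
  [key: string]: unknown;
}

// Build the must_not clauses for the configured tiers (empty if none are set).
function excludeTiersClause(tiers: DataTier[]): QueryContainer[] {
  return tiers.length > 0 ? [{ terms: { _tier: tiers } }] : [];
}

// Wrap an optional query so that documents on the excluded tiers are skipped.
// The original query is kept under `must` so its scoring is preserved.
function withExcludedTiers(
  query: QueryContainer | undefined,
  excludedTiers: DataTier[]
): QueryContainer {
  const mustNot = excludeTiersClause(excludedTiers);
  if (mustNot.length === 0) {
    // Nothing to exclude: leave the request untouched.
    return query ?? { match_all: {} };
  }
  return {
    bool: {
      ...(query ? { must: [query] } : {}),
      must_not: mustNot,
    },
  };
}

// Example: the value would come from the advanced setting (assumed here).
const excluded: DataTier[] = ['data_frozen'];
const wrapped = withExcludedTiers({ term: { 'service.name': 'checkout' } }, excluded);
console.log(JSON.stringify(wrapped, null, 2));
```

Applying something like this in one central place (for APM, wherever APMEventClient builds its _search requests) would keep the exclusion consistent across all queries, and an empty setting would leave requests unchanged.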
I asked @elastic/kibana-data-discovery about reusing the deprecated search:includeFrozen, but it might be a better idea to create a new advanced setting that behaves the same as data_views:fields_excluded_data_tiers and securitySolution:excludedDataTiersForRuleExecution.
Not sure if this should be a Kibana-wide setting under Search or Observability-specific. So search:search_excluded_data_tiers or observability:search_excluded_data_tiers.
In the case of APM, all requests use APMEventClient. I assume most Observability solution plugins have a centralized place where all _search queries can be modified with one code change. It would be ok to call the setting observability and not immediately update all the non-APM plugins, but if we don't fix them all we should make follow-up issues for the respective teams.
Acceptance criteria