jennypavlova opened 1 month ago
The `getClustersState` query always returns an empty result set, as there are no documents in the monitoring indices (`.monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*`) with `"type": "cluster_state"`. This is confirmed by the queries run by the customer and shared by @louisong in his comment, for both the custom user and the superuser (see Query 1 in those shared files, also provided below). Even though they return nothing, it would still be interesting to know the `took` time of those two queries.
```
Query 1
===============================================================================================
GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "cluster_state" } },
        { "terms": { "cluster_uuid": [ "GkDDZY7mT42RyVatNmEnbA" ] } }
      ]
    }
  },
  "collapse": { "field": "cluster_uuid" },
  "sort": { "timestamp": { "order": "desc", "unmapped_type": "long" } }
}

Response
===============================================================================================
{} - no output
```
=> I think this query can be ruled out and we should probably not focus on it.
Regarding the `getShardStats` and `getUnassignedShardData` queries: as stated by @klacabane here, the `shard` documents are "siloed" by cluster state. In order to get a true picture of the assigned/unassigned shards, the queries need to fetch all shards for a given cluster state, and that is what the `"elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID]` constraint does in both queries. That's semantically equivalent to providing the time range `[last_cluster_state_change_time, now]`.
I assume that the latest cluster `state_uuid` is used in the two shard queries. Even if the `shard` documents that are retrieved are most certainly "recent" (i.e. from the latest cluster state), the lack of an explicit time range constraint might indeed prevent the search from skipping frozen shards.
A word of caution here: adding a time range to these queries might alter the current behavior, since not all clusters change their state at the same pace. That being said, it could probably help to add a reasonable time range (e.g. the last 10 days) in order to increase the odds of leveraging the pre-filtering phase and skipping frozen shards. However, if the cluster state hasn't changed during that period, the result would be empty. Maybe, in a second iteration, this time range should even be made configurable to cater for this possibility.
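As a sketch of what such a time range constraint could look like, here is an abbreviated shard query with a 10-day `range` filter added. The `"type": "shards"` term, the `timestamp` field name and the window size are assumptions for illustration, not the exact production query:

```
GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "shards" } },
        { "term": { "elasticsearch.cluster.stats.state.state_uuid": "CLUSTER_STATE_UUID" } },
        { "range": { "timestamp": { "gte": "now-10d", "lte": "now" } } }
      ]
    }
  }
}
```

The `range` filter is what would give the search a chance to skip frozen shards entirely during the pre-filtering phase.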
And since we're looking at improving performance, it could also make sense to remove the `sort` clause, since sorting serves no purpose in an aggregation query with `size: 0`.
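For illustration: with `size: 0` the response contains no hits to order, so a top-level `sort` only adds work. An aggregation-only request (the index pattern, `type` term and aggregation field below are illustrative, not the exact production query) needs nothing more than:

```
GET .monitoring-es-*/_search
{
  "size": 0,
  "query": { "term": { "type": "shards" } },
  "aggs": {
    "shard_states": { "terms": { "field": "shard.state" } }
  }
}
```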
Before attempting anything here, I'm going to ask support to have the customer re-run those `shard` queries with an additional 10-day-ish time range in order to see if that helps at all.
Circling back to this after discussing with support: adding a time range to those queries made them run much faster, and they no longer hit the frozen tier.
Knowing that they execute fast when they do not hit the frozen tier, we might not even have to add a time frame to these two queries, since it should be impossible for the `shard` metric set documents that are part of the latest cluster state to be in any tier other than the hot tier. As a result, we could maybe leverage the `_tier` metadata field and only query the hot tier, which is what I'm going to ask support to try.
Adding a constraint on the `_tier` metadata field proves to be even more optimal than adding a time range (which would be difficult to set, given that ILM policies can be configured very differently from user to user).
A quick summary of how much the query `took` time decreased when using a time range or a tier constraint can be found below.
| Query | User | Using `_tier` | Using time range | No constraint |
|---|---|---|---|---|
| getShardStats | User with DLS | 1819 ms | 2066 ms | 12009 ms |
| getShardStats | Superuser (no DLS) | 1561 ms | 5881 ms | 38850 ms |
| getUnassignedShardData | User with DLS | 1445 ms | 1423 ms | 9992 ms |
| getUnassignedShardData | Superuser (no DLS) | 1529 ms | 2603 ms | 7725 ms |
If we would like to pursue this, we see two options:

- constrain the queries to the hot tier with `"_tier": "data_hot"`
- exclude the frozen tier with `"_tier": "data_frozen"`
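As a sketch, the two options differ only in the `bool` clause used; only the relevant fragment is shown here, the rest of each query stays unchanged.

Option 1, restrict the search to the hot tier:

```
{ "bool": { "filter": [ { "term": { "_tier": "data_hot" } } ] } }
```

Option 2, keep all tiers except frozen:

```
{ "bool": { "must_not": [ { "term": { "_tier": "data_frozen" } } ] } }
```

Option 1 is the stricter of the two; Option 2 is more forgiving if relevant documents have already moved to the warm or cold tier.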
Related to https://github.com/elastic/sdh-elasticsearch/issues/8151
There is a reported issue of timeouts while using Stack Monitoring. After some investigation, we saw that in Stack Monitoring we have queries without a date range filter:

- `getClustersState`
- `getShardStats` (both run this query)
- `getUnassignedShardData`

The idea here is to investigate how to improve these queries and possibly include a time range while maintaining the same functionality.