
Investigate timeout issue and use of time range in stack monitoring queries #189728

Open jennypavlova opened 1 month ago

jennypavlova commented 1 month ago

Related to https://github.com/elastic/sdh-elasticsearch/issues/8151

There is a reported issue of timeouts while using Stack Monitoring. After some investigation we saw that Stack Monitoring runs queries without a date range filter:

getClustersState

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
    "query": {
      "bool": {
        "filter": [
          {
            "term": {
              "type": "cluster_state"
            }
          },
          {
            "terms": {
              "cluster_uuid": [
                CLUSTER_UUID
              ]
            }
          }
        ]
      }
    },
    "collapse": {
      "field": "cluster_uuid"
    },
    "sort": {
      "timestamp": {
        "order": "desc",
        "unmapped_type": "long"
      }
    }
}

getShardStats

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.shard-*/_search?ignore_unavailable=true
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "data_stream.dataset": "elasticsearch.stack_monitoring.shard"
                }
              },
              {
                "term": {
                  "metricset.name": "shard"
                }
              },
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": [CLUSTER_UUID]
          }
        },
        {
          "term": {
            "elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID]
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "shard.node": [SHARD_NODE]
                }
              },
              {
                "term": {
                  "elasticsearch.node.id":[NODE_ID]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 10000
      },
      "aggs": {
        "states": {
          "terms": {
            "field": "shard.state",
            "size": 10
          },
          "aggs": {
            "primary": {
              "terms": {
                "field": "shard.primary",
                "size": 2
              }
            }
          }
        }
      }
    },
    "nodes": {
      "terms": {
        "field": "shard.node",
        "size": 10000
      },
      "aggs": {
        "index_count": {
          "cardinality": {
            "field": "shard.index"
          }
        },
        "node_names": {
          "terms": {
            "field": "source_node.name",
            "size": 10
          }
        },
        "node_ids": {
          "terms": {
            "field": "source_node.uuid",
            "size": 1
          }
        }
      }
    }
  }
}

- Loading the Indices page results in a timeout
- Loading the Machine Learning page results in a timeout

Both run this query

getUnassignedShardData

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.shard-*/_search?ignore_unavailable=true
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "data_stream.dataset": "elasticsearch.stack_monitoring.shard"
                }
              },
              {
                "term": {
                  "metricset.name": "shard"
                }
              },
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": [CLUSTER_UUID]
          }
        },
        {
          "term": {
            "elasticsearch.cluster.stats.state.state_uuid":  [CLUSTER_STATE_UUID]
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 10000
      },
      "aggs": {
        "state": {
          "filter": {
            "terms": {
              "shard.state": [
                "UNASSIGNED",
                "INITIALIZING"
              ]
            }
          },
          "aggs": {
            "primary": {
              "terms": {
                "field": "shard.primary",
                "size": 2
              }
            }
          }
        }
      }
    }
  }
}

The idea here is to investigate how to improve these queries and possibly include a time range, while maintaining the same functionality.

consulthys commented 3 weeks ago

Regarding the getClustersState query

That query always returns an empty result set, as there are no documents in the monitoring indices (.monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*) with "type": "cluster_state". This is confirmed by the queries run by the customer and shared by @louisong in his comment, for both the custom user and the superuser (see Query 1 in those shared files, also provided below). Even though they return nothing, it would still be interesting to know the `took` time of those two queries.

Query 1

```
Query 1
===============================================================================================
GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "cluster_state"
          }
        },
        {
          "terms": {
            "cluster_uuid": [
              "GkDDZY7mT42RyVatNmEnbA"
            ]
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "cluster_uuid"
  },
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  }
}

Response
===============================================================================================
{} - no output
```

=> I think this query can be ruled out and we should probably not focus on it.

Regarding the getShardStats and getUnassignedShardData queries

As stated by @klacabane here, the shard documents are "siloed" by cluster state. In order to get a true picture of the assigned/unassigned shards, the queries need to fetch all shards for a given cluster state, and that is what the "elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID] constraint does in both queries. That is semantically equivalent to providing the time range [last_cluster_state_change_time, now].

I assume that the latest cluster state_uuid is used in the two shard queries. Even if the shard documents that are retrieved are most certainly "recent" (i.e. from the latest cluster state), the lack of a time range constraint might indeed prevent frozen shards from being skipped.

A word of caution here: adding a time range to these queries might alter the current behavior, since not all clusters change their state at the same pace. That being said, adding a reasonable time range (e.g. the last 10 days) could probably help increase the odds of leveraging the pre-filtering phase and skipping frozen shards; see the sketch below. However, if the cluster state hasn't changed during that period, the result would be empty. Maybe, in a second iteration, this time range should even be made configurable to cater for this possibility.
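For illustration only, such a constraint could be one extra clause appended to the existing `filter` array of both shard queries (the 10-day window below is just an example value, not a decided default):

```
{
  "range": {
    "timestamp": {
      "gte": "now-10d",
      "lte": "now"
    }
  }
}
```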

And since we're looking at improving performance, it could also make sense to remove the sort clause, which has no effect in an aggregation-only query with size: 0.

Before attempting anything here, I'm going to ask support to have the customer re-run those shard queries with an additional 10-days-ish time range in order to see if that helps at all.

consulthys commented 2 weeks ago

Circling back to this after discussing with support: adding a time range to those queries showed that they run much faster and no longer hit the frozen tier.

Knowing that they execute fast when they do not hit the frozen tier, we might not even have to add a time frame to these two queries, since it should be impossible for the shard metric set documents that are part of the latest cluster state to be in any other tier than the hot tier. As a result, we could maybe leverage the _tier metadata field and only query the hot tier, which is what I'm going to ask support to try.
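As a rough sketch of that idea, the hot-tier-only variant would simply be one more term clause in the `filter` array of both shard queries, relying on the built-in `_tier` metadata field:

```
{
  "term": {
    "_tier": "data_hot"
  }
}
```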

consulthys commented 1 week ago

Adding a constraint on the _tier metadata field proves to be even more effective than adding a time range (which would be difficult to choose, given that ILM policies can be configured very differently from user to user).

A quick summary of how much the query took time decreased when using a tier constraint or a time range can be found below.

| Query | User | Using _tier | Using time range | No constraint |
| --- | --- | --- | --- | --- |
| getShardStats | User with DLS | 1819 ms | 2066 ms | 12009 ms |
| getShardStats | Superuser (no DLS) | 1561 ms | 5881 ms | 38850 ms |
| getUnassignedShardData | User with DLS | 1445 ms | 1423 ms | 9992 ms |
| getUnassignedShardData | Superuser (no DLS) | 1529 ms | 2603 ms | 7725 ms |

If we would like to pursue this, we see two options (both sketched below):

  1. only query "_tier": "data_hot"
  2. similarly to what's being done for APM, simply exclude "_tier": "data_frozen"
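As a sketch, assuming the shard queries keep their current structure, option 1 corresponds to the `_tier` term clause shown in the previous comment, while option 2 would be an exclusion clause like the following added to the `filter` array:

```
{
  "bool": {
    "must_not": {
      "term": {
        "_tier": "data_frozen"
      }
    }
  }
}
```

Option 2 is more permissive, since it still allows the warm and cold tiers, whereas option 1 assumes the relevant shard documents always live on the hot tier.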