
Investigate timeout issue and use of time range in stack monitoring queries #189728

Open jennypavlova opened 1 month ago

jennypavlova commented 1 month ago

Related to https://github.com/elastic/sdh-elasticsearch/issues/8151

There is a reported issue of timeouts while using Stack Monitoring. After some investigation we saw that Stack Monitoring runs queries without a date range filter:

getClustersState

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
    "query": {
      "bool": {
        "filter": [
          {
            "term": {
              "type": "cluster_state"
            }
          },
          {
            "terms": {
              "cluster_uuid": [
                CLUSTER_UUID
              ]
            }
          }
        ]
      }
    },
    "collapse": {
      "field": "cluster_uuid"
    },
    "sort": {
      "timestamp": {
        "order": "desc",
        "unmapped_type": "long"
      }
    }
}

getShardStats

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.shard-*/_search?ignore_unavailable=true
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "data_stream.dataset": "elasticsearch.stack_monitoring.shard"
                }
              },
              {
                "term": {
                  "metricset.name": "shard"
                }
              },
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": [CLUSTER_UUID]
          }
        },
        {
          "term": {
            "elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID]
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "shard.node": [SHARD_NODE]
                }
              },
              {
                "term": {
                  "elasticsearch.node.id":[NODE_ID]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 10000
      },
      "aggs": {
        "states": {
          "terms": {
            "field": "shard.state",
            "size": 10
          },
          "aggs": {
            "primary": {
              "terms": {
                "field": "shard.primary",
                "size": 2
              }
            }
          }
        }
      }
    },
    "nodes": {
      "terms": {
        "field": "shard.node",
        "size": 10000
      },
      "aggs": {
        "index_count": {
          "cardinality": {
            "field": "shard.index"
          }
        },
        "node_names": {
          "terms": {
            "field": "source_node.name",
            "size": 10
          }
        },
        "node_ids": {
          "terms": {
            "field": "source_node.uuid",
            "size": 1
          }
        }
      }
    }
  }
}

- Loading the Indices page results in a timeout
- Loading the Machine Learning page results in a timeout

Both run this query

getUnassignedShardData

GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.shard-*/_search?ignore_unavailable=true
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "data_stream.dataset": "elasticsearch.stack_monitoring.shard"
                }
              },
              {
                "term": {
                  "metricset.name": "shard"
                }
              },
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "term": {
            "cluster_uuid": [CLUSTER_UUID]
          }
        },
        {
          "term": {
            "elasticsearch.cluster.stats.state.state_uuid":  [CLUSTER_STATE_UUID]
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 10000
      },
      "aggs": {
        "state": {
          "filter": {
            "terms": {
              "shard.state": [
                "UNASSIGNED",
                "INITIALIZING"
              ]
            }
          },
          "aggs": {
            "primary": {
              "terms": {
                "field": "shard.primary",
                "size": 2
              }
            }
          }
        }
      }
    }
  }
}

The idea here is to investigate how to improve these queries and possibly include a time range, while maintaining the same functionality.

consulthys commented 3 weeks ago

Regarding the getClustersState query

That query always returns an empty result set, as there are no documents in the monitoring indices (.monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*) with "type": "cluster_state". This is confirmed by the queries run by the customer and shared by @louisong in his comment, for both the custom user and the superuser (see Query 1 in those shared files, also provided below). Even though they return nothing, it would still be interesting to know the `took` time of those two queries.

Query 1

```
Query 1
===============================================================================================
GET .monitoring-es-*,metrics-elasticsearch.stack_monitoring.*-*/_search?ignore_unavailable=true&filter_path=hits.hits._source.cluster_uuid,hits.hits._source.elasticsearch.cluster.id,hits.hits._source.cluster_state,hits.hits._source.elasticsearch.cluster.stats.state
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "cluster_state"
          }
        },
        {
          "terms": {
            "cluster_uuid": [
              "GkDDZY7mT42RyVatNmEnbA"
            ]
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "cluster_uuid"
  },
  "sort": {
    "timestamp": {
      "order": "desc",
      "unmapped_type": "long"
    }
  }
}

Response
===============================================================================================
{} - no output
```

=> I think this query can be ruled out and we should probably not focus on it.

Regarding the getShardStats and getUnassignedShardData queries

As stated by @klacabane here, the shard documents are "siloed" by cluster state. In order to get a true picture of the assigned/unassigned shards, the queries need to fetch all shards for a given cluster state, and that is what the "elasticsearch.cluster.stats.state.state_uuid": [CLUSTER_STATE_UUID] constraint does in both queries. That is semantically equivalent to providing the time range [last_cluster_state_change_time, now].

I assume that the latest cluster state_uuid is used in the two shard queries. Even if the shard documents that are retrieved are most certainly "recent" (i.e. from the latest cluster state), the lack of a time range constraint might indeed prevent frozen shards from being skipped.

A word of caution here: adding a time range to these queries might alter the current behavior, since not all clusters change their state at the same pace. That being said, adding a reasonable time range (e.g. the last 10 days) could probably help increase the odds of leveraging the pre-filtering phase and skipping frozen shards; see the sketch below. However, if the cluster state hasn't changed during that period, the result would be empty. Maybe, in a second iteration, this time range should even be made configurable to cater for this possibility.
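For illustration only, such a constraint could be one extra clause appended to the existing `filter` array of both shard queries (the 10-day window below is just an example value, not a decided default):

```
{
  "range": {
    "timestamp": {
      "gte": "now-10d",
      "lte": "now"
    }
  }
}
```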

And since we're looking at improving performance, it could also make sense to remove the sort clause, which has no effect in an aggregation-only query with size: 0.

Before attempting anything here, I'm going to ask support to have the customer re-run those shard queries with an additional 10-days-ish time range in order to see if that helps at all.

consulthys commented 2 weeks ago

Circling back to this after discussing with support: adding a time range to those queries showed that they run much faster and no longer hit the frozen tier.

Knowing that they execute fast when they do not hit the frozen tier, we might not even have to add a time frame to these two queries, since it should be impossible for the shard metric set documents that are part of the latest cluster state to be in any other tier than the hot tier. As a result, we could maybe leverage the _tier metadata field and only query the hot tier, which is what I'm going to ask support to try.
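As a rough sketch of that idea, the hot-tier-only variant would simply be one more term clause in the `filter` array of both shard queries, relying on the built-in `_tier` metadata field:

```
{
  "term": {
    "_tier": "data_hot"
  }
}
```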

consulthys commented 1 week ago

Adding a constraint on the _tier metadata field proves to be even more effective than adding a time range (which would be difficult to choose, given that ILM policies can be configured very differently from user to user).

A quick summary of how much the query took time decreased when using a tier constraint or a time range can be found below.

| Query | User | Using _tier | Using time range | No constraint |
| --- | --- | --- | --- | --- |
| getShardStats | User with DLS | 1819 ms | 2066 ms | 12009 ms |
| getShardStats | Superuser (no DLS) | 1561 ms | 5881 ms | 38850 ms |
| getUnassignedShardData | User with DLS | 1445 ms | 1423 ms | 9992 ms |
| getUnassignedShardData | Superuser (no DLS) | 1529 ms | 2603 ms | 7725 ms |

If we would like to pursue this, we see two options (both sketched below):

  1. only query "_tier": "data_hot"
  2. similarly to what's being done for APM, simply exclude "_tier": "data_frozen"
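As a sketch, assuming the shard queries keep their current structure, option 1 corresponds to the `_tier` term clause shown in the previous comment, while option 2 would be an exclusion clause like the following added to the `filter` array:

```
{
  "bool": {
    "must_not": {
      "term": {
        "_tier": "data_frozen"
      }
    }
  }
}
```

Option 2 is more permissive, since it still allows the warm and cold tiers, whereas option 1 assumes the relevant shard documents always live on the hot tier.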