elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add a `max_primary_shard_docs` to default ILM policies #87246

Closed jpountz closed 1 year ago

jpountz commented 2 years ago

Description

Elasticsearch ships with some built-in ILM policies, e.g. there is one called 30-days-default that looks like this:

{
  "phases": {
    "hot": {
      "actions": {
        "rollover": {
          "max_primary_shard_size": "50gb",
          "max_age": "30d"
        }
      }
    },
    "warm": {
      "min_age": "2d",
      "actions": {
        "shrink": {
          "number_of_shards": 1
        },
        "forcemerge": {
          "max_num_segments": 1
        }
      }
    },
    "delete": {
      "min_age": "30d",
      "actions":{
        "delete": {}
      }
    }
  },
  "_meta": {
    "description": "built-in ILM policy using the hot and warm phases with a retention of 30 days",
    "managed": true
  }
}

I would suggest adding a max_primary_shard_docs condition next to the max_primary_shard_size. The actual value needs more discussion, but I'm thinking of something on the order of 200M in order to get a better search experience with space-efficient datasets. So datasets where documents take more than 50GB / 200M ≈ 268 bytes would roll over at 50GB, while more space-efficient datasets would roll over at 200M documents. My motivation is the following:

elasticmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

jpountz commented 2 years ago

@nik9000 pointed out that some of the work happening on synthetic _source and metrics indices will bring down disk usage to 60 bytes per document. With 50GB primary shards, this would be 895M docs. A good way to think of the performance of aggregation queries is that it takes about 1s of CPU time per 10M docs. So such a shard would take 1min 30s to aggregate all its data. It would be disappointing if this space efficiency improvement resulted in slower query performance because some optimizations are no longer applicable and because searches are less parallelizable. In contrast, if we added max_primary_shard_docs: 200M to the rollover configuration, indices would rollover at ~11GB, but aggregating all the data in a shard in a single thread would take ~20s.
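
To make that concrete, here is a sketch of what the hot phase of 30-days-default could look like with the extra condition (the 200M value is only the number being floated in this thread, not a settled default):

{
  "phases": {
    "hot": {
      "actions": {
        "rollover": {
          "max_primary_shard_size": "50gb",
          "max_primary_shard_docs": 200000000,
          "max_age": "30d"
        }
      }
    }
  }
}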

In my view, this change is one of several changes that we will need to consider to make search performance more predictable and/or better, including also concurrent shard searches, rollups, query-time sampling, and maybe figuring out better index-time sorts.

dakrone commented 2 years ago

@jpountz thanks for adding the analysis; it sounds like this would be a good change to make then, since documents are getting smaller and smaller as we go.

dakrone commented 2 years ago

We discussed this as a team today and decided that we're generally +1 on adding it. Our only concern was adding this field to our default policies when the ILM UI in Kibana doesn't support it (so it would be invisible there), so I've opened https://github.com/elastic/kibana/issues/135088 to hopefully get that part resolved as well.

jpountz commented 2 years ago

Thanks a lot for checking the reasoning above and figuring out the dependencies.

I'll wait some more time before making the change to get feedback on the number of docs for rollovers as well. I picked 200M because ~250 bytes per document is a number I observed on an Nginx access log dataset, i.e. a logging dataset that has relatively small docs, but I'm open to other ideas. In any case, it should be something that we could change later on; this wouldn't be a breaking change.

felixbarny commented 2 years ago

I think this makes sense, but I'm not sure it addresses the root cause of why some of the observability dashboards are slow in certain scenarios.

What I do agree with is that worst-case performance (aggregating data of a whole shard) is something that's definitely relevant. Not sure if the best way to tackle it is to make the shards smaller (this proposal) or to enable concurrent searching within a shard. IINM, the ESQL query engine can do concurrent searches on a single shard. Is that something we can expect for the current query engine in the future, too? Will the query engines converge?

I think the more fundamental problem behind slow-loading dashboards is that we're doing aggregations over too many documents. Especially if the cardinality of a time series dimension is relatively high (like 100s or 1000s of containers) and the time range gets larger (1w, 1m, 1y), it's just infeasible to aggregate all the raw data points in an ad-hoc manner. Aggregating a billion documents to populate a graph that only uses 1000 data points (i.e. date_histogram buckets) doesn't seem sensible; the sketch below shows roughly the shape of such a panel request. This is probably not the best place to discuss this, so I won't go into more detail here.
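
The index pattern, field names (container.id, @timestamp), and intervals in this sketch are only placeholders; the point is that one dimension plus a date_histogram fans out into a huge number of documents scanned per rendered data point:

GET metrics-*/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-7d" } }
  },
  "aggs": {
    "per_container": {
      "terms": { "field": "container.id", "size": 1000 },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "3h" }
        }
      }
    }
  }
}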

jpountz commented 2 years ago

IINM, the ESQL query engine can do concurrent searches on a single shard. Is that something we can expect for the current query engine in the future, too? Will the query engines converge?

You are right. There used to be resistance to shard-level concurrency, but I think we have been changing minds over the past year. ESQL can indeed parallelize down to the segment level, and across commands, and there have also been discussions about parallelizing within a segment. This would definitely help with this sort of use case. Whether we will prioritize request parallelism for _search is still unclear, though.

If you write your thoughts about aggregations over longer periods of time somewhere, I'll be interested in checking them out.

martijnvg commented 1 year ago

I think that instead of rolling over by default when a primary shard hits a maximum number of documents, we should enforce this in the rollover action / API. The number of documents in a shard is an important factor in query time for certain queries and aggregations. The memory usage of bitset-based caches can also grow linearly with the number of documents in a shard, and there are queries that create such bitsets at query execution time to ensure Lucene doc ids are emitted in order. Finally, we should never run into the maximum number of documents a Lucene index can hold. So for data streams / indices that we manage on behalf of the user, I think we should guard against those stability issues and force a rollover if we reach a certain number of documents.
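
For reference, the condition can already be requested explicitly through the rollover API; a request along these lines (my-data-stream is just a placeholder name) rolls over as soon as either condition is met:

POST my-data-stream/_rollover
{
  "conditions": {
    "max_primary_shard_size": "50gb",
    "max_primary_shard_docs": 200000000
  }
}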

I think we can start by setting this built-in limit to 200M, but over time we may want to decrease it: a FixedBitSet for 200M docs (one bit per document, so roughly 200,000,000 / 8 bytes) still takes ~24MB of heap space.

felixbarny commented 1 year ago

Thanks for creating https://github.com/elastic/elasticsearch/pull/94065, @martijnvg.

I think in addition to that, we should also change the defaults in the default policy for metrics-*-* https://github.com/elastic/elasticsearch/blob/985a2fb9b74e72b9c91517f34d8712018ab0fb61/x-pack/plugin/core/src/main/resources/metrics-policy.json#L5-L8
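
For illustration, the rollover block in metrics-policy.json might then look something like this (the max_age / max_primary_shard_size values shown are assumptions about the current policy, and the docs threshold is just the 200M figure discussed above):

"rollover": {
  "max_age": "30d",
  "max_primary_shard_size": "50gb",
  "max_primary_shard_docs": 200000000
}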

Once we've done that, the o11y team can make sure to add this as a default for all integrations.

martijnvg commented 1 year ago

@felixbarny That makes sense to me. I think we just need to add a max_primary_shard_docs condition for TSDB, but lower than the built-in 200M defined in #94065.

felixbarny commented 1 year ago

Based on a quick search, there are a couple of built-in policies:

Should we change all of them or is there a central place where we can set this? Maybe similar to https://github.com/elastic/elasticsearch/pull/94065 but applying that to all data streams, not just for TSDS.

Metric data streams appear to have the biggest risk of having a lot of documents once the data stream reaches 50gb (especially with synthetic source enabled). Unless adding a default for all data streams is imminent, I propose we add max_primary_shard_docs to metrics-policy.json.

felixbarny commented 1 year ago

Should we change all of them or is there a central place where we can set this? Maybe similar to https://github.com/elastic/elasticsearch/pull/94065 but applying that to all data streams, not just for TSDS.

https://github.com/elastic/elasticsearch/pull/94065 has evolved to roll over all data streams, not just TSDS, once they hit 200M documents 🎉 Thanks for driving this, @martijnvg!

I don't see a reason to change any of the built-in or solution-provided ILM policies then.

kunisen commented 1 year ago

Hi @felixbarny @martijnvg, drive-by comment (per an internal chat): it's in the release notes but not in the ilm-rollover doc. Should we put it into the below doc too? https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-rollover.html

@rseldner ^ ^

martijnvg commented 1 year ago

Should we put it into the below doc too?

Yes, we should do this. Unfortunately I forgot to update the docs when making this change.