Pinging @elastic/es-data-management (Team:Data Management)
@nik9000 pointed out that some of the work happening on synthetic `_source` and metrics indices will bring disk usage down to 60 bytes per document. With 50GB primary shards, this would be 895M docs. A good way to think about the performance of aggregation queries is that they take about 1s of CPU time per 10M docs, so such a shard would take about 1min 30s to aggregate all of its data. It would be disappointing if this space-efficiency improvement resulted in slower query performance because some optimizations are no longer applicable and because searches are less parallelizable. In contrast, if we added `max_primary_shard_docs: 200M` to the rollover configuration, indices would roll over at ~11GB, but aggregating all the data in a shard in a single thread would take ~20s.
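For illustration, a minimal sketch of what that could look like as an ILM hot-phase rollover action (the `max_age`/`max_primary_shard_size` values are just the usual defaults discussed in this thread, and the policy structure shown is the standard ILM format rather than any specific built-in policy):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "30d",
            "max_primary_shard_size": "50gb",
            "max_primary_shard_docs": 200000000
          }
        }
      }
    }
  }
}
```

Rollover triggers as soon as any one of the conditions is met, so space-efficient datasets would hit the doc-count condition first, while denser datasets would keep rolling over at 50GB as they do today.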
In my view, this change is one of several changes that we will need to consider to make search performance more predictable and/or better, including also concurrent shard searches, rollups, query-time sampling, and maybe figuring out better index-time sorts.
@jpountz thanks for adding the analysis; it sounds like this would be a good change to make then, since documents are getting smaller and smaller as we go.
We discussed this as a team today and decided that we're generally +1 on adding it. Our only concern was adding this field to our default policies when the ILM UI in Kibana doesn't support it (so it would be invisible there), so I've opened https://github.com/elastic/kibana/issues/135088 to hopefully get that part resolved as well.
Thanks a lot for checking the above thinking and figuring out dependencies.
I'll wait some more time before making the change to get feedback on the number of docs for rollovers as well. I picked 200M because ~250 bytes per document is a number I observed on an Nginx access log dataset, i.e. a logging dataset that has relatively small docs, but I'm open to other ideas. In any case, it should be something that we could change later on; this wouldn't be a breaking change.
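As a back-of-the-envelope check (my arithmetic, not from the discussion above): at ~250 bytes per document, the 200M doc-count condition and the existing 50GB size condition trip at roughly the same point, so the new condition would mainly affect datasets that are more space-efficient than that:

$$200 \times 10^{6}\ \text{docs} \times 250\ \text{B/doc} = 5 \times 10^{10}\ \text{B} \approx 46.6\ \text{GiB} \approx 50\ \text{GB}$$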
I think this makes sense, but I'm not sure it addresses the root cause of why some of the observability dashboards are slow in certain scenarios.
What I do agree with is that worst-case performance (aggregating data of a whole shard) is something that's definitely relevant. Not sure if the best way to tackle it is to make the shards smaller (this proposal) or to enable concurrent searching within a shard. IINM, the ESQL query engine can do concurrent searches on a single shard. Is that something we can expect for the current query engine in the future, too? Will the query engines converge?
I think the more fundamental problem for slow loading dashboards is that we're doing aggregations over too many documents. Especially if the cardinality of a time series dimension is relatively high (like 100s or 1000s of containers), and the time range gets larger (1w, 1m, 1y), it's just infeasible to aggregate all the raw data points in an ad-hoc manner. Aggregating a billion documents to populate a graph that just uses 1000 data points (i.e. date_histogram buckets) doesn't seem sensible. This is probably not the best place to discuss this so I'll not go into more detail here.
> IINM, the ESQL query engine can do concurrent searches on a single shard. Is that something we can expect for the current query engine in the future, too? Will the query engines converge?
You are right. There used to be resistance to shard-level concurrency, but I think that we have been changing minds over the past year. ESQL can indeed parallelize down to the segment level, and across commands, and there have been some discussions about parallelizing within a segment too. This would definitely help with this sort of use case. Whether we will prioritize request parallelism for `_search` is not yet clear.
If you write your thoughts about aggregations over longer periods of time somewhere, I'll be interested in checking them out.
I think that instead of rolling over by default when a primary shard hits a maximum number of documents, we should enforce this in the rollover action / API. The number of documents in a shard is an important factor in query time for certain queries and aggregations. Also, the memory usage of bitset-based caches can grow linearly with the number of documents in a shard, and there are queries that create such bitsets at query execution time to ensure Lucene doc IDs are emitted in order. Finally, we should never run into the maximum number of documents a Lucene index can hold. So I think for data streams / indices that we manage on behalf of the user, we should guard against those stability issues and force a rollover once we reach a certain number of documents.
I think we can start by setting this built-in limit to 200M, but over time we may want to decrease it. A FixedBitSet for 200M docs still takes ~24MB of heap space.
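For reference, the ~24MB figure follows from a FixedBitSet using one bit per document (my arithmetic):

$$\frac{200 \times 10^{6}\ \text{bits}}{8 \times 2^{20}\ \text{bits/MiB}} \approx 23.8\ \text{MiB} \approx 24\ \text{MB}$$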
Thanks for creating https://github.com/elastic/elasticsearch/pull/94065, @martijnvg.
I think in addition to that, we should also change the defaults in the default policy for metrics-*-*
https://github.com/elastic/elasticsearch/blob/985a2fb9b74e72b9c91517f34d8712018ab0fb61/x-pack/plugin/core/src/main/resources/metrics-policy.json#L5-L8
Once we've done that, the o11y team can make sure to add this as a default for all integrations.
@felixbarny That makes sense to me. I think we just need to add a `max_primary_shard_docs` condition for tsdb, but lower than the built-in 200M defined in #94065.
Based on a quick search, there are a couple of built-in policies:
Should we change all of them or is there a central place where we can set this? Maybe similar to https://github.com/elastic/elasticsearch/pull/94065 but applying that to all data streams, not just for TSDS.
Metric data streams appear to have the biggest risk of having a lot of documents once the data stream reaches 50gb (especially with synthetic source enabled). Unless adding a default for all data streams is imminent, I propose we add `max_primary_shard_docs` to `metrics-policy.json`.
> Should we change all of them or is there a central place where we can set this? Maybe similar to https://github.com/elastic/elasticsearch/pull/94065 but applying that to all data streams, not just for TSDS.
https://github.com/elastic/elasticsearch/pull/94065 has evolved to roll over all data streams, not just TSDS, once they hit 200M documents 🎉 Thanks for driving this, @martijnvg!
I don't see a reason to change any of the built-in or solution-provided ILM policies then.
Hi @felixbarny @martijnvg, drive-by comment (per an internal chat): this is in the release notes but not in the ilm-rollover doc. Should we put it into the below doc too? https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-rollover.html
@rseldner ^ ^
> Should we put it into the below doc too?
Yes, we should do this. Unfortunately I forgot to update the docs when making this change.
Description
Elasticsearch ships with some built-in ILM policies, e.g. there is one called `30-days-default` that looks like this:
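Roughly (a sketch, not the verbatim built-in definition; the exact phases and values of `30-days-default` may differ):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "30d",
            "max_primary_shard_size": "50gb"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```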
I would suggest adding a `max_primary_shard_docs` next to the `max_primary_shard_size`. The actual value needs more discussion, but I'm thinking of something on the order of 200M in order to get a better search experience with space-efficient datasets. So datasets where documents take more than 50GB/200M = 268 bytes would roll over at 50GB, while more space-efficient datasets would roll over at 200M documents. My motivation is the following:
- Time-based queries filter on a timestamp range, and on shards whose data falls entirely within that range the filter can be rewritten to a `match_all` query. Bounding the number of docs per primary shard makes it a bit more likely that some shards fully match the query, and the fact that shards that partially match the range filter have a bounded number of docs also helps bound tail latencies.
- Many dashboards rely on `terms` or `composite` aggregations that need to build global ordinals under the hood. Global ordinals need to be built for the entire shard even though the query might only match a tiny part of it. Bounding the number of docs in a shard also helps bound the time it takes to build global ordinals.