elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.21k stars 24.85k forks source link

Dynamic pruning for aggregations? #85759

Open jpountz opened 2 years ago

jpountz commented 2 years ago

Description

Some queries do not need to look at all matches to return results: while matches are being collected, they can take advantage of the hits that have been collected so far to skip over uninteresting matches. This is what Elasticsearch does for sorted queries for instance, if you want to get the top 100 most recent hits by descending @timestamp and the 100th hit you collected so far has a timestamp of 2022-04-08, then you know you only need to look at hits whole timestamp is greater than or equal to this value, and it's easy to narrow down the set of documents to evaluate thanks to index structures.

Dynamic pruning has the potential of yielding massive speedups. See for instance annotation DD on Lucene's nightly benchmarks for queries sorted by date, dynamic pruning made that query 7.5x faster.

Aggregations have some logic to short-circuit collection entirely already. This is even more efficient than dynamic pruning but has the downside of usually only being applicable to very specific queries, e.g. queries that rewrite to a match_all. On the other hand, dynamic pruning applies to any query.

This made me want to think more of whether dynamic pruning could be applied to aggregations:

TSDB is also going to introduce more opportunities for dynamic pruning thanks to its index sort. For instance, cardinality aggregations on dimensions could leverage dynamic pruning regardless of the field's cardinality (https://github.com/elastic/elasticsearch/issues/85523) and top_metrics by descending @timestamp field could perform even better thanks to the secondary sort on @timestamp: getting the last sample for each timeseries would be very cheap.

Maybe we could also think of aggregations that we do not have that could be interesting mostly because they could leverage dynamic pruning. For instance, listing unique values of a field, which is like terms without the doc_count. Such an aggregation could implement dynamic pruning for the same reasons that the cardinality aggregation could, and could help answer questions like "What are all the hosts that sent data within the last hour" very efficiently.

elasticmachine commented 2 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)