Dynamic pruning for aggregations?

Description

Some queries do not need to look at all matches to return results: while matches are being collected, they can take advantage of the hits that have been collected so far to skip over uninteresting matches. This is what Elasticsearch does for sorted queries for instance, if you want to get the top 100 most recent hits by descending @timestamp and the 100th hit you collected so far has a timestamp of 2022-04-08, then you know you only need to look at hits whole timestamp is greater than or equal to this value, and it's easy to narrow down the set of documents to evaluate thanks to index structures.

Dynamic pruning has the potential of yielding massive speedups. See for instance annotation DD on Lucene's nightly benchmarks for queries sorted by date, dynamic pruning made that query 7.5x faster.

Aggregations have some logic to short-circuit collection entirely already. This is even more efficient than dynamic pruning but has the downside of usually only being applicable to very specific queries, e.g. queries that rewrite to a match_all. On the other hand, dynamic pruning applies to any query.

This made me want to think more of whether dynamic pruning could be applied to aggregations:

min and max could leverage dynamic pruning, since they are very similar to sorted queries.
geo_bounds might be able to leverage dynamic pruning by filtering documents that are outside of the bounds that have been computed so far.
top_hits and top_metrics can leverage dynamic pruning similarly to sorted queries.
cardinality aggregations could skip all documents whose values have already been seen. In practice this might only work well on fields that have cardinalities in the order of tens though, otherwise the filter that matches all unseen values might be more costly than collecting all matches naively.

TSDB is also going to introduce more opportunities for dynamic pruning thanks to its index sort. For instance, cardinality aggregations on dimensions could leverage dynamic pruning regardless of the field's cardinality (https://github.com/elastic/elasticsearch/issues/85523) and top_metrics by descending @timestamp field could perform even better thanks to the secondary sort on @timestamp: getting the last sample for each timeseries would be very cheap.

Maybe we could also think of aggregations that we do not have that could be interesting mostly because they could leverage dynamic pruning. For instance, listing unique values of a field, which is like terms without the doc_count. Such an aggregation could implement dynamic pruning for the same reasons that the cardinality aggregation could, and could help answer questions like "What are all the hosts that sent data within the last hour" very efficiently.

elastic / elasticsearch

Dynamic pruning for aggregations? #85759

Description