Multi range traversal for numeric range aggregations

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.39k stars 950 forks source link

Multi range traversal for numeric range aggregations #13335

Open jainankitk opened 2 weeks ago

jainankitk commented 2 weeks ago

Description

In OpenSearch, we have introduced multi range traversal for collecting matching document count in single tree traversal (https://github.com/opensearch-project/OpenSearch/pull/13317). That has helped improve the performance of numeric aggregations in OpenSearch significantly. I am wondering if there are other use cases that can benefit from this and change should be included in Lucene.

Constraints:

The ranges are non-overlapping and in increasing order. For example - In (a1,b1),(a2,b2)...(an,bn), it is assumed ai<aj for all i<j and bi < a(i+1) for every i
Field is 1 dimensional guaranteeing the points are stored in increasing order, and segment does not have any deletions

jainankitk commented 2 weeks ago

Looks related to #9814, but there are differences between the two. As discussed offline with @bowenlan and @rishabhmaurya, MultiRangeQuery only tells what matches with your multi ranges, not what matches with each of your multi ranges.

jainankitk commented 4 days ago

@jpountz - Thoughts?

mikemccand commented 3 days ago

I like this optimization. Maybe it best fits in Lucene's facet module? But, I don't think our facet impls today ever use points, directly, to do counting/aggregation -- it's a two step process of first collecting into a bitset holding the matched docs, and, second, iterating those docs and looking doc values or facet ords (also from doc values) and counting/aggregating from there. But in the browse-only cases where a query just wants counts of ranges across all docs in the index, this opto should be a crazy fast way to achieve it when there are no deletions. Even when there are deletions, this opto could visit all docs and check the live docs and count/aggregate accordingly? The time is no longer sub-linear, but it'd still be faster than the two phased approach that Lucene's facets use today?

@stefanvodita / @Shradha26 WDYT?

stefanvodita commented 1 day ago

It's true that we have this two-step process for aggregations (incl. counts) and that it's not always the most efficient solution. +1 to try out this optimisation, sounds promising!