apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.49k stars 986 forks source link

DisjunctionDISIApproximation / DisjunctionScorer can short-circuit #11922

Open gsmiller opened 1 year ago

gsmiller commented 1 year ago

Description

I believe we have an opportunity to improve disjunction evaluation by "short circuiting" within DisjunctionDISIApproximation / DisjunctionScorer. When a disjunction clause does not require scores, we don't need to actually advance all the postings to a common docID in all cases. With #advance(doc), we should be able to short-circuit as soon as we find one postings list that contains the target doc. With #next(), we can do the same thing if we treat the target as "current doc + 1". When confirming a second-phase match, we can also take advantage of this by lazily advancing as necessary until we find a match confirmation (in the worst case, if all two-phase checks fail, we'll still have to advance all the postings that "trail" the target doc).

Of course, any time scoring is provided by the disjunction, we have to advance everything, so this doesn't help there. But we've got some use-cases right now that use large disjunctions as filters, and it would be nice to short-circuit when possible.

I've got a prototype of this idea that I think is functionally correct and will put up a draft PR soon to better illustrate what I'm thinking. In the meantime, please poke holes in this idea if I'm totally off the mark or overlooking something important :)

gsmiller commented 1 year ago

I put up an initial PR of this idea here #11928. It can likely be further optimized for the two-phase case, but I wanted to start simple and see if there are any fundamental flaws with this approach that I'm overlooking. Maybe there's a good reason I'm not thinking of that we should exhaustively advance the sub-iterators (as we do today)?