apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation #13029

Closed. rquesada-tibco closed this issue 1 month ago.

rquesada-tibco commented 8 months ago

Description

We recently discovered a performance degradation in our project when going from Lucene 9.4 to 9.9. The cause seems to be a side effect of https://github.com/apache/lucene/commit/c6667e709f610669b57a6a1a6ab1cc80f4f0ebaf and https://github.com/apache/lucene/commit/3809106602a9675f4fd217b1090af4505d4ec2a7.

The situation is as follows: we have a WildcardQuery and a TermInSetQuery that are AND-combined within a BooleanQuery. This structure gets executed repeatedly, kind of like a nested loop where the WildcardQuery remains the same but the TermInSetQuery keeps changing its terms (sketched below). In the old version this was fast because the WildcardQuery was cached in the LRUQueryCache. However, in the new version it is no longer cached, so the execution time of our scenario has increased.
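To make the shape concrete, here is a minimal sketch of the scenario as I understand it (the field names, terms, and batching loop are invented for illustration, not our actual code):

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.BytesRef;

class RepeatedWildcardScenario {
  static void run(IndexSearcher searcher, List<List<BytesRef>> idBatches) throws IOException {
    // The wildcard clause is identical on every iteration, so caching it
    // in the LRUQueryCache pays off.
    Query wildcard = new WildcardQuery(new Term("path", "/tenant-42/*"));
    for (List<BytesRef> batch : idBatches) {
      // Only the TermInSetQuery changes between iterations.
      Query query = new BooleanQuery.Builder()
          .add(wildcard, Occur.MUST)
          .add(new TermInSetQuery("id", batch), Occur.MUST)
          .build();
      searcher.search(query, 10);
    }
  }
}
```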

Why is our WildcardQuery no longer cached? It boils down to this check in LRUQueryCache, where the caching is skipped if the cost estimate is too high:

```java
final ScorerSupplier supplier = in.scorerSupplier(context);
// ...
final long cost = supplier.cost();
// ...
// skip cache operation which would slow query down too much
if (cost / skipCacheFactor > leadCost) {
  // ...
```

Before the upgrade to 9.9, that cost was provided by a ConstantScoreWeight returned by the old MultiTermQueryConstantScoreWrapper (which the default RewriteMethod produced). That weight relied on the default Weight#scorerSupplier implementation, so the cost was simply scorer.iterator().cost(); in our case the WildcardQuery matches just one document, so the cost was 1.
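For reference, the default Weight#scorerSupplier behaves roughly like this sketch (paraphrased, so details may differ from the actual Lucene source): it builds the Scorer up front and reports the iterator's own cost.

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.ScorerSupplier;

// Paraphrase of the default scorerSupplier in org.apache.lucene.search.Weight
// (not the exact source):
abstract class DefaultCostWeight {
  abstract Scorer scorer(LeafReaderContext context) throws IOException;

  public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
    final Scorer scorer = scorer(context);
    if (scorer == null) {
      return null;
    }
    return new ScorerSupplier() {
      @Override
      public Scorer get(long leadCost) {
        return scorer;
      }

      @Override
      public long cost() {
        // For a WildcardQuery matching a single document this is 1.
        return scorer.iterator().cost();
      }
    };
  }
}
```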

After the upgrade, the default RewriteMethod has changed: the cost is now provided by AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier, which uses a new private estimateCost method based on the MultiTermQuery#getTermsCount value. The problem is that for our WildcardQuery (in fact, for any subclass of AutomatonQuery) this value is unknown, i.e. -1, so estimateCost just returns terms.getSumDocFreq(). In our case that is a gross overestimation, which prevents the caching and leads to the performance degradation.
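The estimation logic described above boils down to something like the following sketch (paraphrased from my reading of the 9.9 code, so details may differ from the actual source):

```java
import java.io.IOException;
import org.apache.lucene.index.Terms;

class CostEstimation {
  // Paraphrase of the private estimateCost in
  // AbstractMultiTermQueryConstantScoreWrapper; queryTermsCount comes from
  // MultiTermQuery#getTermsCount.
  static long estimateCost(Terms terms, long queryTermsCount) throws IOException {
    if (queryTermsCount == -1) {
      // The MTQ cannot report how many terms it expands to (the case for
      // WildcardQuery and other AutomatonQuery subclasses), so fall back
      // to the total doc freq of the whole field.
      return terms.getSumDocFreq();
    }
    // Otherwise: at least one doc per query term, plus however many extra
    // docs the indexed terms could contribute beyond their first one.
    long potentialExtraCost = terms.getSumDocFreq();
    long indexedTermCount = terms.size();
    if (indexedTermCount != -1) {
      potentialExtraCost -= indexedTermCount;
    }
    return queryTermsCount + potentialExtraCost;
  }
}
```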

I understand that I can fix this situation by writing my own customized RewriteMethod. The question is: could we improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier#cost so that, if the MTQ cannot provide a term count (getTermsCount() == -1), it returns scorer.iterator().cost()?

msfroh commented 6 months ago

I think the else clause for the cost estimate is also not great.

I came across this same problem where a user was essentially running a single-term TermInSetQuery (that actually matched a single doc) AND a numeric range query that matched millions of docs. I was shocked when profiler output showed a lot of time spent in the BKDReader -- the IndexOrDocValuesQuery wrapping the range query should clearly have taken the doc-values path.
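For context, that pattern looks roughly like this (field name and bounds are made up): an IndexOrDocValuesQuery lets the search pick the points-based or doc-values execution of the range depending on the lead cost of the other clauses.

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

class RangeClause {
  static Query priceRange() {
    // Index (points) side: fast when the range leads the iteration.
    Query points = LongPoint.newRangeQuery("price", 0L, 100L);
    // Doc-values side: preferable when a cheap clause (here, the
    // single-term TermInSetQuery) leads and the range only verifies hits.
    Query docValues = NumericDocValuesField.newSlowRangeQuery("price", 0L, 100L);
    return new IndexOrDocValuesQuery(points, docValues);
  }
}
```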

Let's say you have a string field with 10 million distinct values (so 10 million terms), and they match 20 million documents (with individual terms matching 1-3 docs, say). My read is that this estimateCost() function will say that a TermInSetQuery over a single term has cost 10,000,001 (i.e. 20 million sumDocFreq minus 10 million for the terms size, plus 1 for the query term count).

I get that the absolute worst case is that 9,999,999 terms each have doc freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as a cost estimate for a query that is just going to rewrite to a single TermQuery with cost <= 3.

msfroh commented 6 months ago

I get that part of the point of this cost estimate is to avoid the (potentially expensive) rewrite if, e.g., we can do a doc-values rewrite instead, but I'm thinking we could do something a little more term-aware.

How expensive is MultiTermQuery#getTermsEnum()? I think it's cheap, but I'm not positive. Maybe we could get it and try walking through it, summing doc freqs and giving up if the number of terms gets too big.

We could go down that path only if the cost estimate from the existing logic is very high. I can try sketching out a PR with a test for it.
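Something like this rough sketch of the idea (the class name, method name, and visit limit are all made up, and it assumes MultiTermQuery#getTermsEnum(Terms) stays cheap to obtain):

```java
import java.io.IOException;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.MultiTermQuery;

class TermAwareCost {
  // Walk the query's TermsEnum, summing doc freqs, and give up once we
  // have visited more terms than the (made-up) limit.
  static long estimateCostByVisitingTerms(
      MultiTermQuery query, Terms terms, long fallbackCost, int maxVisitedTerms)
      throws IOException {
    TermsEnum termsEnum = query.getTermsEnum(terms);
    long cost = 0;
    int visited = 0;
    while (termsEnum.next() != null) {
      if (++visited > maxVisitedTerms) {
        // Too many terms: keep the existing (pessimistic) estimate.
        return fallbackCost;
      }
      cost += termsEnum.docFreq();
    }
    return cost;
  }
}
```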

rquesada-tibco commented 4 months ago

fyi @jboeuf-tibco, @fledigol-tibco