Closed rquesada-tibco closed 1 month ago
I think the `else` clause for the cost estimate is also not great.
I came across this same problem where a user was essentially running a single-term `TermInSetQuery` (that actually matched a single doc) AND a numeric range query that matched millions of docs. I was shocked when profiler output showed a lot of time spent in the `BKDReader` -- the `IndexOrDocValuesQuery` over the range query should clearly have taken the doc-values path.
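For context, `IndexOrDocValuesQuery` chooses between its two sub-queries by comparing cost estimates, so an inflated cost on one side can push execution down the wrong path. Here is a minimal, self-contained sketch of that kind of cost-based dispatch; the method names and the `/ 8` threshold are illustrative assumptions, not Lucene's actual code:

```java
// Illustrative sketch: a cost-based choice between an index-structure scorer
// (points/terms) and a doc-values scorer, in the spirit of IndexOrDocValuesQuery.
// All names and thresholds here are hypothetical, not Lucene's implementation.
public class CostBasedDispatch {

    /**
     * Prefer the doc-values path when the number of documents the rest of the
     * query will lead with (leadCost) is much smaller than the cost of
     * evaluating the index structure up front.
     */
    static boolean chooseDocValues(long indexCost, long leadCost) {
        // Heuristic: doc values verify one candidate doc at a time, so they
        // win when only a few docs will actually be checked.
        return leadCost < indexCost / 8;
    }

    public static void main(String[] args) {
        // A single-doc lead (cost 1) against a range matching millions of docs:
        // the doc-values path should win.
        System.out.println(chooseDocValues(5_000_000L, 1L));
        // A lead matching most of the index: take the index structure instead.
        System.out.println(chooseDocValues(5_000_000L, 4_000_000L));
    }
}
```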
Let's say you have a string field with 10 million distinct values (so 10 million terms), and they match 20 million documents (with individual terms matching 1-3 docs, say). My read is that this `estimateCost()` function will say that a `TermInSetQuery` over a single term has cost 10,000,001 (i.e. 20 million `sumDocFreq`, minus 10 million for the terms size, plus 1 for the query term count). I get that the absolute worst case is that 9,999,999 terms each have doc freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as a cost estimate for a query that is just going to rewrite to a single `TermQuery` with cost <= 3.
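The arithmetic behind that worst-case bound can be made concrete. A self-contained sketch mirroring the reasoning in this comment (the method name is illustrative; this is not the actual `estimateCost` source):

```java
// Sketch of the worst-case cost bound described above: with Q query terms out
// of T indexed terms and sumDocFreq total postings, assume T - Q of the terms
// each have doc freq 1, leaving all remaining postings for the query terms.
// Names are illustrative, not the real estimateCost implementation.
public class WorstCaseCost {

    static long worstCaseCost(long sumDocFreq, long termsSize, long queryTermCount) {
        return sumDocFreq - termsSize + queryTermCount;
    }

    public static void main(String[] args) {
        // 10M distinct terms, 20M total postings, a single query term:
        // 20,000,000 - 10,000,000 + 1 = 10,000,001.
        System.out.println(worstCaseCost(20_000_000L, 10_000_000L, 1L));
    }
}
```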
I get that part of the point of this cost estimate is to avoid the (potentially-expensive) rewrite if, e.g. we can do a doc-value rewrite instead, but I'm thinking we could do something a little bit more term-aware.
How expensive is `MultiTermQuery#getTermsEnum()`? I think it's cheap, but I'm not positive. Maybe we could get it, try walking through it, summing doc freqs, and giving up if the number of terms gets big. We could go down that path only if the cost estimate from the existing logic is very high. I can try sketching out a PR with a test for it.
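A minimal sketch of that term-aware idea, with the enum modeled as a plain iterator of doc freqs so the snippet stays self-contained (the method name, `termLimit`, and `pessimisticCost` are hypothetical stand-ins, not Lucene API):

```java
import java.util.Iterator;
import java.util.List;

// Sketch of the suggestion above: walk the expanded terms summing doc freqs,
// and bail out to the existing pessimistic estimate once too many terms are
// seen. All names here are illustrative, not actual Lucene code.
public class TermAwareCost {

    static long sumDocFreqsOrGiveUp(Iterator<Integer> docFreqs, int termLimit, long pessimisticCost) {
        long sum = 0;
        int seen = 0;
        while (docFreqs.hasNext()) {
            if (++seen > termLimit) {
                // Too many terms: give up and keep the pessimistic estimate.
                return pessimisticCost;
            }
            sum += docFreqs.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // A single query term with doc freq 3: the precise sum (3) is far
        // below the pessimistic bound, so caching decisions improve.
        System.out.println(sumDocFreqsOrGiveUp(List.of(3).iterator(), 16, 10_000_001L));
        // More terms than the limit: fall back to the pessimistic estimate.
        System.out.println(sumDocFreqsOrGiveUp(List.of(1, 1, 1, 1).iterator(), 2, 10_000_001L));
    }
}
```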
fyi @jboeuf-tibco, @fledigol-tibco
Description
We recently discovered a performance degradation on our project when going from Lucene 9.4 to 9.9. The cause seems to be a side effect of https://github.com/apache/lucene/commit/c6667e709f610669b57a6a1a6ab1cc80f4f0ebaf and https://github.com/apache/lucene/commit/3809106602a9675f4fd217b1090af4505d4ec2a7
The situation is as follows: we have a `WildcardQuery` and a `TermInSetQuery` which are AND-combined (within a `BooleanQuery`). This structure gets executed repeatedly, somewhat like a nested loop where the `WildcardQuery` remains the same but the `TermInSetQuery` keeps changing its terms. In the old version this was fast because the `WildcardQuery` was cached within the `LRUQueryCache`. However, in the new version this is no longer the case, so the execution time of our scenario has increased.

Why is our `WildcardQuery` not cached any more? It boils down to this line in `LRUQueryCache`, where the cache operation won't happen if the cost estimate is too high:

Before the upgrade to 9.9, that cost was provided by a `ConstantScoreWeight` returned by the old `MultiTermQueryConstantScoreWrapper` (which was returned by the default `RewriteMethod`), and was in the end just based on the "default" `Weight#scorerSupplier` implementation: basically the cost was `scorer.iterator().cost()`, and in our case the `WildcardQuery` returns just one document, so cost 1.

After the upgrade, the default `RewriteMethod` has changed, and this cost is now provided by `AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier` here. For that purpose a private `estimateCost` method was introduced, which bases the estimation on the `MultiTermQuery#getTermsCount` value. The problem is that for our `WildcardQuery` (in fact for any subclass of `AutomatonQuery`) this value is unknown, i.e. `-1`, so the `estimateCost` method just returns `terms.getSumDocFreq()`. That is a clear overestimation in our case, so it prevents the caching and leads to a performance degradation.

I understand that I can fix this situation by writing my own customized `RewriteMethod`. The question is: could we improve `AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier#cost` so that, if the MTQ cannot provide a term count (`getTermsCount() == -1`), we return `scorer.iterator().cost()` instead?
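A minimal sketch of that proposed fallback, reduced to the cost arithmetic so it stays self-contained (the class and method names are hypothetical; the real change would live inside `AbstractMultiTermQueryConstantScoreWrapper`):

```java
// Sketch of the proposed fallback: when the MTQ cannot report its term count
// (getTermsCount() == -1), defer to the iterator's own cost instead of the
// sumDocFreq overestimate. All names below are illustrative, not Lucene code.
public class CostFallback {

    static long cost(long queryTermsCount, long sumDocFreq, long iteratorCost) {
        if (queryTermsCount == -1) {
            // Unknown term count (e.g. WildcardQuery / any AutomatonQuery):
            // sumDocFreq grossly overestimates, so use the iterator cost.
            return iteratorCost;
        }
        // The existing term-count-based estimate would go here; this sketch
        // just keeps the pessimistic bound as a placeholder.
        return sumDocFreq;
    }

    public static void main(String[] args) {
        // A WildcardQuery matching one doc: cost 1 instead of 20M, so the
        // LRUQueryCache cost threshold is no longer tripped.
        System.out.println(cost(-1L, 20_000_000L, 1L));
    }
}
```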