apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.73k stars 1.05k forks source link

TermsQuery as MultiTermQuery can dramatically overestimate its cost #12483

Open romseygeek opened 1 year ago

romseygeek commented 1 year ago

Description

I've run across an interestingly adversarial setup for the new TermsQuery implementation. We have an index that is using a block-join structure, with the parent document being a merchant record and the children being individual transactions. Doing a range query filtered by some specific merchant ids turns out to be surprisingly slow. The range query in particular takes a long time, because it covers a large portion of the document space but can't use some of our shortcut heuristics (eg LUCENE-7641 that will invert the search to find docs that don't match) because the parent docs don't have the timestamp field. In combination with the filter, though, I would have expected things to still be quick, because the range query is using IndexOrDocValuesQuery and so a filter that narrows the search down to a fraction of the index ought to select the doc-by-doc checking path. However, it turns out that the cost estimation code in AbstractMultiTermQueryConstantScoreWrapper will calculate a very large cost if the field you're filtering on isn't an ID field - in this case, we have a sort of mid-level cardinality where each value represents a few percent of the index, which ends up yielding a cost estimate of the total number of docs in the index minus 100 or so. Explicitly using a boolean disjunction instead of a TermsQuery yields a much more accurate cost estimate and correctly selects doc-by-doc range checking, giving a much more performant query.

We used to automatically rewrite TermsQuery to a simple boolean disjunction if there were fewer than 16 terms. I wonder if the more complex machinery we are using now is overkill for these small term sets, and we should just go back to this simple rewrite in those cases?

gsmiller commented 3 months ago

Hey @romseygeek - there's been a recent improvement to AbstractMultiTermQueryConstantScoreWrapper that may help the use-case you've described here (#13201). I'm curious if this solves the performance regression you're seeing, or if you still have issues. This change will only help when there are 16 or fewer terms in the terms query, so if you have more than that, I suspect you won't see any improvements. The change was just merged onto main and branch_9x today, so you'd have to cherry-pick it down (or wait for 9.12).

romseygeek commented 3 months ago

Hey @gsmiller, thanks for the ping! I'm working elsewhere now, but IIRC @mayya-sharipova was looking at similar issues in the past so may be able to reproduce the issue and see if this has improved things?

gsmiller commented 3 months ago

Got it, thanks @romseygeek

gsmiller commented 3 months ago

@mayya-sharipova do you have any additional information on this regression by chance? I know you did some work back in #13454 to help in this space, then @msfroh did work in #13201 to build on your change for better cost estimation. There's definitely still a situation where we dramatically over-estimate cost (when there are > 16 terms and it's not a PK-like field), but I'd be curious if we have real-world scenarios where we experience this that might motivate us to continue pursuing better cost estimation strategies. Thoughts?