CommonsTermsQuery with huge no. of terms slower with top-k scoring [LUCENE-9107]

asfimport commented 4 years ago

In [1] a CommonTermsQuery is used in order to perform a query with lots of (duplicate) terms. Using a max term frequency cutoff of 0.999 for low frequency terms, the query, although big, finishes in around 2-300ms with Lucene 7.6.0. However, when upgrading the code to Lucene 8.x, the query runs in 2-3s instead [2]. After digging a bit into it it seems that the regression in speed comes from the fact that top-k scoring introduced by default in version 8 is causing that, not sure "where" exactly in the code though. When switching back to complete hit scoring [3], the speed goes back to the initial 2-300ms also in Lucene 8.3.x. It'd be nice to understand the reason why this is happening and if it is only concerning CommonTermsQuery or affecting BooleanQuery as well. If this is a case that depends on the data and application involved (Anserini in this case), the application should handle it, otherwise if it is a regression/bug in Lucene it'd be nice to fix it.

[1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java [2] : https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java [3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174

Screenshot 2020-08-07 at 16.20.01.png

Migrated from LUCENE-9107 by Tommaso Teofili (@tteofili), updated Aug 07 2020 Attachments: image-2020-08-07-16-54-27-905.png, Screenshot 2020-08-07 at 16.20.01.png, Screenshot 2020-08-07 at 16.20.05.png

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

CommonTermsQuery probably makes the issue worse by having clauses on multiple levels of boolean queries (see e.g. how the nested boolean queries perform worse than single-level boolean queries in the nightly benchmarks http://people.apache.org/\~mikemccand/lucenebench/AndMedOrHighHigh.html), but this is an issue with BooleanQuery too. We have complex logic that tries to skip as many hits as possible, but when this logic is defeated, which is typically the case when

there are lots of clauses,
or clauses have about the same max scores,
or maximum score upper bounds are highly overestimated (ClassicSimilarity might contribute a bit here too), then we need to pay the price for this overhead without getting any benefits.

What latency do you get if you run a pure disjunction with these clauses instead of a CommonTermsQuery?

asfimport commented 4 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

thanks Adrien for looking into this, I've tried with a pure disjunction (BooleanQuery) and the numbers are about the same as with CommonTermsQuery. ClassicSimilarity slowness contribution is non trivial: top-k scoring with ClassicSimilarity ranges 2 to 2.5 seconds, whereas it ranges 1.5 to 2 seconds with BM25Similarity.

asfimport commented 4 years ago

Vincenzo D'Amore (@freedev) (migrated from JIRA)

Hi, I did a little step further trying to identify the difference of performance using CommonTermsQuery with different versions of Solr (7.3.1 vs 8.6.0).

In this fork of anserini repo branch test_8.6.0 https://github.com/freedev/anserini/blob/test_8.6.0

There I was trying the ann sample, here the steps to reproduce the problem: copy and build

git clone [https://github\.com/freedev/anserini\.git] git checkout test_8\.6\.0 mvn -Prelease clean package

create the lucene index

java -cp target/anserini-0\.9\.5-SNAPSHOT-fatjar\.jar io\.anserini\.ann\.IndexVectors -input glove\.6B\.300d\.txt -path glove300-idx-8\.6\.0 -encoding fw

reproduce the issue (the vector used for the world apple is hardcoded into the ApproximateNearestNeighborSearch main)

java -cp target/anserini-0\.9\.5-SNAPSHOT-fatjar\.jar io\.anserini\.ann\.ApproximateNearestNeighborSearch -input glove\.6B\.300d\.txt -path glove300-idx-8\.6\.0 -encoding fw -word apple

This is the VisualVM Sampler output after having monitored ApproximateNearestNeighborSearch with Java Flight Recorder

Changing the line 186 in ApproximateNearestNeighborSearch

from:

TopScoreDocCollector.create(indexArgs.depth, 0);

to:

TopScoreDocCollector.create(indexArgs.depth, Integer.MAX_VALUE);

greately reduces the time spent (from \~2 sec to 3-400 milliseconds), see the screenshot:

Screenshot 2020-08-07 at 16.20.05.png

apache / lucene

CommonsTermsQuery with huge no. of terms slower with top-k scoring [LUCENE-9107] #10149