apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

TopTermsBlendedFreqScoringRewrite should use SynonymQuery [LUCENE-8840] #9884

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite method for Fuzzy queries, uses the BlendedTermQuery to score documents that match the fuzzy terms. This query blends the frequencies used for scoring across the terms and creates a disjunction of all the blended terms. This means that each fuzzy term that match in a document will add their BM25 score contribution. We already have a query that can blend the statistics of multiple terms in a single scorer that sums the doc frequencies rather than the entire BM25 score: the SynonymQuery. Since #9698 this query also handles boost between 0 and 1 so it should be easy to change the default rewrite method for Fuzzy queries to use it instead of the BlendedTermQuery. This would bound the contribution of each term to the final score which seems a better alternative in terms of relevancy than the current solution.


Migrated from LUCENE-8840 by Jim Ferenczi (@jimczi), 1 vote, updated Jun 12 2019 Attachments: LUCENE-8840.patch

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Here is a patch that replaces the BlendedTermQuery with the SynonymQuery in the TopTermsBlendedFreqScoringRewrite. The rewrite method can assign a boost of 0 to some terms so I added the support for it in the SynonymQuery.

I also ran a benchmark on luceneutil to evaluate the impact:

Report after iter 19:
                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                  Fuzzy1       66.83     (17.5%)       45.44     (11.8%)  -32.0% ( -52% -   -3%)
                  Fuzzy2       54.71     (17.3%)       41.18     (12.9%)  -24.7% ( -46% -    6%)

The impact in terms of performance is quite big but it is expected since the SynonymQuery needs to merge the impacts of the different terms while the BlendedTermQuery considers each term independently. However I don't think this should prevent this change since the scoring for fuzzy terms should be greatly improved with the new approach and we can look at improving the SynonymQuery in a follow up.

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

I am curious to understand how including doc frequencies can be better than the overall score. IMO, including BM25 scores gives us some additional advantages, such as defending against cases where the overall non matching token count in a document is significantly high. Did you see any scenarios that had relevance troubles due to inclusion of entire BM25 scores?

 

IMO if we want to restrict the contribution of each term to the blended query's final score, then we could think of a blended scorer step which utilizes something on the lines of BM25's term frequency saturation when merging scores from different blended terms. WDYT?

 

On a different note, I am also wondering if we should devise relevance tests which allow us to measure the relevance impact of a change. Something added to luceneutil should be nice. Thoughts?

asfimport commented 5 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I am curious to understand how including doc frequencies can be better than the overall score. IMO, including BM25 scores gives us some additional advantages, such as defending against cases where the overall non matching token count in a document is significantly high. Did you see any scenarios that had relevance troubles due to inclusion of entire BM25 scores?

The idea of the SynonymQuery is to score the terms as if they were indexed as a single term. I think this fits nicely with the fuzzy query. For instance imagine a fuzzy query with the terms "bad" and "baz". With the current solution if a document contains both terms it will rank significantly higher than documents that contain only one of them. This can change depending on the inner doc frequencies but this doesn't seem right IMO. On the contrary the synonym query would give the same score to documents containing "baz" with a frequency of 4 than another document containing "bad" and "baz" 2 times. This feels more natural to me because we shouldn't favor documents that contain multiple variations of the same fuzzy term.

On a different note, I am also wondering if we should devise relevance tests which allow us to measure the relevance impact of a change. Something added to luceneutil should be nice. Thoughts?

That would be great but this doesn't look like a low hanging fruit. Maybe open a separate issue to discuss ?

IMO if we want to restrict the contribution of each term to the blended query's final score, then we could think of a blended scorer step which utilizes something on the lines of BM25's term frequency saturation when merging scores from different blended terms. WDYT?

I am not sure I fully understand but the SynonymQuery kind of does that. It sums the inner doc frequencies of all matching terms to ensure that the contribution of each term to the final score is bounded.

asfimport commented 5 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

we shouldn't favor documents that contain multiple variations of the same fuzzy term.

 

For fuzzy I agree that rewarding more variations in a doc is probably undesirable - a doc will normally pick one spelling for a word and use it consistently so any variations are more likely to be false positives (your baz/bad example). Plurals and other forms of suffix would be a notable exception but I don't think that's too much of a problem because:

  1. we can assume that stemming is taking care of normalizing these tokens.
  2. a lot of fuzzy querying is for things like people names that aren't expressed as plurals or with other common suffixes

 

I think all forms of automatic expansions (synonym, fuzzy, wildcard) need a form of score blending for the expansions they create. Wildcards are perhaps unlike fuzzy in that finding multiple variations in a doc is desirable - we are looking for multiple forms and a document that contains many is better than few.