apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.72k stars 1.04k forks source link

Spans tree scoring [LUCENE-7580] #8631

Closed asfimport closed 6 years ago

asfimport commented 7 years ago

Recurse the spans tree to compose a score based on the type of subqueries and what matched


Migrated from LUCENE-7580 by Paul Elschot, 1 vote, resolved Apr 02 2018 Attachments: Elschot20170326Counting.pdf, LUCENE-7580.patch (versions: 4)

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

"Recurse the spans tree to compose a score based on the type of subqueries ... and what matched" was suggested in September 2007 on the java-user list http://www.gossamer-threads.com/lists/lucene/java-user/53027 .

Currently SpanScorer provides score values that have no real meaning when more than one SpanTermQuery is used.

Patch follows.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

In the patch, SpansTreeQuery is a wrapper for SpanQuery that uses basically the same scoring as the scoring for other queries. When all term occurrences match at top level or at 0 distance the score is the same as the score for a boolean OR over the terms, independently of the Similarity that is used. SpansTreeScorer scores each query term matching occurrence, and it applies discounts for non matching terms and for distance matches. It also uses weights of subqueries.

The matching occurrences are recorded per document in the spans tree at each top level match of a document. For each match SpansTreeScorer descends the tree down to the leaf level of the terms of each match. SpansDocScorer objects are used as the tree nodes, there is one for each supported Spans.

Each matching term occurrence is recorded with a slop factor. At the top level this slop factor is normally 1, and for each span near nesting level the slop factor at the match is multiplied into this.

The term frequency scoring from the Similarity is used per matching term occurrence, and these term occurrence scores are weighted by the slop factors sorted in decreasing order. The purpose of using the given slop factors in decreasing order is to provide scoring consistency between span near queries that only differ in the maximum allowed slop. This consistency requires that an extra match with a lower slop increases the score of the document. I would expect scoring to be consistent this way, but I'm not 100% sure.

The non matching term occurrences get a score that is the difference of the normal document term frequency score and the term frequency score for the matching terms. This non matching score is weighted by the slop factor of a non matching distance. The non matching distance is a parameter that must be provided. This non matching distance can for example be chosen as a little larger than the largest distance used in the span near queries that are wrapped.

SpansTreeQuery is implemented for any combination of SpanNearQuery, SpanOrQuery, SpanTermQuery, SpanBoostQuery, SpanNotQuery, SpanFirstQuery, SpanContainingQuery and SpanWithinQuery.

See the javadocs and the test code on how to use SpansTreeQuery.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

SpansTreeQuery is implemented as a wrapper in order to change the existing code as little as possible. But it was necessary to take DisjunctionSpans out of SpanOrQuery. In DisjunctionSpans there are only additions for inspection at a match, otherwise it is the same as in the current SpanOrQuery.

Changes to the current code are mostly additions to allow inspection of matches:

The only existing state that is changed is the use of needsScores (instead of the current false) for weights of subqueries of SpanOrQuery and SpanNearQuery and for the weight of the included subquery of SpanNotQuery.

All core tests pass with the patch applied on the master branch. Ant precommit also passes.

There is a correction to the javadocs of Similarity.Simscorer on the use of float for term frequencies.

The patch also adds a constructor for SpanOrQuery with an extra parameter maxDistance. When wrapped in a SpansTreeQuery, this SpanOrQuery will provide a slop factor at each match that is determined by the minimum distance between any two subspans where possible, and this distance is maximized to the given maxDistance. The class DisjunctionNearSpans and its SpansDocScorer implement this.

All score calculations are done with doubles. Most of the additions have public/protected visibility in order to allow easy extension.

In case there is interest in back porting this, a patch for branch_6x can be made available. The tests on branch_6x disable the coordination in BooleanQuery and they only use the BM25 similarity.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

Some related issues, thanks for these discussions:

1611

LUCENE-2878

3953

LUCENE-2880

7288

LUCENE-6371

7525

LUCENE-7398

Some related web pages:

http://www.gossamer-threads.com/lists/lucene/java-user/33902 March 2006.

http://www.gossamer-threads.com/lists/lucene/java-user/53027 September 2007, suggests to: "recurse the spans tree to compose a score based on the type of subqueries (near, and, or, not) and what matched."

http://www.gossamer-threads.com/lists/lucene/java-user/60103 April 2008.

http://www.flax.co.uk/blog/2016/04/26/can-make-contribution-apache-solr-core-development/ see point 4.

How to use BM25: http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

What SpansTreeQuery does not do, and some rough edges:

The SpansDocScorer objects do the match recording and scoring, and there is one for each Spans. These SpansDocScorer objects might be merged into their Spans to reduce the number of objects. Related: how to deal with the same term occurring in more than one subquery? See also #8451.

Normally the term frequency score has a diminishing contribution for extra occurrences. In the patch the slop factors for a term are applied in decreasing order on these diminished contributions. This requires sorting of the slop factors. Sorting the slop factors could be avoided when an actual score of a single term occurrence was available. In that case the given slop factor could be used as a weight on that score. It might be possible to estimate an actual score for a single term occurrence from the distances to other occurrences of the same term. Similarly, the decreasing term frequency contributions can be seen as a proximity weighting for the same term (or subquery): the closer a term occurs to itself, the smaller its contribution. This might be refined by using the actual distances to other the term occurrences (or subquery occurrences) to provide a weight for each term occurrence. This is unusual because the weight decreases for smaller distances.

The slop factor from the Similarity may need to be adapted because of the way it is combined here with diminishing term contributions.

Another use of a score of each term occurrence could be to use the absolute term position to influence the score, possibly in combination with the field length.

There is an assert in TermSpansDocScorer.docScore() that verifies that the smallest occurring slop factor is at least as large as the non matching slop factor. This condition is necessary for consistency. Instead of using this assert, this condition might be enforced by somehow automatically determining the non matching slop factor.

This is a prototype. No profiling has been done, it will take more CPU, but I have no idea how much. Garbage collection might be affected by the reference cycles between the SpansDocScorers and their Spans.

Since this allows weighting of subqueries, it might be possible to implement synonym scoring in SpanOrQuery by providing good subweights, and wrapping the whole thing in SpansTreeQuery. The only thing that might still be needed then is a SpansDocScorer that applies the SimScorer.score() over the total term frequency of the synonyms in a document.

SpansTreeScorer multiplies the slop factor for nested near queries at each level. Alternatively a minimum distance could be passed down. This would need to change recordMatch(float slopFactor) to recordMatch(int minDistance). Would minDistance make sense, or is there a better distance?

What is a good way to test whether the score values from SpansTreeQuery actually improve on the score values from the current SpanScorer?

There are no tests for SpanFirstQuery/SpanContainingQuery/SpanWithinQuery. These tests are not there because these queries provide FilterSpans and that is already supported for SpanNotQuery.

The explain() method is not implemented for SpansTreeQuery. This should be doable with an explain() method added to SpansTreeScorer to provide the explanations.

There is no support for PayloadSpanQuery. PayloadSpanQuery is not in here because it is not in the core module. I think it can fit here in because PayloadSpanQuery also scores per matching term occurrence. Then Spans.doStartCurrentDoc() and Spans.doCurrentSpans() could be removed.

In case this is acceptable as a good way to score Spans: Spans.width() and Scorer.freq() and SpansDocScorer.docMatchFreq() might be removed. Would it make sense to implement child Scorers in the tree of SpansDocScorer objects?

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

Patch of 11 Dec 2016.

Add automatically determining a weight for non matching terms.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

Compared to the previous patch, this adds a nonMatchSlop attribute to SpanNearQuery, and drops the nonMatchSlopFactor argument from SpansTreeQuery.

nonMatchSlop is the distance for determining a slop factor that is to be used for non matching occurrences of a SpanNearQuery. Smaller values for this distance will increase the score contribution of non matching occurrences via SimScorer.computeSlopFactor()

But smaller values for this distance, i.e. higher score contribution of non matching occurrences, may lead to a scoring inconsistency between two span near queries that only differ in the allowed slop. For example consider query A with a smaller allowed slop and query B with a larger one. For query B there can be more matches, and these should increase the score of B when compared to the score of A. So for each extra match at B, the non matching score for query A should be lower than the matching score for query B. This may not be the case when the non matching score contribution is too high.

To have consistent scoring between two such queries, choose a non matching slop that is larger than the largest allowed match slop, and provide that non matching slop to both queries. In case this consistency is not needed, nonMatchSlop can be chosen to be somewhat larger than the maximum allowed match slop.

This nonMatchSlop is used in SpansTreeWeight to compute a minimal nested slop factor from the maximum possible slops that can occur in a SpanQuery for the nested SpanNearQueries and for nested SpanOrQueries with distance. Finally, this minimal nested slop factor is used as the weight for scoring non matching terms.

The default nonMatchSlop for SpanNearQuery is large, Integer.MAX_VALUE/2. Therefore by default non matching occurrences have no real score contribution.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

Some scientific articles on this subject:

Metzler, Donald, and W. Bruce Croft. "A Markov random field model for term dependencies." Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2005.

In section 2.3 they use terms and ordered and unordered phrases The ranking function is a weighted linear combination for these. The optimal weights are about 80/10/10 for simple terms, unordered, and ordered. Here this led to the use of a weighting factor non matching occurrences. They also found that the minimum distance is the best indicator of relevance.

Bendersky, Michael, and W. Bruce Croft. "Modeling Higher-Order Term Dependencies in Information Retrieval using Query Hypergraphs" SIGIR'12.

The concepts there can be nested, like span queries. The approach there is much more general. For example:

Blanco, Roi, and Paolo Boldi. "Extending BM25 with multiple query operators." Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012.

This scores regions with BM25F.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

I have started using this in the tests for the surround query language, I'll open an issue for that later.

This found a bug in ConjunctionNearSpansDocScorer.recordMatch(). A -1 slop can occur when the same term is used twice in a SpanNearQuery, and this causes a division by zero in computing the slop factor in from recordMatch(). This can be easily avoided by using 0 slop in such cases.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

Patch of 29 Dec 2016.

Compared to the previous patch, this adds:

Limiting the max allowed slop to Integer.MAX_VALUE-1 in the SpanNearQuery constructor and in TestSpanSearchEquivalence. An actual slop of Integer.MAX_VALUE causes an overflow in distance+1 that is used in computeSlopFactor. Since the same limitation is already present for indexed positions, I would not expect this slop factor miscalculation to actually occur.

The negative slops that occur for overlapping spans are changed to 0 before passing them to computeSlopFactor in NearSpansDocScorer in the patch here.

The non match distance passed to SpanNearQuery in the patch is verified to be at least the given slop.

A wrapper method SpansTreeScorer.wrap() is added that will wrap the span (subqueries of a) given query in a SpansTreeQuery. This works for span subqueries of BooleanQuery, DisjunctionMaxQuery and BoostQuery.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

Patch of 6 Jan 2017.

This contains:

The changes in the patch of 30 Dec 2016.

Support for SpanSynonymQuery, see SynonymSpans and SynonymSpansDocScorer.

Class AsSingleTermSpansDocScorer as common superclass for TermSpansDocScorer and SynonymSpansDocScorer. This is the place where matching and non matching term occurrences are scored with a SimScorer from Similarity while taking into account the slop factors.

Method SpansTreeQuery.wrapAfterRewrite() to use SpansTreeQuery.wrap() at the right moment.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

SpanSynonymQuery is unusual here because it uses a single SpansDocScorer per segment, independent of the number of synonym terms.

Since the TermSpans for SynonymSpans are Spans without a SpansDocScorer it makes some sense not to merge Spans and SpansDocScorer later.

asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user PaulElschot opened a pull request:

https://github.com/apache/lucene-solr/pull/166

LUCENE-7580 of 8 Mar 2017.

Resolves a conflict with recent simplification of NearSpanUnordered.
Includes recent SpanSynonymQuery.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PaulElschot/lucene-solr lucene7580-20170308

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/166.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#166`


asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

I just pushed two branches to github, pullable as:

git pull https://github.com/PaulElschot/lucene-solr lucene7580-20170326

and

git pull https://github.com/PaulElschot/lucene-solr lucene7580report-20170326

The lucene7580-20170326 branch is an update of the previous pull request with a few minor improvements. Most notable is putting SpansTreeWeight into its own source file.

The lucene7580report-20170326 branch is on top of the lucene7580-20170326 branch, with the addition of the tex sources for a report on this issue. I'll attach the pdf shortly here.

asfimport commented 7 years ago

Paul Elschot (migrated from JIRA)

Report of 26 March 2017, generated from the lucene7580report-20170326 branch and renamed to include the full date in the name.

asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot closed the pull request at:

https://github.com/apache/lucene-solr/pull/166
asfimport commented 7 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot commented on the issue:

https://github.com/apache/lucene-solr/pull/166

Superseded earlier today.
asfimport commented 6 years ago

Paul Elschot (migrated from JIRA)

Resolved: not enough interest. I'll keep the github branches available for now.