apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

PassageScorer could take proximity of terms into account [LUCENE-10011] #11050

Open asfimport opened 3 years ago

asfimport commented 3 years ago

The UnifiedHighlighter scores its highlighted passages using a modified term frequency calculation, similar to BM25. This means that two passages containing the same set of terms will score equivalently, however closely those terms are clustered. Given that proximity is often a reasonable proxy for relevance, and that passages already contain the offsets of their internal hits, it would be useful to add the option of also weighting by the distance between the start of the first hit and the end of the last hit within the passage.
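To make the idea concrete, here is a rough sketch of what such a weighting might look like. It is standalone Java and does not touch the Lucene API: the class name, both method names, the `pivot` parameter and the `pivot / (pivot + span)` formula are all assumptions made up for illustration, not anything that exists in PassageScorer today.

```java
// Hypothetical sketch, not Lucene code: all names and the formula are assumptions.
final class ProximityBoostSketch {

    /**
     * Distance from the start of the earliest hit to the end of the latest hit.
     * Only the first numMatches entries of each array are read, and the hits
     * are not assumed to be sorted by offset.
     */
    static int hitSpan(int[] matchStarts, int[] matchEnds, int numMatches) {
        if (numMatches <= 0) {
            return 0;
        }
        int first = Integer.MAX_VALUE;
        int last = Integer.MIN_VALUE;
        for (int i = 0; i < numMatches; i++) {
            first = Math.min(first, matchStarts[i]);
            last = Math.max(last, matchEnds[i]);
        }
        return last - first;
    }

    /**
     * Illustrative multiplier in (0, 1]: a span of 0 gives 1.0, and the value
     * shrinks as the hits spread further apart. The pivot controls how quickly.
     */
    static float proximityBoost(int span, float pivot) {
        return pivot / (pivot + Math.max(0, span));
    }

    public static void main(String[] args) {
        // Two passages with the same two hits, clustered vs. spread out.
        int tight = hitSpan(new int[] {10, 16}, new int[] {15, 22}, 2);
        int loose = hitSpan(new int[] {10, 180}, new int[] {15, 186}, 2);
        System.out.println("tight span=" + tight + " boost=" + proximityBoost(tight, 100f));
        System.out.println("loose span=" + loose + " boost=" + proximityBoost(loose, 100f));
    }
}
```

The existing passage score could then be multiplied by this factor when the option is enabled, so that passages with identical term frequencies are separated by how tightly their hits cluster.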


Migrated from LUCENE-10011 by Alan Woodward (@romseygeek)

asfimport commented 3 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Given that Passage#getLength() is only called by PassageScorer for use as a norm, we could just modify the return value here if we think that this sort of weighting will always be useful.
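A minimal sketch of that alternative, assuming the length fed into the norm is replaced by the hit span. This is standalone Java rather than a patch: `HitSpanNormSketch`, `hitSpanLength` and the constant values are illustrative assumptions, and `tf` is only a BM25-style length-normalised term frequency written out for the example, not a copy of the scorer's code.

```java
// Hypothetical sketch, not Lucene code: names and constants are assumptions.
final class HitSpanNormSketch {
    // Illustrative BM25-style parameters.
    static final float K1 = 1.2f;
    static final float B = 0.75f;
    static final float PIVOT = 87f;

    /** BM25-style term frequency, normalised by the supplied length. */
    static float tf(int freq, int length) {
        float norm = K1 * ((1 - B) + B * (length / PIVOT));
        return freq / (freq + norm);
    }

    /** Length measured from the first hit's start to the last hit's end. */
    static int hitSpanLength(int firstHitStart, int lastHitEnd) {
        return Math.max(0, lastHitEnd - firstHitStart);
    }

    public static void main(String[] args) {
        // Same term frequency, normalised by full passage length vs. hit span.
        System.out.println("passage length norm: " + tf(2, 200));
        System.out.println("hit span norm:       " + tf(2, hitSpanLength(10, 22)));
    }
}
```

With this approach no extra option is needed, since a shorter span simply yields a smaller norm and therefore a higher score. The trade-off is that the proximity weighting would then always apply.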