Closed scampi closed 1 week ago
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
What a fun and tricky corner case -- thank you @scampi for uncovering this, showing the bug with the added unit tests, and the tentative fix.
I think it is actually technically allowed to index term vectors with offsets only (no positions) and then hilite from there: Lucene is recording the character offsets of every occurence of each unique term, per document, and that's all that is needed to hilite. So I like your fix -- I'll merge it soon unless anyone objects.
Could you maybe add a lucene/CHANGES.txt
entry?
I think the intent of MATCH_WITH_NO_TERMS
is different: that is for queries against Lucene index structures that fundamenally do not track positional information (doc values, points). With term vectors, the positional information indeed exists, it's just that the user may choose to record only offsets.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
@mikemccand @romseygeek thanks for the review!
What a fun and tricky corner case -- thank you @scampi for uncovering this, showing the bug with the added unit tests, and the tentative fix.
I actually just contributed the fix, @hossman found the issue and posted the test.
Could you maybe add a lucene/CHANGES.txt entry?
I have added a commit for this.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
@dweiss thanks!
Fix #12431
Description
DefaultPassageFormatter
may cause anIndexOutOfBoundsException
in case matches of a passage are out of order. This happens when term vectors are stored but not the positions.The tentative solution consists in ordering matches according to offsets in
DisjunctionMatchesIterator
in case positions don't exist. However, I wonder if it is correct to create such an iterator when positions are not available. Matches with no terms are explicitly removed, and the javadoc ofMATCH_WITH_NO_TERMS
mentions that it indicates a match with no term positions. Therefore, from that doc it shouldn't be possible to highlight text without positions. Is that javadoc correct ? Should it be updated ?https://github.com/apache/lucene/blob/bc678ac67e32c55a27a4e8950c25144cc89cef66/lucene/core/src/java/org/apache/lucene/search/MatchesUtils.java#L67
https://github.com/apache/lucene/blob/bc678ac67e32c55a27a4e8950c25144cc89cef66/lucene/core/src/java/org/apache/lucene/search/MatchesUtils.java#L40-L44