Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches

scampi commented 2 months ago

test: add unit tests to reproduce the IndexOutOfBoundsException in DefaultPassageFormatter
doc: clarify javadoc of MatchesIterator
fix: sort passages by offset if positions are missing

Fix #12431

Description

DefaultPassageFormatter may cause an IndexOutOfBoundsException in case matches of a passage are out of order. This happens when term vectors are stored but not the positions.

The tentative solution consists in ordering matches according to offsets in DisjunctionMatchesIterator in case positions don't exist. However, I wonder if it is correct to create such an iterator when positions are not available. Matches with no terms are explicitly removed, and the javadoc of MATCH_WITH_NO_TERMS mentions that it indicates a match with no term positions. Therefore, from that doc it shouldn't be possible to highlight text without positions. Is that javadoc correct ? Should it be updated ?

https://github.com/apache/lucene/blob/bc678ac67e32c55a27a4e8950c25144cc89cef66/lucene/core/src/java/org/apache/lucene/search/MatchesUtils.java#L67

https://github.com/apache/lucene/blob/bc678ac67e32c55a27a4e8950c25144cc89cef66/lucene/core/src/java/org/apache/lucene/search/MatchesUtils.java#L40-L44

github-actions[bot] commented 1 month ago

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

mikemccand commented 1 month ago

What a fun and tricky corner case -- thank you @scampi for uncovering this, showing the bug with the added unit tests, and the tentative fix.

I think it is actually technically allowed to index term vectors with offsets only (no positions) and then hilite from there: Lucene is recording the character offsets of every occurence of each unique term, per document, and that's all that is needed to hilite. So I like your fix -- I'll merge it soon unless anyone objects.

Could you maybe add a lucene/CHANGES.txt entry?

I think the intent of MATCH_WITH_NO_TERMS is different: that is for queries against Lucene index structures that fundamenally do not track positional information (doc values, points). With term vectors, the positional information indeed exists, it's just that the user may choose to record only offsets.

github-actions[bot] commented 3 weeks ago

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

scampi commented 3 weeks ago

@mikemccand @romseygeek thanks for the review!

What a fun and tricky corner case -- thank you @scampi for uncovering this, showing the bug with the added unit tests, and the tentative fix.

I actually just contributed the fix, @hossman found the issue and posted the test.

Could you maybe add a lucene/CHANGES.txt entry?

I have added a commit for this.

github-actions[bot] commented 1 week ago

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

scampi commented 1 week ago

@dweiss thanks!

apache / lucene

Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches #13315

Description