apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

UnifiedHighlighter ANALYSIS mode does not accurately highlight SpanNotQuery or MUST_NOT [LUCENE-9426] #10466

Open asfimport opened 4 years ago

asfimport commented 4 years ago

If UnifiedHighlighter uses MemoryIndexOffsetStrategy, it does not handle SpanNotQuery correctly. Since UnifiedHighlighter runs an actual search to determine which locations to highlight, it should be consistent with search and highlight only locations in a document that really match the query. For SpanNotQuery it does not.

For the query

spanNot(spanNear([content:100, content:dollars], 1, true), content:thousand, 0, 0)

it produces

A <b>100</b> fucking <b>dollars</b> wasn't enough to fix it. ... We need <b>100</b> thousand <b>dollars</b> to buy the house

The second sentence should not be highlighted, because the excluded term thousand lies inside the span matched by spanNear.


Migrated from LUCENE-9426 by Christoph Goller

Environment: I tested with 8.5.1, but other versions are probably also affected.

Attachments: TestUnifiedHighlighter.java

asfimport commented 4 years ago

Christoph Goller (migrated from JIRA)

Analysis:

With PostingsOffsetStrategy, highlighting for SpanNotQuery works correctly.

With MemoryIndexOffsetStrategy, UnifiedHighlighter creates an in-memory index of the document to be highlighted. However, it does not index the token stream produced by the indexAnalyzer as is. Instead it applies a FilteringTokenFilter that throws away all tokens that do not occur in the query. I guess this is done for efficiency reasons. The filter is based on an automaton built by MultiTermHighlighting. MultiTermHighlighting is based on the visitor concept and ignores all subqueries that have BooleanClause.Occur.MUST_NOT. While this may be correct for a Boolean NOT query, it is not correct for a SpanNotQuery. In the above example the in-memory index needs the excluded token (thousand); without it the query logic is corrupted.
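To make the effect concrete, here is a minimal sketch in plain Java that imitates the behaviour described above. It is not Lucene code: the class SpanNotFilterSketch, its Tok type and the methods analyze, filter and matches are hypothetical stand-ins, and it assumes the filter drops tokens but keeps the positions of the remaining ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the problem; none of these types are Lucene classes.
class SpanNotFilterSketch {

    // A token together with its position (word index) in the text.
    static final class Tok {
        final String term;
        final int pos;
        Tok(String term, int pos) { this.term = term; this.pos = pos; }
    }

    // Hypothetical analyzer: lowercase, split on whitespace.
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        String[] words = text.toLowerCase().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(new Tok(words[i], i));
        }
        return out;
    }

    // Imitates the keep-word filter: drop tokens not in `keep`,
    // leaving the positions of the surviving tokens unchanged (assumption).
    static List<Tok> filter(List<Tok> toks, Set<String> keep) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : toks) {
            if (keep.contains(t.term)) out.add(t);
        }
        return out;
    }

    // Models spanNot(spanNear([first, second], slop, inOrder=true), exclude, 0, 0):
    // true if some in-order first..second span within `slop` contains no `exclude` token.
    static boolean matches(List<Tok> toks, String first, String second, int slop, String exclude) {
        for (Tok a : toks) {
            if (!a.term.equals(first)) continue;
            for (Tok b : toks) {
                if (!b.term.equals(second)) continue;
                int gap = b.pos - a.pos - 1;
                if (gap < 0 || gap > slop) continue;
                boolean excluded = false;
                for (Tok x : toks) {
                    if (x.term.equals(exclude) && x.pos >= a.pos && x.pos <= b.pos) excluded = true;
                }
                if (!excluded) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Tok> full = analyze("We need 100 thousand dollars to buy the house");
        // Only the terms of the positive part of the query survive the filter.
        List<Tok> filtered = filter(full, new HashSet<>(Arrays.asList("100", "dollars")));

        // prints "full stream matches: false"
        System.out.println("full stream matches: " + matches(full, "100", "dollars", 1, "thousand"));
        // prints "filtered stream matches: true"
        System.out.println("filtered stream matches: " + matches(filtered, "100", "dollars", 1, "thousand"));
    }
}
```

On the full token stream the span is rejected because thousand sits between 100 and dollars. Once the filter has removed thousand, nothing is left to exclude the span, so it is highlighted.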

 

As a fix I recommend adding all tokens from the query, even if they have BooleanClause.Occur.MUST_NOT. The index still remains small, but the query logic will be correct.

I attach a unit test that demonstrates the problem.
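The recommended fix could look roughly like the following sketch of the term collection step. This is only an illustration under my own assumptions: TermCollectionSketch, Node, Occur and collect are hypothetical names for a toy query tree, not Lucene's QueryVisitor or MultiTermHighlighting, and the includeMustNot flag is invented to show both behaviours side by side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy query tree used only for illustration; these are NOT Lucene classes.
class TermCollectionSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    // A node is either a leaf (term != null) or a group of (occur, child) clauses.
    static final class Node {
        final String term;
        final List<Occur> occurs = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        Node(String term) { this.term = term; }

        Node add(Occur occur, Node child) {
            occurs.add(occur);
            children.add(child);
            return this;
        }
    }

    // Collects the terms that the keep-word filter would let through.
    // includeMustNot=false models the current behaviour, true models the proposed fix.
    static Set<String> collect(Node root, boolean includeMustNot) {
        Set<String> out = new TreeSet<>();
        collectInto(root, includeMustNot, out);
        return out;
    }

    private static void collectInto(Node n, boolean includeMustNot, Set<String> out) {
        if (n.term != null) {
            out.add(n.term);
            return;
        }
        for (int i = 0; i < n.children.size(); i++) {
            // Skipping MUST_NOT children is what loses the excluded term.
            if (n.occurs.get(i) == Occur.MUST_NOT && !includeMustNot) continue;
            collectInto(n.children.get(i), includeMustNot, out);
        }
    }

    public static void main(String[] args) {
        // Models spanNot(spanNear([100, dollars]), thousand).
        Node near = new Node(null)
                .add(Occur.MUST, new Node("100"))
                .add(Occur.MUST, new Node("dollars"));
        Node spanNot = new Node(null)
                .add(Occur.MUST, near)
                .add(Occur.MUST_NOT, new Node("thousand"));

        System.out.println(collect(spanNot, false)); // prints [100, dollars]
        System.out.println(collect(spanNot, true));  // prints [100, dollars, thousand]
    }
}
```

With the excluded term in the collected set, the in-memory index grows by only the tokens named in the query, so it stays small while the SpanNot logic sees the token it needs.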

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

As a fix I recommend adding all tokens from the query, even if they have BooleanClause.Occur.MUST_NOT. The index still remains small, but the query logic will be correct.

Makes sense to me. Alternatively, if what you suggest is too difficult, just disable this optimization whenever any NOT is found.

Note that if you put offsets in your main index for this field, you won't be susceptible to this problem, because there will be no re-analysis + MemoryIndex.