UnifiedHighlighter incorrectly returns field 'X' was indexed without offsets

mayya-sharipova commented 8 months ago

Description

When highlighting is based on weight matches, UnifiedHighlighter incorrectly throws an IllegalArgumentException: field 'X' was indexed without offsets, cannot highlight

Test to reproduce:

    static final FieldType textType = new FieldType(TextField.TYPE_STORED);
    static {
        textType.setStoreTermVectors(true);
        textType.setStoreTermVectorPositions(true);
        textType.setStoreTermVectorOffsets(true);
        textType.freeze();
    }

    public void testHighlight() {
        String indexPath = "../lucene-test-indices/index1";
        Path path = Paths.get(indexPath);
        try {
            Directory directory = NIOFSDirectory.open(path);
            Analyzer analyzer = new ClassicAnalyzer();
            IndexWriterConfig config = new IndexWriterConfig(analyzer);

            try (IndexWriter writer = new IndexWriter(directory, config)) {
                addDoc(writer, "The quick brown fox jumps over the lazy dog");
            }

            try (IndexReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new IntervalQuery("content",
                        Intervals.analyzedText("quick brown fox jumps over the lazy dog", analyzer, "content", 0, true));
                TopDocs topDocs = searcher.search(query, 10);

                UnifiedHighlighter.Builder uhBuilder = new UnifiedHighlighter.Builder(searcher, analyzer)
                        .withWeightMatches(true);
                UnifiedHighlighter highlighter = new UnifiedHighlighter(uhBuilder);

                String[] highlights = highlighter.highlight("content", query, topDocs, 1);
                System.out.println(Arrays.toString(highlights));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void addDoc(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        doc.add(new Field("content", content, textType));
        writer.addDocument(doc);
    }

produces an error:

java.lang.IllegalArgumentException: field 'content' was indexed without offsets, cannot highlight

    at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightOffsetsEnums(FieldHighlighter.java:157)
    at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:83)
    at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:944)
    at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:814)
    at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:792)
    at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlight(UnifiedHighlighter.java:725)

A workaround is to disable match-based highlighting:

    UnifiedHighlighter.Builder uhBuilder = new UnifiedHighlighter.Builder(searcher, analyzer)
            .withWeightMatches(false);

This happens because ClassicAnalyzer removes stop words, which leads to the use of an ExtendedIntervalsSource, and that source returns -1 for offsets.

Version and environment details

Lucene 9.9.1

scampi commented 3 weeks ago

ExtendedIntervalsSource explicitly returns -1, and this was done in https://github.com/apache/lucene/pull/803 (ticket LUCENE-10229).

https://github.com/apache/lucene/blob/6d987e1ce1c3f3215633a979ce048829fe1bb6ed/lucene/queries/src/java/org/apache/lucene/queries/intervals/ExtendedIntervalsSource.java#L89-L94

From the ticket:

The reason extend does not work for highlighting is that, quite reasonably, it can only return the offsets delegated from the source interval. Once you shift left or right from the source interval's position, the offset information cannot be retrieved (because this would require per-document, random-access position-offset map to be present somewhere).

Is it therefore expected that your example fails, or is it an edge case that the ticket did not cover? What would the expected output be?

This happens because ClassicAnalyzer removes stop words, which leads to the use of an ExtendedIntervalsSource, and that source returns -1 for offsets.

To be precise, it fails when highlighting "lazy": an ExtendedIntervalsSource is created to account for the preceding stop word ("the") that the analyzer removed, and that source returns -1 for offsets during highlighting.
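
To illustrate the position gap, here is a minimal toy model in plain Java. It is not Lucene's analysis chain: the class StopWordGapSketch, its Token type, and the one-word stop set are all assumptions made for illustration (ClassicAnalyzer's default stop set is larger). It only shows that a removed stop word still consumes a position, so "lazy" ends up two positions after "over".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of stop-word removal. Hypothetical names; NOT Lucene's implementation.
final class StopWordGapSketch {

    /** A kept token with its term position and character offsets in the original text. */
    static final class Token {
        final String term;
        final int position;
        final int startOffset;
        final int endOffset;

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + " pos=" + position + " offsets=[" + startOffset + "," + endOffset + ")";
        }
    }

    /** Splits on single spaces, lowercases, and drops stop words while still advancing the position. */
    static List<Token> analyze(String text, Set<String> stopWords) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int offset = 0;
        for (String word : text.split(" ")) {
            String term = word.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(term)) {
                tokens.add(new Token(term, position, offset, offset + word.length()));
            }
            position++;                  // a removed stop word still consumes a position
            offset += word.length() + 1; // +1 for the separating space
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : analyze("The quick brown fox jumps over the lazy dog", Set.of("the"))) {
            System.out.println(t);
        }
    }
}
```

In this model "over" sits at position 5 and "lazy" at position 7. An interval that has to be extended backwards over the empty position 6 has no token there to take an offset from.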

scampi commented 3 weeks ago

In OffsetsFromPositions there is some logic to get offsets from positions.

https://github.com/apache/lucene/blob/53d1c2bd2fb3e6b9da590bee360996dbbdc8ea34/lucene/highlighter/src/java/org/apache/lucene/search/matchhighlight/OffsetsFromPositions.java#L62

Would it make sense to apply similar logic in FieldHighlighter for the case where offsets are missing because an ExtendedIntervalsSource is in use?

https://github.com/apache/lucene/blob/53d1c2bd2fb3e6b9da590bee360996dbbdc8ea34/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java#L159-L162
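
As a rough sketch of the idea, assuming the stored text can be re-analyzed to collect each token's position and offsets, a position-to-offset map could be used to recover a character range for an interval given only its positions. The class PositionOffsetSketch and its methods are hypothetical and do not exist in Lucene; this is not how OffsetsFromPositions or FieldHighlighter is implemented.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a per-document position -> offsets lookup. Not a Lucene API.
final class PositionOffsetSketch {

    /** Maps a term position to its {startOffset, endOffset} in the stored text. */
    private final TreeMap<Integer, int[]> offsetsByPosition = new TreeMap<>();

    void addToken(int position, int startOffset, int endOffset) {
        offsetsByPosition.put(position, new int[] {startOffset, endOffset});
    }

    /**
     * Returns {startOffset, endOffset} covering the tokens found at positions
     * fromPosition..toPosition (inclusive), or null if no token falls in that range.
     */
    int[] offsetsFor(int fromPosition, int toPosition) {
        Map.Entry<Integer, int[]> first = offsetsByPosition.ceilingEntry(fromPosition);
        Map.Entry<Integer, int[]> last = offsetsByPosition.floorEntry(toPosition);
        if (first == null || last == null || first.getKey() > last.getKey()) {
            return null; // only removed tokens (gaps) in the requested range
        }
        return new int[] {first.getValue()[0], last.getValue()[1]};
    }

    public static void main(String[] args) {
        PositionOffsetSketch map = new PositionOffsetSketch();
        // Tokens of "The quick brown fox jumps over the lazy dog" around the removed "the".
        map.addToken(5, 26, 30); // over
        map.addToken(7, 35, 39); // lazy
        map.addToken(8, 40, 43); // dog
        // Interval for "lazy" extended one position back over the removed stop word.
        int[] offsets = map.offsetsFor(6, 7);
        System.out.println(offsets[0] + "-" + offsets[1]);
    }
}
```

The trade-off is the one the ticket describes: the map has to be built per document, which here means re-analyzing the text or reading term vectors, so it adds cost compared with offsets delegated directly from the source interval.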

apache / lucene

UnifiedHighlighter incorrectly returns field 'X' was indexed without offsets #13103
