apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

NRT failure due to FieldInfo & File mismatch #13353

Closed benwtrent closed 1 month ago

benwtrent commented 1 month ago

Description

There has been a nasty test failure in ES for awhile: https://github.com/elastic/elasticsearch/issues/105122

The test simulates a document indexing failure. It turns out that this failure is caused by a series of strange conditions in Lucene. If indexing fails on one field, but the document also contains a points value field that comes AFTER the failing field, things blow up when opening a reader, provided the writer has soft-deletes enabled.

The failure description is as follows:

Test that replicates the failure:

```java
public void testExceptionJustBeforeFlushWithPointValues() throws Exception {
  Directory dir = newDirectory();
  Analyzer analyzer =
      new Analyzer(Analyzer.PER_FIELD_REUSE_STRATEGY) {
        @Override
        public TokenStreamComponents createComponents(String fieldName) {
          MockTokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
          tokenizer.setEnableChecks(
              false); // disable workflow checking as we forcefully close() in exceptional cases.
          TokenStream stream = new CrashingFilter(fieldName, tokenizer);
          return new TokenStreamComponents(tokenizer, stream);
        }
      };
  DirectoryReader r = null;
  IndexWriterConfig iwc =
      newIndexWriterConfig(analyzer).setCommitOnClose(false).setMaxBufferedDocs(3);
  MergePolicy mp = iwc.getMergePolicy();
  iwc.setMergePolicy(
      new SoftDeletesRetentionMergePolicy("soft_delete", MatchAllDocsQuery::new, mp));
  IndexWriter w = RandomIndexWriter.mockIndexWriter(dir, iwc, random());
  Document newdoc = new Document();
  newdoc.add(newTextField("crash", "do it on token 4", Field.Store.NO));
  newdoc.add(new IntPoint("int", 17));
  expectThrows(IOException.class, () -> w.addDocument(newdoc));
  try {
    r = w.getReader(false, false);
  } catch (AlreadyClosedException ace) {
    // expected
  }
  dir.close();
}
```

The exception thrown is:

        Caused by:
        java.io.FileNotFoundException: No sub-file with id .kdi found in compound file "_0.cfs" (fileName=_0.kdi files: [_Lucene99_0.tip, .nvm, .fnm, .tvd, _Lucene99_0.doc, _Lucene99_0.tim, _Lucene99_0.pos, .tvm, _Lucene99_0.tmd, .fdm, .nvd, .fdx, .tvx, .fdt])
            at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.openInput(Lucene90CompoundReader.java:170)
            at org.apache.lucene.codecs.lucene90.Lucene90PointsReader.<init>(Lucene90PointsReader.java:63)
            at org.apache.lucene.codecs.lucene90.Lucene90PointsFormat.fieldsReader(Lucene90PointsFormat.java:74)
            at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:152)
            ... 55 more

Version and environment details

No response

benwtrent commented 1 month ago

I am having a difficult time figuring out how to fix this. It seems to me that if the segment is "hard deleted", we should reset all its FieldInfos, since no data was written for it at all.

But I am not sure the individual processDoc action can do this, as it only knows about the documents it added.
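
A minimal sketch of that idea, purely for illustration (the `PendingSegment`, `numHardDeletes`, and `clearFieldInfos` names are hypothetical and do not correspond to actual Lucene internals):

```java
// Hypothetical sketch only: none of these names exist in Lucene. It just
// illustrates the idea of discarding the FieldInfos of a segment whose docs
// were all hard-deleted before flush, so the codec never tries to open files
// (.kdi/.kdd, vector files, ...) that were never written.
void maybeResetFieldInfos(PendingSegment segment) {
  if (segment.maxDoc() == segment.numHardDeletes()) {
    // Every doc in this in-memory segment was aborted: the FieldInfos collected
    // while processing them describe data that will never be flushed.
    segment.clearFieldInfos();
  }
}
```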

benwtrent commented 1 month ago

What makes matters worse is that it doesn't even have to be ALL docs that failed; it is enough that some of the failed docs had point values (or knn vector values, etc.). Any field type that eagerly updates FieldInfos but doesn't actually get flushed could trigger this weird behavior when opening the NRT reader, as in the sketch below.
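
For illustration, a hedged variant of the reproduction above using a knn vector field instead of a point field. This snippet is not from the issue; it assumes the same writer `w`, crashing analyzer on the "crash" field, and soft-deletes retention merge policy as the test earlier, and that `KnnFloatVectorField` behaves analogously to `IntPoint` here:

```java
// Hypothetical variant of the reproduction test, not taken from the issue.
// Assumes the same IndexWriter `w` and crashing analyzer as above.
Document doc = new Document();
doc.add(newTextField("crash", "do it on token 4", Field.Store.NO)); // analysis throws here
doc.add(new KnnFloatVectorField("vec", new float[] {1f, 2f, 3f})); // FieldInfo registered eagerly
expectThrows(IOException.class, () -> w.addDocument(doc));
// If the FieldInfos still claim the segment has vectors, opening an NRT reader
// would presumably go looking for vector files that were never flushed.
DirectoryReader r = w.getReader(false, false);
```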