apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

IndexWriter.getTragicException() may not reflect all corrupting exceptions (notably: NoSuchFileException) [LUCENE-8692] #9738

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Backstory...

Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's corruptFiles to introduce corruption into the "leader" node's index and then assert that this solr node gives up it's leadership of the shard and another replica takes over.

This can currently fail sporadically (but usually reproducibly - see SOLR-13237) due to the leader not giving up it's leadership even after the corruption causes an update/commit to fail. Solr's leadership code makes this decision after encountering an exception from the IndexWriter based on wether IndexWriter.getTragicException() is (non-)null.


While investigating this, I created an isolated Lucene-Core equivilent test that demonstrates the same basic situation:

It's fairly easy to make my new test fail reproducibly – in every situation I've seen the underlying exception is a NoSuchFileException (ie: the randomly introduced corruption was to delete some file).


Migrated from LUCENE-8692 by Chris M. Hostetter (@hossman), updated Mar 12 2019 Attachments: LUCENE-8692_test.patch, LUCENE-8692.patch (versions: 3) Linked issues:

asfimport commented 5 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I'm attaching a patch with my test – there are some nocommits related to ways i think the test could/should be improved, but as it stands it doesn't take much beasting to find some seeds that reproduce fairly reliably for me (on master)...

   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterExceptions -Dtests.method=testRandomCorruptionIsTragic -Dtests.seed=48825E06EC3D16C1 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=is-IS -Dtests.timezone=Etc/Zulu -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.47s | TestIndexWriterExceptions.testRandomCorruptionIsTragic <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: index update encountered throwable, but no tragic event recorded: java.nio.file.NoSuchFileException: _0_LuceneFixedGap_0.tib
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([48825E06EC3D16C1:A7D1C7299F88ED1B]:0)
   [junit4]    >    at org.apache.lucene.index.TestIndexWriterExceptions.testRandomCorruptionIsTragic(TestIndexWriterExceptions.java:2146)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene80): {field=PostingsFormat(name=LuceneFixedGap)}, docValues:{}, maxPointsInLeafNode=1182, maxMBSortInHeap=6.385378646488135, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@1139fb33), locale=is-IS, timezone=Etc/Zulu
   [junit4]   2> NOTE: Linux 3.19.0-84-generic amd64/Oracle Corporation 1.8.0_144 (64-bit)/cpus=4,threads=1,free=194368568,total=249561088
   [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriterExceptions]
   [junit4] Completed [1/1 (1!)] in 0.91s, 1 test, 1 failure <<< FAILURES!
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterExceptions -Dtests.method=testRandomCorruptionIsTragic -Dtests.seed=662E8085B38FC223 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=es-NI -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE 0.73s | TestIndexWriterExceptions.testRandomCorruptionIsTragic <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: index update encountered throwable, but no tragic event recorded: java.nio.file.NoSuchFileException: _0_TestBloomFilteredLucenePostings_0.doc
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([662E8085B38FC223:897D19AAC03A39F9]:0)
   [junit4]    >    at org.apache.lucene.index.TestIndexWriterExceptions.testRandomCorruptionIsTragic(TestIndexWriterExceptions.java:2146)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene80): {field=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128)))}, docValues:{}, maxPointsInLeafNode=191, maxMBSortInHeap=6.402103555115238, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@44d3ba93), locale=es-NI, timezone=Asia/Jerusalem
   [junit4]   2> NOTE: Linux 3.19.0-84-generic amd64/Oracle Corporation 1.8.0_144 (64-bit)/cpus=4,threads=1,free=229171096,total=249561088
   [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriterExceptions]
   [junit4] Completed [1/1 (1!)] in 1.18s, 1 test, 1 failure <<< FAILURES!

It's not clear to me if this is truly a bug in the IndexWriter's exception handling when propogating these NoSuchFileException or if MockDirectoryWrapper is being "too evil" and creating situations we don't expect IndexWriter to be able to recognize as tragic – allthough in my opinion it certainly seems like IW should treat files being deleted out from under it as tragic.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I had a quick look and every time the test failed for me it was in the maybeMerge call that is triggered by the commit() call, eg.

java.nio.file.NoSuchFileException: _0.fdx
    at org.apache.lucene.store.ByteBuffersDirectory.fileLength(ByteBuffersDirectory.java:155)
    at org.apache.lucene.store.MockDirectoryWrapper.fileLength(MockDirectoryWrapper.java:1009)
    at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:217)
    at org.apache.lucene.index.IndexWriter.registerMerge(IndexWriter.java:4176)
    at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2197)
    at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2154)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3455)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3407)
    at org.apache.lucene.index.TestIndexWriterExceptions.testRandomCorruptionIsTragic(TestIndexWriterExceptions.java:2136)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at java.lang.Thread.run(Thread.java:748)

IndexWriter#merge already treats any exception as a tragic exception, so I think it'd make sense to treat this exception as tragic as well.

asfimport commented 5 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Thanks for the sanity check Adrien.

I'm attaching a beefed up patch where i did the following...

There seems to be at least one other code path that's still problematic with NoSuchFileException: IndexWriter.startCommit ...

ant test  -Dtestcase=TestStressIndexing2 -Dtests.method=testRandomCorruptionIsTragic -Dtests.seed=9C4D4A5A52AD1D2F:94A0DF52F925D7B0 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=mk -Dtests.timezone=Pacific/Niue -Dtests.asserts=true -Dtests.file.encoding=US-ASCII

...

   [junit4]    > Throwable #1: java.lang.AssertionError: index update encountered throwable, but no tragic event recorded: java.nio.file.NoSuchFileException: /home/hossman/lucene/dev/lucene/build/core/test/J0/temp/lucene.index.TestStressIndexing2_9C4D4A5A52AD1D2F-001/index-SimpleFSDirectory-001/_0.si
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([9C4D4A5A52AD1D2F:94A0DF52F925D7B0]:0)
   [junit4]    >    at org.apache.lucene.index.TestStressIndexing2$CorruptibleIndexingThread.run(TestStressIndexing2.java:1025)
   [junit4]    >    at org.apache.lucene.index.TestStressIndexing2.testRandomCorruptionIsTragic(TestStressIndexing2.java:144)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
   [junit4]    >    Suppressed: java.nio.file.NoSuchFileException: /home/hossman/lucene/dev/lucene/build/core/test/J0/temp/lucene.index.TestStressIndexing2_9C4D4A5A52AD1D2F-001/index-SimpleFSDirectory-001/_0.si
   [junit4]    >        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
   [junit4]    >        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
   [junit4]    >        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
   [junit4]    >        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
   [junit4]    >        at org.apache.lucene.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:202)
   [junit4]    >        at org.apache.lucene.mockfile.DisableFsyncFS.newFileChannel(DisableFsyncFS.java:45)
   [junit4]    >        at org.apache.lucene.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:202)
   [junit4]    >        at org.apache.lucene.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:166)
   [junit4]    >        at org.apache.lucene.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:166)
   [junit4]    >        at org.apache.lucene.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:202)
   [junit4]    >        at java.nio.channels.FileChannel.open(FileChannel.java:287)
   [junit4]    >        at java.nio.channels.FileChannel.open(FileChannel.java:335)
   [junit4]    >        at org.apache.lucene.util.IOUtils.fsync(IOUtils.java:460)
   [junit4]    >        at org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:332)
   [junit4]    >        at org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:287)
   [junit4]    >        at org.apache.lucene.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:207)
   [junit4]    >        at org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4810)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3283)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3451)
   [junit4]    >        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3416)
   [junit4]    >        at org.apache.lucene.index.TestStressIndexing2$CorruptibleIndexingThread.run(TestStressIndexing2.java:1014)
   [junit4]    >        ... 37 more
asfimport commented 5 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I've updated the patch with 2 major changes:

There are still several nocommits in IndexWriter relating to code paths i'm just not clear on if/how we still need changes, but as things stand beasting results are relatively promising – as far as the test method passing ... the new test seems to have also uncovered some issues with IndexWriter leaking files in some tragic event situations, example...

   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestStressIndexing2 -Dtests.seed=B8840CCBBB15D5B1 -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=nl-BE -Dtests.timezone=America/Toronto -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.00s | TestStressIndexing2 (suite) <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: file handle leaks: [FileChannel(/home/hossman/lucene/dev/lucene/build/core/test/J0/temp/lucene.index.TestStressIndexing2_B8840CCBBB15D5B1-001/index-NIOFSDirectory-001/_0_MockRandom_0.sd)]
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([B8840CCBBB15D5B1]:0)
   [junit4]    >    at org.apache.lucene.mockfile.LeakFS.onClose(LeakFS.java:63)
   [junit4]    >    at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:77)
   [junit4]    >    at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:78)
   [junit4]    >    at org.apache.lucene.mockfile.FilterFileSystem.close(FilterFileSystem.java:78)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.Exception
   [junit4]    >    at org.apache.lucene.mockfile.LeakFS.onOpen(LeakFS.java:46)
   [junit4]    >    at org.apache.lucene.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:81)
   [junit4]    >    at org.apache.lucene.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:197)
   [junit4]    >    at org.apache.lucene.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:166)
   [junit4]    >    at org.apache.lucene.mockfile.FilterFileSystemProvider.newFileChannel(FilterFileSystemProvider.java:202)
   [junit4]    >    at java.nio.channels.FileChannel.open(FileChannel.java:287)
   [junit4]    >    at java.nio.channels.FileChannel.open(FileChannel.java:335)
   [junit4]    >    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:81)
   [junit4]    >    at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:751)
   [junit4]    >    at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
   [junit4]    >    at org.apache.lucene.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1038)
   [junit4]    >    at org.apache.lucene.codecs.mockrandom.MockRandomPostingsFormat.fieldsProducer(MockRandomPostingsFormat.java:270)
   [junit4]    >    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:288)
   [junit4]    >    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:368)
   [junit4]    >    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:113)
   [junit4]    >    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:83)
   [junit4]    >    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:172)
   [junit4]    >    at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:271)
   [junit4]    >    at org.apache.lucene.index.FrozenBufferedUpdates.openSegmentStates(FrozenBufferedUpdates.java:329)
   [junit4]    >    at org.apache.lucene.index.FrozenBufferedUpdates.forceApply(FrozenBufferedUpdates.java:222)
   [junit4]    >    at org.apache.lucene.index.FrozenBufferedUpdates.tryApply(FrozenBufferedUpdates.java:160)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$3(IndexWriter.java:2598)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5123)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3218)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3453)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3418)
   [junit4]    >    at org.apache.lucene.index.TestStressIndexing2$CorruptibleIndexingThread.run(TestStressIndexing2.java:1008)
   [junit4]    >    at org.apache.lucene.index.TestStressIndexing2.testRandomCorruptionIsTragic(TestStressIndexing2.java:144)

...those line numbers come from running the test with the full patch applied, but even if you revert the changes to IndexWriter and re-run the new test, the test methods all pass, but the suite exception still exists – meaning this seed results in corruption that's already correctly treated as tragic on master, but still results in IndexWriter leaking open files.


@jpountz: i'm out of my depth here, could you take another look and let me know what you think?

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

the new test seems to have also uncovered some issues with IndexWriter leaking files in some tragic event situations

This one is a bug in MockRandomPostingsFormat#fieldsProducer. It should use try-with-resources when opening in to make sure it eventually gets closed.

I think there is an issue with the patch with MergeAbortedExeption indeed given that registerMerge might throw such an exception. Maybe we should move this try block to registerMerge instead where we know which OneMerge is being registered (and is also where the exception is thrown when estimating the size of the merge).

@mikemccand @rmuir @s1monw You might want to have a look at this patch since you are more familiar with tragedy handling than I am.

asfimport commented 5 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think there is an issue with the patch with MergeAbortedExeption indeed given that registerMerge might throw such an exception. Maybe we should move this try block to registerMerge instead where we know which OneMerge is being registered (and is also where the exception is thrown when estimating the size of the merge).

+1

-    } catch (VirtualMachineError tragedy) {
+    } catch (Throwable tragedy) {
       tragicEvent(tragedy, "startCommit");

I am not sure why we need to treat every exception as fatal in this case?

I also wonder if we could move this to a PR on github, iterations would be simpler and comments too. I can't tell which patch is relevant which one isn't.

asfimport commented 5 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

This one is a bug in MockRandomPostingsFormat#fieldsProducer. It should use try-with-resources when opening in to make sure it eventually gets closed.

Got it ... fixed in latest patch.


I think there is an issue with the patch with MergeAbortedExeption indeed given that registerMerge might throw such an exception. Maybe we should move this try block to registerMerge instead where we know which OneMerge is being registered (and is also where the exception is thrown when estimating the size of the merge).

I'm not really understanding this suggestion...

  1. If I understand the code correctly, the entire point of registerMerge() throwing MergeAbortedExeption is so it propogates up the call stack and aborts all the surrounding merge logic. Calling handleMergeException() (typically swallowing the exception) inside registerMerge() seems like it would completely defeat the point?
  2. Even if we can/should/do push a try+catch+handleMergeException() call down into registerMerge(), we still need to try+catch+tragicEvent() in maybeMerge() because of other code paths that can result in unrecoverable exceptions (example: maybeMerge() -> updatePendingMerges() -> mergePolicy.findMerges() -> low level file corruption exception)

Either way, this discussion feels like it's about a potential bug/change that's really orthoginal to the question in this jira of correctly dealing with tragic/corruption exceptions, and really seems like it should probably be tracked in it's own jira: "MergeAbortedExeption during prepareCommit()/maybeMerge() calls are not being tracked/recorded by handleMergeException()" ... correct?

For now I've updated the patch to take the simplest possible approach to checking for MergeAbortedException


I am not sure why we need to treat every exception as fatal in this case?

Well, to flip your question around: is there an example of a Throwable you can think of bubbling up out of IndexWriter.startCommit() that should NOT be considered fatal?

I based that change on the existing catch (Throwable t) { ... tragicEvent(t, "finishCommit") logic in IW.finishCommit() (and likewise the similar catch Throwable -> tragicEvent handling in IW.merge(OneMerge) ... which Adrien's original comment lead me to follow as an example of what to do in IW.maybeMerge()) .

To me, catching Throwable here makes sense because it seemed that anything that might get thrown by startCommit/finishCommit/maybeMerge is ultimately something related to the underlying index that is entirely out of the caller's control, so seems to be tragic by definition – but I freely admit to not being entirely confident about it and I'm absolutely open to suggestions about how to deal with these corruption cases while keeping the scope of the catch blocks as narrow as possible. Should we just focus on VirtualMachineError + IOException ?

(see also my nocommit comments about the existing tragicEvent() call in prepareCommitInternal() ... but that hadn't triggered any failures in the test so I hadn't touched it)


I also wonder if we could move this to a PR on github, iterations would be simpler and comments too.

I don't use github for lucene development – I track all contributions as patches in the official issue tracker for the project as recommended by our official guidelines : )  ... but i'll go ahead and create a jira/LUCENE-8692 branch if that will help you review.

asfimport commented 5 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

(see also my nocommit comments about the existing tragicEvent() call in prepareCommitInternal() ... but that hadn't triggered any failures in the test so I hadn't touched it)

I spoke too soon – beasting just turned up this interesting little situation...

hossman@tray:\~/lucene/dev/lucene/core [master] $ ant beast -Dbeast.iters=100 -Dtests.iters=100 -Dtestcase=TestStressIndexing2 -Dtests.method=testRandomCorruptionIsTragic\*
...
 [beaster] Beast round 34 results: /home/hossman/lucene/dev/lucene/build/core/test/34
  [beaster] The following error occurred while executing this line:
  [beaster] /home/hossman/lucene/dev/lucene/common-build.xml:1572: The following error occurred while executing this line:
  [beaster] /home/hossman/lucene/dev/lucene/common-build.xml:1099: There were test failures: 1 suite, 100 tests, 1 failure [seed: CABE666E4674CFB2]
  [beaster] Executing 1 suite with 1 JVM.
  [beaster] 
  [beaster] Started J0 PID(10111@localhost).
  [beaster]   2> NOTE: reproduce with: ant test  -Dtestcase=TestStressIndexing2 -Dtests.method=testRandomCorruptionIsTragic -Dtests.seed=CABE666E4674CFB2 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=cs -Dtests.timezone=America/Nipigon -Dtests.asserts=true -Dtests.file.encoding=UTF-8
  [beaster] [15:50:16.736] FAILURE 0.02s | TestStressIndexing2.testRandomCorruptionIsTragic {seed=[CABE666E4674CFB2:682DC0F2BA2A235F]} <<<
  [beaster]    > Throwable #1: java.lang.AssertionError: index update encountered throwable, but no tragic event recorded: java.lang.AssertionError
  [beaster]    >    at __randomizedtesting.SeedInfo.seed([CABE666E4674CFB2:682DC0F2BA2A235F]:0)
  [beaster]    >    at org.junit.Assert.fail(Assert.java:88)
  [beaster]    >    at org.junit.Assert.assertTrue(Assert.java:41)
  [beaster]    >    at org.junit.Assert.assertNotNull(Assert.java:712)
  [beaster]    >    at org.apache.lucene.index.TestStressIndexing2$CorruptibleIndexingThread.run(TestStressIndexing2.java:1019)
  [beaster]    >    at org.apache.lucene.index.TestStressIndexing2.testRandomCorruptionIsTragic(TestStressIndexing2.java:144)
  [beaster]    >    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
  [beaster]    >    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  [beaster]    >    at java.lang.reflect.Method.invoke(Method.java:498)
  [beaster]    >    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  [beaster]    >    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  [beaster]    >    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  [beaster]    >    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  [beaster]    >    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  [beaster]    >    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  [beaster]    >    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  [beaster]    >    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  [beaster]    >    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  [beaster]    >    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  [beaster]    >    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  [beaster]    >    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  [beaster]    >    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  [beaster]    >    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  [beaster]    >    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  [beaster]    >    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  [beaster]    >    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  [beaster]    >    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  [beaster]    >    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  [beaster]    >    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  [beaster]    >    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  [beaster]    >    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  [beaster]    >    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  [beaster]    >    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  [beaster]    >    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  [beaster]    >    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  [beaster]    >    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  [beaster]    >    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  [beaster]    >    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  [beaster]    >    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  [beaster]    >    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  [beaster]    >    at java.lang.Thread.run(Thread.java:748)
  [beaster]    >    Suppressed: java.lang.AssertionError
  [beaster]    >        at org.apache.lucene.codecs.simpletext.SimpleTextDocValuesReader.getNumericNonIterator(SimpleTextDocValuesReader.java:184)
  [beaster]    >        at org.apache.lucene.codecs.simpletext.SimpleTextDocValuesReader.getNumeric(SimpleTextDocValuesReader.java:142)
  [beaster]    >        at org.apache.lucene.index.CodecReader.getNumericDocValues(CodecReader.java:137)
  [beaster]    >        at org.apache.lucene.index.ReadersAndUpdates$2.getNumeric(ReadersAndUpdates.java:373)
  [beaster]    >        at org.apache.lucene.codecs.simpletext.SimpleTextDocValuesWriter.addNumericField(SimpleTextDocValuesWriter.java:88)
  [beaster]    >        at org.apache.lucene.index.ReadersAndUpdates.handleDVUpdates(ReadersAndUpdates.java:368)
  [beaster]    >        at org.apache.lucene.index.ReadersAndUpdates.writeFieldUpdates(ReadersAndUpdates.java:570)
  [beaster]    >        at org.apache.lucene.index.ReaderPool.commit(ReaderPool.java:325)
  [beaster]    >        at org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3313)
  [beaster]    >        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3222)
  [beaster]    >        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3451)
  [beaster]    >        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3416)
  [beaster]    >        at org.apache.lucene.index.TestStressIndexing2$CorruptibleIndexingThread.run(TestStressIndexing2.java:1008)
  [beaster]    >        ... 36 more
  [beaster]   2> NOTE: leaving temporary files on disk at: /home/hossman/lucene/dev/lucene/build/core/test/J0/temp/lucene.index.TestStressIndexing2_CABE666E4674CFB2-001
  [beaster]   2> NOTE: test params are: codec=SimpleText, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@40b1325d), locale=cs, timezone=America/Nipigon
  [beaster]   2> NOTE: Linux 3.19.0-84-generic amd64/Oracle Corporation 1.8.0_144 (64-bit)/cpus=4,threads=1,free=203261928,total=249561088
  [beaster]   2> NOTE: All tests run in this JVM: [TestStressIndexing2]
  [beaster] 
  [beaster] Tests with failures [seed: CABE666E4674CFB2]:
  [beaster]   - org.apache.lucene.index.TestStressIndexing2.testRandomCorruptionIsTragic {seed=[CABE666E4674CFB2:682DC0F2BA2A235F]}

...I'm not sure how/why that assertion would have tripped let alone if/when AssertionErrors should be treated as tragic?

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit af9da98e87ee2a44532cd9baf2a1bc6e64865b0d in lucene-solr's branch refs/heads/jira/LUCENE-8692 from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=af9da98

LUCENE-8692: updates to TestStressIndexing2 demonstrating bug

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit 2b28d9d9d0cdc35dfdc202f23a9ef1d27b2a6083 in lucene-solr's branch refs/heads/jira/LUCENE-8692 from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2b28d9d

LUCENE-8692: add additional logging to Solr test that first surfaced bug

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit 4bd7d6810dbf50ae8f07c1051378f482ff953073 in lucene-solr's branch refs/heads/jira/LUCENE-8692 from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4bd7d68

LUCENE-8692: initial attempt at fixing bug

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit e93313d42c4e5642e58707b74ed119776086b590 in lucene-solr's branch refs/heads/jira/LUCENE-8692 from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e93313d

LUCENE-8692: fix file leak in MockRandomPostingsFormat that surfaces when corrupting files

asfimport commented 5 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

For now I've updated the patch to take the simplest possible approach to checking for MergeAbortedException

+1

Well, to flip your question around: is there an example of a Throwable you can think of bubbling up out of IndexWriter.startCommit() that should NOT be considered fatal?

I think we need to be careful here. From my perspective there are 3 types of exceptions here:

Those are in particular:

Those methods might cause documents go missing etc. but we treated them not as fatal or tragic events since a user could always call rollback() to go back the the last known safe-point / previous commit. Now we can debate if we want to change this and we can, in-fact I am all for making it even more strict especially since it's inconsistent with what we do if addDocument fails with an aborting exception. If we do that we need to see if rollback still has a purpose and maybe remove it?

now speaking of maybeMerge I don't see why we need to close the index writer with a tragic event, there is no dataloss nor an inconsistency? From that logic I don't think we need to handle these exceptions in such a drastic way?

I don't use github for lucene development – I track all contributions as patches in the official issue tracker for the project as recommended by our official guidelines : ) ... but i'll go ahead and create a jira/LUCENE-8692 branch if that will help you review.

Bummer, I am not sure branches help. Working like it's still 1999 is a pain we should fix our guidelines.

asfimport commented 5 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I think we need to be careful here. From my perspective there are 3 types of exceptions here:

Right .... ok. This is definitely more nuanced then I originally realized, I was lumping #2 and #3 into the same bucket in my head, and hadn't even remembered/considered the "rollback" situation.

FWIW: My "mental model" when I first raised this issue was:

But your (IIUC) point is that in the general case, there is no reason some of these code paths (like maybeMerge & startCommit) can/should assume that any exceptions bubbling up are in fact unrecoverable / traggic – in this test it may be, but the code doesn't know that: it's very possible that a rollback will work.


Now we can debate if we want to change this and we can, in-fact I am all for making it even more strict especially since it's inconsistent with what we do if addDocument fails with an aborting exception.

It definitely seems like there should be something we can/should do to better recognize situations like this as "unrecoverable" and be more strict in dealing with low level exceptions during things like commit – but I'm out definitely out of my depth in understanding/suggesting what that might look like.

Clearly the current patch is being too aggressive in what it treats as "tragic".

asfimport commented 5 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

It definitely seems like there should be something we can/should do to better recognize situations like this as "unrecoverable" and be more strict in dealing with low level exceptions during things like commit – but I'm out definitely out of my depth in understanding/suggesting what that might look like.

I agree with you here, I personally question the purpose of rollback since all the cases I have seen a missing rollback would simply mean dataloss. if somebody continues after a failed commit / prepareCommit / reopen they will end up with inconsistency and / or dataloss. I can't think of a reason why you would want to do it. I am curious what @mikemccand @jpountz @rmuir think about that. If we deprecated and remove rollback() we can be more agressive when it gets to tragic events and prevent users from continuing after such an exception by closing the writer automatically.

asfimport commented 5 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

rollback gives you a way to close IndexWriter without doing a commit, which seems useful.  If you removed that, what would users do instead?

asfimport commented 5 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

> rollback gives you a way to close IndexWriter without doing a commit, which seems useful. If you removed that, what would users do instead?

can't we expend close to close without commit? I mean we can keep rollback but bet more strict about exceptions during the commit and friends?