Closed asfimport closed 7 months ago
Michael McCandless (@mikemccand) (migrated from JIRA)
+1, let's not be heroic here: IW's exc handling is terrifying.
Michael McCandless (@mikemccand) (migrated from JIRA)
I'll try to tackle this (make abort a tragedy): it's a ridiculous situation, today, that IW can throw an exception and it turns out that exception silently just deleted tons of previously indexed documents and when you close/commit the IW, they are gone.
Michael McCandless (@mikemccand) (migrated from JIRA)
Patch. I added new AbortIndexWriterException; it is package private, and is only used to communicate internally in IndexWriter (other approaches I tried kept leading to deadlock). I removed the "andReset" from DocumentsWriterPerThread.checkAndResetHasAborted: I think it's dangerous to ever set hasAborted back to false. It's like unplugging your carbon monoxide detector because it's making too much noise.
This is an important behavior change.
First off, an aborting exception, which is an exception that strikes at a "bad" time (e.g. when appending to stored fields or term vector files) such that the entire segment is now unusable, will now forcefully close the IndexWriter. This is so you know you lost any uncommitted data.
Second off, when this happens, I don't think IndexWriter should try to be so crazy about deleting every last unref'd file when disaster strikes: if your house is burning down, you don't worry about washing the dirty dishes.
This change made a number of tests angry (IW suddenly closing, and also leaving unref'd files), and I did distributed beasting to try to ferret them out, but I expect we'll have a long tail of Jenkins failures after committing this.
Robert Muir (@rmuir) (migrated from JIRA)
awesome!
Can we shorten the name of the exception? I don't like AbortIndexWriterException for an internal one, i think its too complicated. Can it either be AbortingException or IndexWriter.AbortingException? Can we use try-multi-catch instead of java 1.6-style? I will take a deeper look after coffee.
Robert Muir (@rmuir) (migrated from JIRA)
Here is patch with my suggestions. I didn't like having two tragicEvent methods on IndexWriter, to me thats confusing. so the single one just unboxes AbortingException and there is one codepath to worry about there, and for each case.
I am still confused about the logic in various places in DocumentsWriter (I didnt try to tackle this here). It seems wierd we are both catching exception/handlign stuff in finally block but then "asking" the dwpt if it has aborted. Is there a cleaner way?
I am happy AbortingException is a checked one, its a good use here. Its fine if it nests in IndexWriter.java to be clear too. But its already a terminology used in the codebase so i think its the right name.
Robert Muir (@rmuir) (migrated from JIRA)
What is happening with TestIndexWriterAbort? I only added that recently to test we were doing the right thing so that abort() could be removed from the codec API (#7144).
But with this patch, the unreferenced files check is disabled (and should be?), so the patch doesnt offer anything that TestIndexWriterExceptions2 isn't already checking.
What is happening with TestIndexWriterDelete.testErrorInDocsWriterAdd(), it still causes aborting exceptions and checks that unreferenced files are removed. Why is it passing?
Michael McCandless (@mikemccand) (migrated from JIRA)
New patch, iterating from Rob's last patch.
I cleaned up the exc handling somewhat, and fixed the places that "want to abort" to simply throw AbortingException instead of settings confusing booleans which are checked in finally clauses up the stack.
Tests seem to be passing under some distributed beasting ...
Robert Muir (@rmuir) (migrated from JIRA)
Overall looks good. I have not nitpicked, but we could stare at it for days and maybe not see bugs. I would rather us get it in jenkins sooner and iterate on adding more testing around this to really harden indexwriter for good, and not feel so scared in the future when trying to do cleanups.
What is IgnoreAlreadyClosedExceptionConcurrentMergeScheduler? This is a little awkward and I can't figure out why we need it, or if we could do something else better instead.
Michael McCandless (@mikemccand) (migrated from JIRA)
What is IgnoreAlreadyClosedExceptionConcurrentMergeScheduler?
It's for tests that use CMS and also abort the IW, since CMS can hit ACE in such cases and it's "ok".
We will likely need to use it in more tests...
ASF subversion and git services (migrated from JIRA)
Commit 1643432 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1643432
LUCENE-5987: IndexWriter forcefully closes itself on hitting aborting exception
ASF subversion and git services (migrated from JIRA)
Commit 1643466 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1643466
LUCENE-5987: IndexWriter forcefully closes itself on hitting aborting exception
ASF subversion and git services (migrated from JIRA)
Commit 1643661 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1643661
LUCENE-5987: remove dead code
ASF subversion and git services (migrated from JIRA)
Commit 1644072 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1644072
LUCENE-5987: don't double-wrap AbortingException
ASF subversion and git services (migrated from JIRA)
Commit 1644075 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1644075
LUCENE-5987: don't double-wrap AbortingException
IndexWriter's exception handling is overly complicated. Every method in general reads like this:
Part of the problem is it acts like its an invincible superhero, e.g. can take a disk full on merge or flush to the face and just keep on trucking, and you can somehow fix the root cause and then just go about making commits on the same instance.
But we have a hard enough time ensuring exceptions dont do the wrong thing (e.g. cause corruption), and I don't think we really test this crazy behavior anywhere: e.g. making commits AFTER hitting disk full and so on.
It would probably be simpler if when such things happen, IW just considered them "tragic" just like OOM and rolled itself back, instead of doing all kinds of really scary stuff to try to "keep itself healthy" (like the little dance it plays with IFD in mergeMiddle manually deleting CFS files).
Besides, without something like a WAL, Indexwriter isn't really fit to be a superhero anyway: it can't prevent you from losing data in such situations. It just doesn't have the right tools for the job.
edit: just to be clear I am referring to abort (low level exception during flush) and exceptions during merge. For simple non-aborting cases like analyzer errors, of course we can deal with this. We already made great progress on turning a lot of BS exceptions that would cause aborts into non-aborting ones recently.
Migrated from LUCENE-5987 by Robert Muir (@rmuir), updated May 09 2016 Attachments: LUCENE-5987.patch (versions: 3)