Robert Muir (@rmuir) (migrated from JIRA)
Here is a patch that just adds the logic to MDW. Some of the fails are false, e.g. tests that run directly against IFD or the directory (these can just disable the option). But some, e.g. the CFS creation fails, are real.
Robert Muir (@rmuir) (migrated from JIRA)
Patch fixing 3 bugs so far (Lucene40SIWriter, Lucene46SIWriter, CompoundFileWriter). There might be more bugs: we should review all uses of Directory.deleteFile to make sure we are doing the right thing.
I also fixed up core tests that currently rely upon e.g. the unref'ed files check, or that manipulate files directly, to disable the option.
I may have made a mistake or uncovered something in the disk full test that I haven't investigated yet:
[junit4] Suite: org.apache.lucene.index.TestIndexWriterOnDiskFull
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterOnDiskFull -Dtests.method=testImmediateDiskFull -Dtests.seed=2D75D397EE0B3214 -Dtests.locale=ga_IE -Dtests.timezone=Europe/Lisbon -Dtests.file.encoding=ISO-8859-1
[junit4] ERROR 0.20s | TestIndexWriterOnDiskFull.testImmediateDiskFull <<<
[junit4] > Throwable #1: java.io.EOFException: read past EOF: RAMInputStream(name=segments_1)
[junit4] > at __randomizedtesting.SeedInfo.seed([2D75D397EE0B3214:BC3341FB920ADCD0]:0)
[junit4] > at org.apache.lucene.store.RAMInputStream.switchCurrentBuffer(RAMInputStream.java:98)
[junit4] > at org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:71)
[junit4] > at org.apache.lucene.store.MockIndexInputWrapper.readByte(MockIndexInputWrapper.java:122)
[junit4] > at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
[junit4] > at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
[junit4] > at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:356)
[junit4] > at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:463)
[junit4] > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:804)
[junit4] > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:650)
[junit4] > at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:459)
[junit4] > at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:398)
[junit4] > at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:207)
[junit4] > at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:711)
[junit4] > at org.apache.lucene.index.TestIndexWriterOnDiskFull.testImmediateDiskFull(TestIndexWriterOnDiskFull.java:569)
[junit4] > at java.lang.Thread.run(Thread.java:745)
Maybe something isn't quite right in the Windows handling there?
Uwe Schindler (@uschindler) (migrated from JIRA)
I like the virus scanner, just doing nothing - only holding files open :-) I wish all virus scanners would do nothing!
Michael McCandless (@mikemccand) (migrated from JIRA)
+1, this is a nice new evilness for MDW!
But that EOFE is terrifying?
Robert Muir (@rmuir) (migrated from JIRA)
If it's because it writes an incomplete commit and won't fall back to "nothing", maybe instead of weakening the test we can just do an empty commit up front so we still keep coverage.
It's worth it for the disk full tests; I don't really want them lenient.
Michael McCandless (@mikemccand) (migrated from JIRA)
OK the EOFE is "just" because the very first commit is corrupt.
It happens with this seed because MDW throws an exc when IW is writing segments_1, and then IW tries to remove segments_1 and MDW throws another exception (new virus checker in this patch) and so a corrupt segments_1 is left. If there were a prior commit, then at read time we would fall back to it.
So net/net I don't think there's anything to fix here, except +1 to just have the test make an empty first commit (before any MDW exceptions are enabled).
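The fallback described above can be sketched in isolation. This is only an illustration, not Lucene's actual SegmentInfos logic: the class CommitFallbackSketch, the CommitReader interface and the method readNewestReadableCommit are hypothetical names. It shows why a corrupt segments_1 with no prior commit surfaces as an exception, while a corrupt newer commit with an older one behind it does not.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: try the newest commit first, fall back to older ones. */
class CommitFallbackSketch {
  /** Hypothetical reader; throws IOException if the commit file is corrupt. */
  interface CommitReader {
    String read(long generation) throws IOException;
  }

  static String readNewestReadableCommit(List<Long> generations, CommitReader reader)
      throws IOException {
    List<Long> newestFirst = new ArrayList<>(generations);
    newestFirst.sort(Collections.reverseOrder());
    IOException firstFailure = null;
    for (long gen : newestFirst) {
      try {
        return reader.read(gen);
      } catch (IOException e) {
        // Remember the failure from the newest commit, then try an older one.
        if (firstFailure == null) {
          firstFailure = e;
        }
      }
    }
    // Nothing readable: with a single corrupt segments_1 we end up here.
    if (firstFailure != null) {
      throw firstFailure;
    }
    throw new IOException("no commits found");
  }

  public static void main(String[] args) throws IOException {
    // Assumption for the demo: generation 2 is corrupt, generation 1 is an empty first commit.
    CommitReader reader = gen -> {
      if (gen == 2L) {
        throw new EOFException("read past EOF: segments_2");
      }
      return "commit-" + gen;
    };
    System.out.println(readNewestReadableCommit(Arrays.asList(1L, 2L), reader));
  }
}
```

With an empty first commit in place, a failure while writing the next commit still leaves something to fall back to, which is why the test can stay strict.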
Robert Muir (@rmuir) (migrated from JIRA)
Updated patch. I also fixed a few more false fails.
Still, in general, there are interesting failures every time you run core tests with the patch. testThreadInterruptDeadlock got angry because write.lock couldn't be removed; I need to investigate that deletion further.
I also haven't looked at this:
[junit4] Suite: org.apache.lucene.index.TestCodecHoldsOpenFiles
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestCodecHoldsOpenFiles -Dtests.method=test -Dtests.seed=1908AF7C5FA5D64A -Dtests.locale=sl -Dtests.timezone=Asia/Bangkok -Dtests.file.encoding=ISO-8859-1
[junit4] ERROR 0.01s J1 | TestCodecHoldsOpenFiles.test <<<
[junit4] > Throwable #1: java.io.FileNotFoundException: segments_1 in dir=RAMDirectory@25ac448 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@150c542d
[junit4] > at __randomizedtesting.SeedInfo.seed([1908AF7C5FA5D64A:915C90A6F159BBB2]:0)
[junit4] > at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:593)
[junit4] > at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:106)
[junit4] > at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:347)
[junit4] > at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:458)
[junit4] > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:794)
[junit4] > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640)
[junit4] > at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:454)
[junit4] > at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:398)
Robert Muir (@rmuir) (migrated from JIRA)
I think it's just the long tail left now:
Robert Muir (@rmuir) (migrated from JIRA)
I fixed the CodecHoldsOpenFiles test. It was buggy before: it might not even ever commit due to close randomization, and the checkIndex at the end would just do nothing.
ASF subversion and git services (migrated from JIRA)
Commit 1620340 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620340
LUCENE-5904: make branch
ASF subversion and git services (migrated from JIRA)
Commit 1620341 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620341
LUCENE-5904: current patch
ASF subversion and git services (migrated from JIRA)
Commit 1620342 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620342
LUCENE-5904: MDW confesses when virus checker kicks in, if you run verbose
ASF subversion and git services (migrated from JIRA)
Commit 1620343 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620343
LUCENE-5904: fix false test failure
ASF subversion and git services (migrated from JIRA)
Commit 1620418 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620418
LUCENE-5904: fix false fails
ASF subversion and git services (migrated from JIRA)
Commit 1620421 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620421
LUCENE-5904: fix false fail
Robert Muir (@rmuir) (migrated from JIRA)
Another real bug, I think, in IndexFileDeleter (found by TestCrash).
IW crashes or something, and we have some leftover files (like _0.si, imagine from an initial empty commit).
When we boot up a new IW, it tries to delete the trash, but for some reason it temporarily cannot delete _0.si. Then we go and flush a real segment _0, and only afterwards does IFD come back around and delete _0.si, which is now a legit file, corrupting the index.
It's caused by the filename reuse problem (#6965).
ASF subversion and git services (migrated from JIRA)
Commit 1620451 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620451
LUCENE-5904: improve debuggability on fail
ASF subversion and git services (migrated from JIRA)
Commit 1620575 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620575
LUCENE-5904: fix replicator tests
ASF subversion and git services (migrated from JIRA)
Commit 1620576 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620576
LUCENE-5904: fix false failures
ASF subversion and git services (migrated from JIRA)
Commit 1620580 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620580
LUCENE-5904: fix fails
ASF subversion and git services (migrated from JIRA)
Commit 1620582 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620582
LUCENE-5904: fix false fails
ASF subversion and git services (migrated from JIRA)
Commit 1620584 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620584
LUCENE-5904: fix false fail
ASF subversion and git services (migrated from JIRA)
Commit 1620601 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620601
LUCENE-5904: add explicit test
ASF subversion and git services (migrated from JIRA)
Commit 1620753 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620753
LUCENE-5904: first cut at gen inflation to prevent index corruption when files are re-used after first IW has unclean shutdown and 2nd IW encounters virus checker
ASF subversion and git services (migrated from JIRA)
Commit 1620755 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620755
LUCENE-5904: set MDW back to try to provoke more fails
ASF subversion and git services (migrated from JIRA)
Commit 1620756 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620756
LUCENE-5904: remove sop
ASF subversion and git services (migrated from JIRA)
Commit 1620777 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620777
LUCENE-5904: don't parse segments.gen as a segments file and overinflate
ASF subversion and git services (migrated from JIRA)
Commit 1620778 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620778
LUCENE-5904: test segments inflation
ASF subversion and git services (migrated from JIRA)
Commit 1620779 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620779
LUCENE-5904: add tests for segment/gen inflation, fix gen inflation (it always parsed gen of 0), fix off-by-one in gen-inflation (was starting at 2 instead of 1), add robustness to trash files we might be looking at, add minor restriction to segment suffix names so we can always parse generations correctly from them
ASF subversion and git services (migrated from JIRA)
Commit 1620802 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620802
LUCENE-5904: address last nocommit
ASF subversion and git services (migrated from JIRA)
Commit 1620820 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620820
LUCENE-5904: fix some false false
ASF subversion and git services (migrated from JIRA)
Commit 1620852 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620852
LUCENE-5904: merge trunk
Michael McCandless (@mikemccand) (migrated from JIRA)
Patch from diffSources (trunk vs branch); I think it's ready. I ran 52 iters of distributed beasting (all lucene core + modules tests) ...
Michael McCandless (@mikemccand) (migrated from JIRA)
To summarize this issue:
First Rob added nice new evilness to MDW so that sometimes Directory.deleteFile would fail even if Lucene did not have that file open, simulating a virus checker temporarily holding the file open. (Previously this operation would always succeed).
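To make the idea concrete, here is a minimal, self-contained sketch of that kind of evilness. It is not the MockDirectoryWrapper code: VirusCheckingDirectory, its constructor parameters and setVirusCheckerEnabled are hypothetical names, and the in-memory map stands in for a real Directory.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical in-memory directory whose deleteFile sometimes fails on purpose. */
class VirusCheckingDirectory {
  private final Map<String, byte[]> files = new TreeMap<>();
  private final Random random;
  private final double failureRate;
  private boolean virusCheckerEnabled = true;

  /** failureRate is the chance (0.0 to 1.0) that a delete is refused. */
  VirusCheckingDirectory(long seed, double failureRate) {
    this.random = new Random(seed);
    this.failureRate = failureRate;
  }

  void writeFile(String name, byte[] contents) {
    files.put(name, contents.clone());
  }

  boolean fileExists(String name) {
    return files.containsKey(name);
  }

  /** Tests that manipulate files directly can switch the simulation off. */
  void setVirusCheckerEnabled(boolean enabled) {
    this.virusCheckerEnabled = enabled;
  }

  void deleteFile(String name) throws IOException {
    if (!files.containsKey(name)) {
      throw new FileNotFoundException(name);
    }
    // Simulate a scanner holding the file open even though we do not.
    if (virusCheckerEnabled && random.nextDouble() < failureRate) {
      throw new IOException("cannot delete file: " + name + ", a virus scanner has it open");
    }
    files.remove(name);
  }

  public static void main(String[] args) {
    VirusCheckingDirectory dir = new VirusCheckingDirectory(42L, 0.5);
    dir.writeFile("_0.si", new byte[] {1, 2, 3});
    try {
      dir.deleteFile("_0.si");
      System.out.println("deleted _0.si");
    } catch (IOException e) {
      System.out.println("delete failed: " + e.getMessage());
    }
  }
}
```

Seeding the Random keeps a failure reproducible, in the same spirit as the -Dtests.seed values in the reports above.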
But this new evilness uncovered a nasty corruption case in Lucene, whereby 1) an unclean shutdown of a previous IW left some "future" segment files in the index, e.g. _5.pos, 2) the new IW starts up and identifies this file as not being referenced and immediately tries to delete it, but 3) the virus checker prevents _5.pos being deleted on init. Normally this is "ok": IW records that this file needs deleting but failed last time and so it periodically retries.
The problem is, when the IW goes and flushes a few segments, it may now in fact overwrite _5.pos with a "real" one, which may succeed (if virus checker is done with that file), and then later when IW retries its deletes, it removes _5.pos, corrupting the index.
I don't know of any actual user cases showing this corruption case ... but it's quite insidious ...
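The branch commits above refer to the fix as "gen inflation". One way to read that is: on startup, look at every file already in the directory and push the segment counter past the highest segment number seen, so a leftover _5.pos can never share a name with a newly flushed segment. The sketch below is a simplified illustration under that reading. SegmentCounterInflation and its methods are hypothetical names, and parsing the number after the underscore in base 36 is an assumption about how segment names are formed.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: start the segment counter above any leftover segment file. */
class SegmentCounterInflation {
  /** Returns the segment number in names like "_5.pos" or "_a_1.del", or -1 if none. */
  static long parseSegmentNumber(String fileName) {
    if (!fileName.startsWith("_")) {
      return -1; // e.g. segments_1, write.lock
    }
    int end = 1;
    while (end < fileName.length()
        && Character.digit(fileName.charAt(end), Character.MAX_RADIX) >= 0) {
      end++;
    }
    if (end == 1) {
      return -1; // nothing parseable after the underscore
    }
    try {
      return Long.parseLong(fileName.substring(1, end), Character.MAX_RADIX);
    } catch (NumberFormatException e) {
      return -1; // trash file whose "number" does not fit in a long
    }
  }

  /** Picks a counter that no existing file name can collide with. */
  static long inflateCounter(Iterable<String> fileNames, long recordedCounter) {
    long counter = recordedCounter;
    for (String name : fileNames) {
      long seen = parseSegmentNumber(name);
      if (seen >= 0) {
        counter = Math.max(counter, seen + 1);
      }
    }
    return counter;
  }

  public static void main(String[] args) {
    List<String> leftovers = Arrays.asList("segments_1", "_0.si", "_5.pos", "write.lock");
    System.out.println(inflateCounter(leftovers, 1L));
  }
}
```

The try/catch around the parse mirrors the "robustness to trash files" note in the commit log: an unparseable name is ignored rather than allowed to break startup.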
Michael McCandless (@mikemccand) (migrated from JIRA)
New patch, just adding a check & assert in IFD that it should never delete a pending file that has a non-zero refCount.
I think it's ready ... I'll commit later today ...
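A rough sketch of what such a guard could look like follows. It is not the IFD implementation: PendingDeleteTracker, FileDeleter and the method names are hypothetical, and where the patch uses an assert, this sketch skips the file and reports it so the behavior is easy to observe.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a retry list that refuses to delete referenced files. */
class PendingDeleteTracker {
  /** Hypothetical stand-in for the directory's delete operation. */
  interface FileDeleter {
    void delete(String name) throws IOException;
  }

  private final FileDeleter deleter;
  private final Map<String, Integer> refCounts = new HashMap<>();
  private final Set<String> pending = new LinkedHashSet<>();

  PendingDeleteTracker(FileDeleter deleter) {
    this.deleter = deleter;
  }

  void incRef(String name) {
    refCounts.merge(name, 1, Integer::sum);
  }

  boolean isPending(String name) {
    return pending.contains(name);
  }

  /** Tries to delete now; on failure the file goes on the retry list. */
  void tryDelete(String name) {
    try {
      deleter.delete(name);
      pending.remove(name);
    } catch (IOException e) {
      pending.add(name);
    }
  }

  /** Retries pending deletes and returns the files skipped because they are referenced. */
  List<String> retryPendingDeletes() {
    List<String> skipped = new ArrayList<>();
    for (String name : new ArrayList<>(pending)) {
      if (refCounts.getOrDefault(name, 0) > 0) {
        // The name now belongs to a live segment: deleting it would corrupt the index.
        pending.remove(name);
        skipped.add(name);
      } else {
        tryDelete(name);
      }
    }
    return skipped;
  }

  public static void main(String[] args) {
    Set<String> files = new HashSet<>();
    files.add("_5.pos");
    boolean[] heldOpen = {true};
    PendingDeleteTracker tracker = new PendingDeleteTracker(name -> {
      if (heldOpen[0]) {
        throw new IOException("held open: " + name);
      }
      files.remove(name);
    });
    tracker.tryDelete("_5.pos"); // fails, goes on the retry list
    tracker.incRef("_5.pos");    // a new segment reuses the name
    heldOpen[0] = false;
    System.out.println("skipped: " + tracker.retryPendingDeletes());
  }
}
```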
ASF subversion and git services (migrated from JIRA)
Commit 1621389 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1621389
LUCENE-5904: fix corruption case caused by virus checker after an unclean IW shutdown
ASF subversion and git services (migrated from JIRA)
Commit 1621392 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1621392
LUCENE-5904: fix corruption case caused by virus checker after an unclean IW shutdown
ASF subversion and git services (migrated from JIRA)
Commit 1621421 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1621421
LUCENE-5904: fix test to make empty initial commit
ASF subversion and git services (migrated from JIRA)
Commit 1621422 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1621422
LUCENE-5904: fix test to make empty initial commit
ASF subversion and git services (migrated from JIRA)
Commit 1621423 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1621423
LUCENE-5904: fix redundant cast warning
ASF subversion and git services (migrated from JIRA)
Commit 1621424 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1621424
LUCENE-5904: fix redundant cast warning
ASF subversion and git services (migrated from JIRA)
Commit 1621428 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1621428
LUCENE-5904: fix false fail
ASF subversion and git services (migrated from JIRA)
Commit 1621429 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1621429
LUCENE-5904: fix false fail
Michael McCandless (@mikemccand) (migrated from JIRA)
Reopen to backport to 4.10.1
ASF subversion and git services (migrated from JIRA)
Commit 1626204 from @mikemccand in branch 'dev/branches/lucene_solr_4_10' https://svn.apache.org/r1626204
LUCENE-5904: backport to 4.10.1
Michael McCandless (@mikemccand) (migrated from JIRA)
Bulk close for Lucene/Solr 4.10.1 release
IndexWriter has logic to handle the case where it can't delete a file (it puts the file in a retry list and IndexFileDeleter will periodically retry; you can force this retry with deletePendingFiles).
But from what I can tell, this logic is incomplete, e.g. it's not properly handled during CFS creation, so if a file temporarily can't be deleted, things like flush will fail.
Migrated from LUCENE-5904 by Robert Muir (@rmuir), resolved Sep 19 2014 Attachments: LUCENE-5904.patch (versions: 7)