apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add MDW.enableVirusScanner / fix windows handling bugs [LUCENE-5904] #6966

Closed asfimport closed 10 years ago

asfimport commented 10 years ago

IndexWriter has logic to handle the case where it can't delete a file: it puts the file in a retry list and IndexFileDeleter periodically retries the delete (you can force this retry with deletePendingFiles).

But from what I can tell, this logic is incomplete, e.g. it's not properly handled during CFS creation, so if a file temporarily can't be deleted, things like flush will fail.
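To make the retry-list idea concrete, here is a minimal sketch. It is not Lucene's IndexFileDeleter: the class `PendingDeletes`, its methods, and the `tryDelete` callback are all hypothetical names used only for illustration. It assumes a delete either succeeds or fails outright, and that failed names are simply remembered for a later retry.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch, not Lucene code: remember failed deletes and retry them later.
class PendingDeletes {
    private final Set<String> pending = new HashSet<>();
    // Assumption: returns true if the underlying delete succeeded, false otherwise.
    private final Predicate<String> tryDelete;

    PendingDeletes(Predicate<String> tryDelete) {
        this.tryDelete = tryDelete;
    }

    // Attempt a delete; on failure, queue the name for a later retry.
    void delete(String name) {
        if (tryDelete.test(name)) {
            pending.remove(name);
        } else {
            pending.add(name);
        }
    }

    // Retry every delete that failed earlier (similar in spirit to forcing a retry).
    void retryPending() {
        for (String name : new ArrayList<>(pending)) {
            delete(name);
        }
    }

    // Snapshot of the names still waiting to be deleted.
    Set<String> pendingNames() {
        return new HashSet<>(pending);
    }
}
```

The point raised in this issue is that every code path calling a delete needs to go through something like `delete` above; a path that calls the raw delete and lets the exception escape will fail the surrounding operation.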


Migrated from LUCENE-5904 by Robert Muir (@rmuir), resolved Sep 19 2014 Attachments: LUCENE-5904.patch (versions: 7)

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here is a patch that just adds the logic to MDW. Some of the fails are false, e.g. tests that run directly against IFD or the directory (these can just disable the option). But some, e.g. the CFS creation fails, are real.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch fixing 3 bugs so far (Lucene40SIWriter, Lucene46SIWriter, CompoundFileWriter). There might be more bugs: we should review all uses of Directory.deleteFile to make sure we are doing the right thing.

I also fixed up core tests that currently rely on e.g. the unref'ed files check, or that manipulate files directly, so that they disable the option.

I may have made a mistake or uncovered something in the disk full test that I haven't investigated yet:

   [junit4] Suite: org.apache.lucene.index.TestIndexWriterOnDiskFull
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterOnDiskFull -Dtests.method=testImmediateDiskFull -Dtests.seed=2D75D397EE0B3214 -Dtests.locale=ga_IE -Dtests.timezone=Europe/Lisbon -Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   0.20s | TestIndexWriterOnDiskFull.testImmediateDiskFull <<<
   [junit4]    > Throwable #1: java.io.EOFException: read past EOF: RAMInputStream(name=segments_1)
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([2D75D397EE0B3214:BC3341FB920ADCD0]:0)
   [junit4]    >    at org.apache.lucene.store.RAMInputStream.switchCurrentBuffer(RAMInputStream.java:98)
   [junit4]    >    at org.apache.lucene.store.RAMInputStream.readByte(RAMInputStream.java:71)
   [junit4]    >    at org.apache.lucene.store.MockIndexInputWrapper.readByte(MockIndexInputWrapper.java:122)
   [junit4]    >    at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
   [junit4]    >    at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:356)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:463)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:804)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:650)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:459)
   [junit4]    >    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:398)
   [junit4]    >    at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:207)
   [junit4]    >    at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:711)
   [junit4]    >    at org.apache.lucene.index.TestIndexWriterOnDiskFull.testImmediateDiskFull(TestIndexWriterOnDiskFull.java:569)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)

Maybe something isn't quite right in the Windows handling there?

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I like the virus scanner, just doing nothing - only holding files open :-) I wish all virus scanners would do nothing!

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1, this is a nice new evilness for MDW!

But that EOFE is terrifying?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If it's because it writes an incomplete commit and won't fall back to "nothing", maybe instead of weakening the test we can just do an empty commit up front so we still keep coverage.

It's worth it for the disk full tests, I don't really want them lenient.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK the EOFE is "just" because the very first commit is corrupt.

It happens with this seed because MDW throws an exc when IW is writing segments_1, and then IW tries to remove segments_1 and MDW throws another exception (new virus checker in this patch) and so a corrupt segments_1 is left. If there were a prior commit, then at read time we would fall back to it.

So net/net I don't think there's anything to fix here, except +1 to just have the test make an empty first commit (before any MDW exceptions are enabled).
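The fallback behaviour described here can be sketched as follows. This is a simplification under stated assumptions, not the SegmentInfos code: `CommitFallback` and `newestReadable` are hypothetical names, and a commit is modelled as just a generation number plus a flag saying whether its segments_N file can be read.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical model: pick the newest commit that is readable, falling back to older ones.
class CommitFallback {
    // commits maps generation -> true if that segments_N file is readable, false if corrupt.
    static Optional<Long> newestReadable(Map<Long, Boolean> commits) {
        TreeMap<Long, Boolean> sorted = new TreeMap<>(commits);
        for (Long gen : sorted.descendingKeySet()) {
            if (sorted.get(gen)) {
                return Optional.of(gen);
            }
        }
        // No readable commit at all: nothing to fall back to.
        return Optional.empty();
    }
}
```

With only a corrupt segments_1 present, the result is empty, which matches the failing seed. With an empty first commit in place before any exceptions are enabled, a later corrupt commit still leaves generation 1 to fall back to.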

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch. I also fixed a few more false fails.

Still in general, there are interesting failures every time you run core tests with the patch. testThreadInterruptDeadlock got angry because write.lock couldn't be removed, need to investigate that deletion further.

I also haven't looked at this:

   [junit4] Suite: org.apache.lucene.index.TestCodecHoldsOpenFiles
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestCodecHoldsOpenFiles -Dtests.method=test -Dtests.seed=1908AF7C5FA5D64A -Dtests.locale=sl -Dtests.timezone=Asia/Bangkok -Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   0.01s J1 | TestCodecHoldsOpenFiles.test <<<
   [junit4]    > Throwable #1: java.io.FileNotFoundException: segments_1 in dir=RAMDirectory@25ac448 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@150c542d
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([1908AF7C5FA5D64A:915C90A6F159BBB2]:0)
   [junit4]    >    at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:593)
   [junit4]    >    at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:106)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:347)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:458)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:794)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:640)
   [junit4]    >    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:454)
   [junit4]    >    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:398)

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think it's just the long tail left now:

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I fixed the CodecHoldsOpenFiles test. It was buggy before: it might not even ever commit due to close randomization, and the checkIndex at the end would just do nothing.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620340 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620340

LUCENE-5904: make branch

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620341 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620341

LUCENE-5904: current patch

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620342 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620342

LUCENE-5904: MDW confesses when virus checker kicks in, if you run verbose

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620343 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620343

LUCENE-5904: fix false test failure

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620418 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620418

LUCENE-5904: fix false fails

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620421 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620421

LUCENE-5904: fix false fail

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Another real bug, I think, in IndexFileDeleter (found by TestCrash).

IW crashes or something, and we have some leftover files (like _0.si, imagine from an initial empty commit).

When we boot up a new IW, it tries to delete the trash, but for some reason temporarily cannot delete _0.si. Then we go and flush the real segment _0, and only afterwards IFD comes back around and deletes _0.si, which is now a legit file, corrupting the index.

It's caused by the filename reuse problem (#6965).
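The sequence above can be replayed with a toy model. This is only an illustration of the failure mode, assuming the "disk" and the retry list are plain sets of file names; `FilenameReuseDemo` is a hypothetical class and none of this is Lucene code.

```java
import java.util.HashSet;
import java.util.Set;

// Toy replay of the filename reuse problem: a stale pending delete removes a live file.
class FilenameReuseDemo {
    // Returns the files left on the simulated disk after the sequence runs.
    static Set<String> run() {
        Set<String> disk = new HashSet<>();
        Set<String> pendingDeletes = new HashSet<>();

        // 1) A crashed writer left _0.si behind.
        disk.add("_0.si");

        // 2) The new writer tries to delete it, the delete fails, so the name is queued.
        pendingDeletes.add("_0.si");

        // 3) The new writer flushes a real segment _0, reusing the same file name.
        disk.add("_0.si");

        // 4) The queued delete is retried and now succeeds, removing the live file.
        for (String name : pendingDeletes) {
            disk.remove(name);
        }
        return disk;
    }
}
```

After `run()` the simulated disk no longer contains `_0.si`, even though a live segment now depends on it.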

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620451 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620451

LUCENE-5904: improve debuggability on fail

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620575 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620575

LUCENE-5904: fix replicator tests

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620576 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620576

LUCENE-5904: fix false failures

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620580 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620580

LUCENE-5904: fix fails

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620582 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620582

LUCENE-5904: fix false fails

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620584 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620584

LUCENE-5904: fix false fail

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620601 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620601

LUCENE-5904: add explicit test

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620753 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620753

LUCENE-5904: first cut at gen inflation to prevent index corruption when files are re-used after first IW has unclean shutdown and 2nd IW encounters virus checker

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620755 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620755

LUCENE-5904: set MDW back to try to provoke more fails

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620756 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620756

LUCENE-5904: remove sop

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620777 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620777

LUCENE-5904: don't parse segments.gen as a segments file and overinflate

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620778 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620778

LUCENE-5904: test segments inflation

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620779 from @rmuir in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620779

LUCENE-5904: add tests for segment/gen inflation, fix gen inflation (it always parsed gen of 0), fix off-by-one in gen-inflation (was starting at 2 instead of 1), add robustness to trash files we might be looking at, add minor restriction to segment suffix names so we can always parse generations correctly from them
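A rough sketch of what inflating the segment counter could look like, assuming segment file names have the form `_<counter>` followed by `.` or `_`, with the counter written in base 36. `SegmentCounterInflation`, `parseSegmentCounter`, and `inflate` are hypothetical names; the branch's actual code may differ in both structure and details.

```java
import java.util.List;

// Hypothetical sketch: advance the next segment counter past any leftover segment files.
class SegmentCounterInflation {
    // Parse the counter from names like "_5.pos" or "_a_1.del"; return -1 for anything else.
    static long parseSegmentCounter(String fileName) {
        if (!fileName.startsWith("_")) {
            return -1; // e.g. segments_2 or write.lock: not a per-segment file
        }
        int end = 1;
        while (end < fileName.length()
                && fileName.charAt(end) != '.'
                && fileName.charAt(end) != '_') {
            end++;
        }
        try {
            long counter = Long.parseLong(fileName.substring(1, end), Character.MAX_RADIX);
            return counter >= 0 ? counter : -1;
        } catch (NumberFormatException e) {
            return -1; // trash file whose name does not parse: ignore it
        }
    }

    // Next counter to use: at least the committed value, and past every name already on disk.
    static long inflate(long committedCounter, List<String> filesInDirectory) {
        long next = committedCounter;
        for (String fileName : filesInDirectory) {
            next = Math.max(next, parseSegmentCounter(fileName) + 1);
        }
        return next;
    }
}
```

For example, with a committed counter of 3 and a leftover `_5.pos` in the directory, `inflate` returns 6, so a newly flushed segment cannot reuse the name `_5`.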

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620802 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620802

LUCENE-5904: address last nocommit

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620820 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620820

LUCENE-5904: fix some false false

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1620852 from @mikemccand in branch 'dev/branches/lucene5904' https://svn.apache.org/r1620852

LUCENE-5904: merge trunk

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch from diffSources (trunk vs branch); I think it's ready. I ran 52 iters of distributed beasting (all lucene core + modules tests) ...

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

To summarize this issue:

First Rob added nice new evilness to MDW so that sometimes Directory.deleteFile would fail even if Lucene did not have that file open, simulating a virus checker temporarily holding the file open. (Previously this operation would always succeed).
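As an illustration of this kind of test evilness, here is a minimal stand-in for a directory whose deletes sometimes fail. It is not MockDirectoryWrapper: `FlakyDeleteDirectory`, its constructor parameters, and its method names are hypothetical, and the failure rate and seed are assumptions chosen to make the behaviour reproducible.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical stand-in: an in-memory "directory" whose deleteFile randomly fails.
class FlakyDeleteDirectory {
    private final Set<String> files = new HashSet<>();
    private final Random random;
    private final double failureRate; // assumption: probability that one delete fails
    private boolean virusScannerEnabled = true;

    FlakyDeleteDirectory(long seed, double failureRate) {
        this.random = new Random(seed);
        this.failureRate = failureRate;
    }

    // Tests that inspect or manipulate files directly can switch the failures off.
    void setEnableVirusScanner(boolean enabled) {
        this.virusScannerEnabled = enabled;
    }

    void createFile(String name) {
        files.add(name);
    }

    boolean exists(String name) {
        return files.contains(name);
    }

    // Fails at random, as if another process were temporarily holding the file open.
    void deleteFile(String name) throws IOException {
        if (!files.contains(name)) {
            throw new IOException("no such file: " + name);
        }
        if (virusScannerEnabled && random.nextDouble() < failureRate) {
            throw new IOException("cannot delete " + name + ": simulated virus scanner has it open");
        }
        files.remove(name);
    }
}
```

A caller that treats a failed delete as fatal will break under this wrapper, which is the class of bug the new option is meant to expose.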

But this new evilness uncovered a nasty corruption case in Lucene, whereby 1) an unclean shutdown of a previous IW left some "future" segment files in the index, e.g. _5.pos, 2) the new IW starts up and identifies this file as not being referenced and immediately tries to delete it, but 3) the virus checker prevents _5.pos being deleted on init. Normally this is "ok": IW records that this file needs deleting but failed last time and so it periodically retries.

The problem is, when the IW goes and flushes a few segments, it may now in fact overwrite _5.pos with a "real" one, which may succeed (if virus checker is done with that file), and then later when IW retries its deletes, it removes _5.pos, corrupting the index.

I don't know of any actual user cases showing this corruption case ... but it's quite insidious ...

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, just adding a check & assert in IFD that it should never delete a pending file that has a non-zero refCount.
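The idea behind that check can be sketched like this. It is an assumption-laden illustration rather than the committed code: `GuardedRetry` and its fields are hypothetical, reference counts are modelled as a plain map, and the check throws `IllegalStateException` where the real code may assert instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: refuse to delete a pending file that is referenced again.
class GuardedRetry {
    final Map<String, Integer> refCounts = new HashMap<>();
    final Set<String> pending = new HashSet<>();
    final Set<String> disk = new HashSet<>();

    // Retry queued deletes, but never remove a file with a non-zero reference count.
    void retryPending() {
        for (String name : new ArrayList<>(pending)) {
            if (refCounts.getOrDefault(name, 0) > 0) {
                throw new IllegalStateException(
                        "pending delete of live file " + name + " (refCount > 0)");
            }
            disk.remove(name);
            pending.remove(name);
        }
    }
}
```

A guard like this turns a silent corruption into an immediate, visible failure if a queued name ever becomes live again.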

I think it's ready ... I'll commit later today ...

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1621389 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1621389

LUCENE-5904: fix corruption case caused by virus checker after an unclean IW shutdown

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1621392 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1621392

LUCENE-5904: fix corruption case caused by virus checker after an unclean IW shutdown

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1621421 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1621421

LUCENE-5904: fix test to make empty initial commit

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1621422 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1621422

LUCENE-5904: fix test to make empty initial commit

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1621423 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1621423

LUCENE-5904: fix redundant cast warning

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1621424 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1621424

LUCENE-5904: fix redundant cast warning

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1621428 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1621428

LUCENE-5904: fix false fail

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1621429 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1621429

LUCENE-5904: fix false fail

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Reopen to backport to 4.10.1

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1626204 from @mikemccand in branch 'dev/branches/lucene_solr_4_10' https://svn.apache.org/r1626204

LUCENE-5904: backport to 4.10.1

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Bulk close for Lucene/Solr 4.10.1 release