apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.58k stars 1.01k forks source link

DirectoryReader.isCurrent might fail to see the segments file during concurrent index changes [LUCENE-2585] #3659

Open asfimport opened 14 years ago

asfimport commented 14 years ago

I could reproduce the issue several times but only by running long and stressfull benchmarks, the high number of files is likely part of the scenario. All tests run on local disk, using ext3.

Sample stacktrace:

java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/home/sanne/infinispan-41/lucene-directory/tempIndexName: files:
_2l3.frq _uz.fdt _1q4.fnm _1q0.fdx _4bc.fdt _v2.tis _4ll.fdx _2l8.tii _ux.fnm _3g7.fdx _4bb.tii _4bj.prx _uy.fdx _3g7.prx _2l7.frq _2la.fdt _3ge.nrm _2l6.prx 
_1py.fdx _3g6.nrm _v0.prx _4bi.tii _2l2.tis _v2.fdx _2l3.nrm _2l8.fnm _4bg.tis _2la.tis _uu.fdx _3g6.fdx _1q3.frq _2la.frq _4bb.tis _3gb.tii _1pz.tis 
_2lb.nrm _4lm.nrm _3g9.tii _v0.fdt _2l5.fnm _v2.prx _4ll.tii _4bd.nrm _2l7.fnm _2l4.nrm _1q2.tis _3gb.fdx _4bh.fdx _1pz.nrm _ux.fdx _ux.tii _1q6.nrm 
_3gf.fdx _4lk.fdt _3gd.nrm _v3.fnm _3g8.prx _1q2.nrm _4bh.prx _1q0.frq _ux.fdt _1q7.fdt _4bb.fnm _4bf.nrm _4bc.nrm _3gb.fdt _4bh.fnm _2l5.tis 
_1pz.fnm _1py.fnm _3gc.fnm _2l2.prx _2l4.frq _3gc.fdt _ux.tis _1q3.prx _2l7.fdx _4bj.nrm _4bj.fdx _4bi.tis _3g9.prx _1q4.prx _v3.fdt _1q3.fdx _2l9.fdt 
_4bh.tis _3gb.nrm _v2.nrm _3gd.tii _2l7.nrm _2lb.tii _4lm.tis _3ga.fdx _1pz.fdt _3g7.fnm _2l3.fnm _4lk.fnm _uz.fnm _2l2.frq _4bd.fdx _1q2.fdt _3g7.tis 
_4bi.frq _4bj.frq _2l7.prx _ux.prx _3gd.fnm _1q4.fdt _1q1.fdt _v1.fnm _1py.nrm _3gf.nrm _4be.fdt _1q3.tii _1q1.prx _2l3.fdt _4lk.frq _2l4.fdx _4bd.fnm 
_uw.frq _3g8.fdx _2l6.tii _1q5.frq _1q5.tis _3g8.nrm _uw.nrm _v0.tii _v2.fdt _2l7.fdt _v0.tis _uy.tii _3ge.tii _v1.tii _3gb.tis _4lm.fdx _4bc.fnm _2lb.frq 
_2l6.fnm _3g6.tii _3ge.prx _uu.frq _1pz.fdx _1q2.fnm _4bi.prx _3gc.frq _2l9.tis _3ge.fdt _uy.fdt _4ll.fnm _3gc.prx _1q7.tii _2l5.nrm _uy.nrm _uv.frq 
_1q6.frq _4ba.tis _3g9.tis _4be.nrm _4bi.fnm _ux.frq _1q1.fnm _v0.fnm _2l4.fnm _4ba.fnm _4be.tis _uz.prx _1q6.fdx _uw.tii _2l6.nrm _1pz.prx _2l7.tis 
_1q7.fdx _2l9.tii _4lk.tii _uz.frq _3g8.frq _4bb.prx _1q5.tii _1q5.prx _v2.frq _4bc.tii _1q7.prx _v2.tii _2lb.tis _4bi.fdt _uv.nrm _2l2.fnm _4bd.tii _1q7.tis 
_4bg.fnm _3ga.frq _uu.fnm _2l9.fnm _3ga.fnm _uw.fnm _1pz.frq _1q1.fdx _3ge.fdx _2l3.prx _3ga.nrm _uv.fdt _4bb.nrm _1q7.fnm _uv.tis _3gb.fnm 
_2l6.tis _1pz.tii _uy.fnm _3gf.fdt _3gc.nrm _4bf.tis _1q5.fnm _uu.tis _4bh.tii _2l5.fdt _1q6.tii _4bc.tis _3gc.tii _3g9.fnm _2l6.fdt _4bj.fnm _uu.tii _v3.frq 
_3g9.fdx _v0.nrm _2l7.tii _1q0.fdt _3ge.fnm _4bf.fdt _1q6.prx _uz.nrm _4bi.fdx _3gf.fnm _4lm.frq _v0.fdx _4ba.fdt _1py.tii _4bf.tii _uw.fdx _2l5.frq 
_3g9.nrm _v1.fdt _uw.fdt _4bd.frq _4bg.prx _3gd.tis _1q4.tis _2l9.nrm _2la.nrm _v3.tii _4bf.prx _1q1.nrm _4ba.tii _3gd.fdx _1q4.tii _4lm.tii _3ga.tis 
_4bf.fnm write.lock _2l8.prx _2l8.fdt segments.gen _2lb.fnm _2l4.fdt _1q2.prx _4be.fnm _3gf.prx _2l6.fdx _3g6.fnm _4bb.fdt _4bd.tis _4lk.nrm _2l5.fdx 
_2la.tii _4bd.prx _4ln.fnm _3gf.tis _4ba.nrm _v3.prx _uv.prx _1q3.fnm _3ga.tii _uz.tii _3g9.frq _v0.frq _3ge.tis _3g6.tis _4ln.prx _3g7.tii _3g8.fdt 
_3g7.nrm _3ga.prx _2l2.fdx _2l8.fdx _4ba.prx _1py.frq _uz.fdx _2l3.tii _3g6.prx _v3.fdx _1q6.fdt _v1.nrm _2l2.tii _1q0.tis _4ba.fdx _4be.tii _4ba.frq 
_4ll.fdt _4bh.nrm _4lm.fdt _1q7.frq _4lk.tis _4bc.frq _1q6.fnm _3g7.frq _uw.tis _3g8.tis _2l9.fdx _2l4.tii _1q4.fdx _4be.prx _1q3.nrm _1q0.tii _1q0.fnm 
_v3.nrm _1py.tis _3g9.fdt _4bh.fdt _4ll.nrm _4lk.prx _3gd.prx _1q3.tis _1q2.tii _2l2.nrm _3gd.fdt _2l3.fdx _3g6.fdt _3gd.frq _1q1.tis _4bb.fdx _1q2.frq 
_1q3.fdt _v1.tis _2l8.frq _3gc.fdx _1q1.frq _4bg.frq _4bb.frq _2la.fdx _2l9.frq _uy.tis _uy.prx _4bg.fdx _3gb.prx _uy.frq _1q2.fdx _4lm.prx _2la.prx 
_2l4.prx _4bg.fdt _4be.frq _1q7.nrm _2l5.prx _4bf.frq _v1.prx _4bd.fdt _2l9.prx _1q6.tis _3g8.fnm _4ln.tis _2l3.tis _4bc.fdx _2lb.prx _3gb.frq _3gf.frq 
_2la.fnm _3ga.fdt _uz.tis _4bg.nrm _uv.tii _4bg.tii _3g8.tii _4ll.frq _uv.fnm _2l8.tis _2l8.nrm _2l2.fdt _4bj.tis _4lk.fdx _uw.prx _4bc.prx _4bj.fdt _4be.fdx 
_1q4.frq _uu.fdt _1q1.tii _2l5.tii _2lb.fdt _4bh.frq _3ge.frq _1py.prx _1q5.nrm _v1.fdx _3g7.fdt _4ln.fdt _1q4.nrm _1py.fdt _3gc.tis _4ll.prx _v3.tis _4bf.fdx 
_1q5.fdx _1q0.prx _4bi.nrm _4ll.tis _2l4.tis _3gf.tii _v2.fnm _uu.nrm _1q0.nrm _4lm.fnm _uu.prx _2l6.frq _4ln.nrm _ux.nrm _3g6.frq _1q5.fdt _4bj.tii 
_2lb.fdx _uv.fdx _v1.frq
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:634)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:517)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:306)
        at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:408)
        at org.apache.lucene.index.DirectoryReader.isCurrent(DirectoryReader.java:797)
        at org.apache.lucene.index.DirectoryReader.doReopenNoWriter(DirectoryReader.java:407)
        at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:386)
        at org.apache.lucene.index.DirectoryReader.reopen(DirectoryReader.java:348)
        at org.infinispan.lucene.profiling.LuceneReaderThread.refreshIndexReader(LuceneReaderThread.java:79)
        at org.infinispan.lucene.profiling.LuceneReaderThread.testLoop(LuceneReaderThread.java:60)
        at org.infinispan.lucene.profiling.LuceneUserThread.run(LuceneUserThread.java:60)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

Migrated from LUCENE-2585 by Sanne Grinovero, updated Oct 17 2016 Linked issues:

asfimport commented 14 years ago

Sanne Grinovero (migrated from JIRA)

I'm going to see if I can contribute a patch myself, but I don't think I'll be able to provide a unit test.

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Background: via irc we brainstormed that the most likely cause of the exception was that listing the files in a directory is probably not atomic (at the JVM level) - hence it's possible to miss the segments file in a rapidly changing index. The simplest fix would seem to be to retry the directory listing a few times if the segments file isn't found.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Man it's hard to add a comment here. Had to scroll way to the right... silly Jira.

The best guess here is that this is due to non-atomicity of listing a directory right? Ie, Lucene, in order to find the most recent segments_N file, lists the directory. But if, as the listing is happening, a commit is done from IndexWriter, writing a new segments_N+1 and removing the old one, it's possible that the directory listing would show no segments file.

Lucene then falls back to reading segments.gen, but somehow this is also stale/unusable (probably because another commit kicked off after the dir listing and before we could read segments.gen).

I'm not sure offhand how we can fix this...

Can you describe the stress test you're running?

asfimport commented 14 years ago

Sanne Grinovero (migrated from JIRA)

sure, the test is totally open source; the directory implementation based on Infinispan is hosted as submodule of Infinispan: http://anonsvn.jboss.org/repos/infinispan/branches/4.1.x/lucene-directory/

The test is org.infinispan.lucene.profiling.PerformanceCompareStressTest

it is included in the default test suite but disabled in Maven's configuration, so you should run it manually mvn clean test -Dtest=PerformanceCompareStressTest (running it requires the jboss.org repositories to be enabled in maven settings)

To describe it at higher level: there are 5 IndexRead-ing threads using reopen() before each search, 2 threads writing to the index, 1 additional thread as a coordinator and asserting that readers find what they expect to see in the index. Exactly the same test scenario is then applied in sequence to RAMDirectory (not having issues), NIOFSDirectory, and 4 differently configured Infinispan directories. Only the FSDirectory is affected by the issue, and it can never complete the full hour of stresstest succesfully, while all other implementations behave fine.

IndexWriter is set to MaxMergeDocs(5000) and setUseCompoundFile(false); the issue is reveled both using SerialMergeScheduler and while using the default merger.

During the last execution the test managed to perform 22,192,006 searches and 26,875 writes before hitting the exceptional case.

If you deem it useful I'd be happy in contributing a similar testcase to Lucene, but I assume you won't be excited in having such a long running test. Open to ideas to build a simpler one.

asfimport commented 14 years ago

Sanne Grinovero (migrated from JIRA)

reformatted the description: all filenames where on the same line making this page hard to use.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks for the details Sanne! Your Infinispan directories sounds interesting.

http://fixunix.com/linux/356378-opendir-readdir-atomicity.html is relevant, assuming the JVM is using opendir/readdir on Linux (which I assume it is?).

Basically Posix makes no guarantee that opendir/readir will see a "point in time" directory listing. Ie, file add/deletes can be seen out-of-order, much like write operations in different threads in Java if you don't sync.

So maybe we should add an additional retry cycle in the case that we don't find a segments file? (We already have various retries if we do see a segments file in the listing, but, we hit an IOExc when trying to load it).

Sanne do you want to work out a patch?

asfimport commented 13 years ago

Sanne Grinovero (migrated from JIRA)

Hello, sorry for the late answer, for some reason I didn't see the notification.

Sure I'm very interested in providing a patch for this. I've to say I was only able to reproduce this issue in synthetic benchmarks, so I thought it might be very unlikely in real world scenarios, but now I actually received reports of people having issues with this during real use cases, so I'll definitely give another look.

Thanks for the pointers!

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

moving out... there is no patch

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

There is no patch, moving to 3.2

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

bulk move 3.2 -> 3.3

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

asfimport commented 9 years ago

Ashutosh Deshpande (migrated from JIRA)

I seem to be hitting this issue in my application. The simplified use case is: we have about 1000 new documents being processed simultaneously by multiple threads and multiple threads potentially try to store/read the same document, but with different data. One thread searches, finds nothing and updates the document; and at the same time other thread tries to search for it before update and ends up not finding it, thus creating a duplicate document with different data instead of having a single merged document. This happens very frequently and we end up having over 50% duplicates at times. This causes an issue in search.

Is there any fix available for this issue?

Thanks and Regards, Ashutosh.

asfimport commented 7 years ago

Han-Wen NIenhuys (migrated from JIRA)

I've seen race condition diagnostics for Lucene code that could be related to this.

Line numbers are for Lucene 5.5.2, but the problem seems present in master too.

AFAICT: StandardDirectoryReader has an IndexWriter, and it checks things are up-to-date by checking if there are pending in-memory updates by looking at IndexWriter.SegmentInfos.version. That read is not synchronized relative to its write from IndexWriter.updateDocument

WARNING: ThreadSanitizer: data race (pid=21636) Write of size 8 at 0x7f24ad9d3210 by thread T36 (mutexes: write M431922409782456832): #0 org.apache.lucene.index.SegmentInfos.changed()V (SegmentInfos.java:944)
#1 org.apache.lucene.index.IndexWriter.newSegmentName()Ljava/lang/String; (IndexWriter.java:1652)
#2 org.apache.lucene.index.DocumentsWriter.ensureInitialized(Lorg/apache/lucene/index/DocumentsWriterPerThreadPool$ThreadState;)V (DocumentsWriter.java:391)
#3 org.apache.lucene.index.DocumentsWriter.updateDocument(Ljava/lang/Iterable;Lorg/apache/lucene/analysis/Analyzer;Lorg/apache/lucene/index/Term;)Z (DocumentsWriter.java:445)
#4 org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Ljava/lang/Iterable;)V (IndexWriter.java:1477)
#5 com.google.gerrit.lucene.AutoCommitWriter.updateDocument(Lorg/apache/lucene/index/Term;Ljava/lang/Iterable;)V (AutoCommitWriter.java:100)
#6 org.apache.lucene.index.TrackingIndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Ljava/lang/Iterable;)J (TrackingIndexWriter.java:55)
#7 com.google.gerrit.lucene.AbstractLuceneIndex$4.call()Ljava/lang/Long; (AbstractLuceneIndex.java:250)
#8 com.google.gerrit.lucene.AbstractLuceneIndex$4.call()Ljava/lang/Object; (AbstractLuceneIndex.java:247)
#9 com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly()V (TrustedListenableFutureTask.java:108)
#10 com.google.common.util.concurrent.InterruptibleTask.run()V (InterruptibleTask.java:41)
#11 com.google.common.util.concurrent.TrustedListenableFutureTask.run()V (TrustedListenableFutureTask.java:77)
#12 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1142)
#13 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:617)
#14 java.lang.Thread.run()V (Thread.java:745)
#15 (Generated Stub)

Previous read of size 8 at 0x7f24ad9d3210 by thread T29 (mutexes: write M1060737507754061632): #0 org.apache.lucene.index.IndexWriter.nrtIsCurrent(Lorg/apache/lucene/index/SegmentInfos;)Z (IndexWriter.java:4592)
#1 org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(Lorg/apache/lucene/index/IndexCommit;)Lorg/apache/lucene/index/DirectoryReader; (StandardDirectoryReader.java:282)
#2 org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(Lorg/apache/lucene/index/IndexCommit;)Lorg/apache/lucene/index/DirectoryReader; (StandardDirectoryReader.java:261)
#3 org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged()Lorg/apache/lucene/index/DirectoryReader; (StandardDirectoryReader.java:251)
#4 org.apache.lucene.index.DirectoryReader.openIfChanged(Lorg/apache/lucene/index/DirectoryReader;)Lorg/apache/lucene/index/DirectoryReader; (DirectoryReader.java:137)
#5 com.google.gerrit.lucene.WrappableSearcherManager.refreshIfNeeded(Lorg/apache/lucene/search/IndexSearcher;)Lorg/apache/lucene/search/IndexSearcher; (WrappableSearcherManager.java:148)
#6 com.google.gerrit.lucene.WrappableSearcherManager.refreshIfNeeded(Ljava/lang/Object;)Ljava/lang/Object; (WrappableSearcherManager.java:68)
#7 org.apache.lucene.search.ReferenceManager.doMaybeRefresh()V (ReferenceManager.java:176)
#8 org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking()V (ReferenceManager.java:253)
#9 org.apache.lucene.search.ControlledRealTimeReopenThread.run()V (ControlledRealTimeReopenThread.java:245)
#10 (Generated Stub)

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Han-Wen NIenhuys, I think you are correct. How would you suggest we fix it? Could you open a new issue to track this? Thanks.

asfimport commented 7 years ago

Han-Wen NIenhuys (migrated from JIRA)

opened #8550 - I know too little of lucene to give a good answer for fixing this, but I gave a suggestion anyway.