Closed asfimport closed 14 years ago
Michael McCandless (@mikemccand) (migrated from JIRA)
My best guess on what's happening here is, on one of your Searcher boxes:

- rdist has copied over the new segments file but not yet the actual _1zm.cfs file
- IndexSearcher is re-instantiated at this moment and reads the new segments file
- IndexSearcher then tries to load _1zm.cfs (referenced by the new segments file), but because it does not yet exist (rdist hasn't copied it yet), it falls back to the non-compound format (_1zm.fnm), which also does not exist, and hits that exception
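One way to narrow this particular window, independent of locking, is to copy the segment data files first and the segments file last, so that a new segments file never arrives before the data it references. A minimal sketch of that ordering (the `copy_index` function name and paths are illustrative, not part of any existing tool):

```shell
# Sketch: copy an index so the segments file lands last.
# A searcher that reads the old segments file mid-copy sees only
# files that already exist; the new segments file appears only
# after everything it references is in place.
copy_index() {
  src=$1; dst=$2
  mkdir -p "$dst"
  # 1. Copy every index file except the segments file.
  for f in "$src"/*; do
    [ "$(basename "$f")" = "segments" ] && continue
    cp "$f" "$dst/"
  done
  # 2. Copy the segments file last.
  cp "$src/segments" "$dst/"
}
```

This narrows the race but does not remove it entirely (a reader can still open mid-copy and see a partially written segments file), which is why proper locking, described below, is still needed.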
The one thing that's odd in your traceback above is that line 154 of IndexReader.java is only used when there is more than one segment in your index. Are you allowing rdist to make a copy after IndexWriter has added docs (and closed) but before optimize is called? Otherwise I can't explain why the index on your Searcher box has more than one segment.
Note that there are two lock files on the Writer machine: the write lock, held for a long time (whenever an IndexWriter is open), and the commit lock, held briefly while a new segments file is written.
I think you need to change your approach to use Lucene's locking more correctly:

- On the Writer box, before rdist can run, it must acquire the write lock and hold it for the full duration of the copy. Just checking that the write lock file doesn't exist isn't generally sufficient, because an IndexWriter may wake up and start changing things while your rdist is running (unless that can't happen in your current design, for example if from a single Java process you close the IndexWriter, run rdist, and repeat).
- On each Searcher box, before rdist can copy to it, you need to acquire the commit lock, hold it for the full duration of the copy, then release it. Note that no IndexSearcher (IndexReader) can be instantiated during this time (it will block on acquiring the commit lock until the rdist copy is done).
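The "hold a lock for the full duration of the copy" idea can be sketched in shell. This is an approximation of the concept, not Lucene's actual locking protocol: Lucene's lock files are plain files in a configurable lock directory, whereas the sketch below uses `mkdir`, which is atomic on POSIX filesystems and therefore a safe lock primitive for scripts. The `with_lock` name and lock path are hypothetical:

```shell
# Sketch: run a command while holding a lock, so that nothing else
# (e.g. an IndexReader opening, or another copy) races the copy.
with_lock() {
  lock=$1; shift
  # mkdir either creates the directory (lock acquired) or fails
  # atomically if it already exists (lock held by someone else).
  until mkdir "$lock" 2>/dev/null; do
    sleep 1                     # block until the lock is free
  done
  "$@"                          # run the guarded command, e.g. the copy
  rmdir "$lock"                 # release the lock
}
```

Usage would look something like `with_lock /searcher/index/copy.lock rsync -a master:/index/ /searcher/index/` — again illustrative; in a real deployment the reader side must check the same lock before opening an IndexSearcher, or you gain nothing.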
Note that the Solr project:
http://incubator.apache.org/solr/features.html
http://incubator.apache.org/solr/tutorial.html
has an excellent solution for correctly distributing an index from a single Writer to multiple Searchers (they call it "snapshots"). It also uses rdist to move snapshots around. You might want to try Solr, or perhaps "borrow" its approach, especially the neat "cp -l -r" trick for quickly creating a snapshot of the index on the Writer machine.
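The "cp -l -r" trick works because it builds a new directory tree whose entries are hard links to the live index files: no file data is copied, so the snapshot is near-instant and costs almost no disk space, and the writer can keep modifying the live index (Lucene writes new files rather than rewriting old ones) while rsync/rdist ships the frozen snapshot. A minimal sketch, with an illustrative `snapshot` function and naming scheme:

```shell
# Sketch: take a near-instant "snapshot" of an index directory.
# Every file in the snapshot is a hard link to the live index file,
# so no data is copied and the snapshot is safe to ship with rsync
# while the writer continues to add new files to the live index.
snapshot() {
  index=$1
  snap=$index.snapshot.$$       # naming scheme is illustrative
  cp -l -r "$index" "$snap"     # -l: hard-link files instead of copying
  echo "$snap"
}
```

Note that `cp -l` requires source and destination to be on the same filesystem, since hard links cannot cross filesystems.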
See also this recent thread that touched on similar issues:
http://www.gossamer-threads.com/lists/lucene/java-user/37593
Simon Lorenz (migrated from JIRA)
Hi Michael,
Many thanks for this input. Your comments are very sound and I will look into your suggestions and report back.
Cheers.
Mark Miller (@markrmiller) (migrated from JIRA)
Hey Simon, anything to report back on this issue? I'd like to close it out if you have worked out what happened.
We use Lucene 1.9.1 to create and search indexes for web applications. The application runs in JBoss 4.0.2 on Red Hat ES3. A single Master (Writer) JBoss instance creates and writes the indexes using the compound file format, and the index is optimised after all updates. These index files are replicated every few hours, using rsync, to a number of other application servers (Searchers). The rsync job only runs if there are no Lucene lock files present on the Writer. The Searcher servers that receive the replicated files perform only searches on the index. Up to 60 searches may be performed each minute.
Everything works well most of the time, but we get the following issue on the Searcher servers about 10% of the time. Following an rsync replication, one or more of the Searcher servers throws:
```
IOException caught when creating an IndexSearcher
java.io.FileNotFoundException: /..../_1zm.fnm (No such file or directory)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
        at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
        at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
        at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
        at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
        at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
        at org.apache.lucene.store.Lock$With.run(Lock.java:109)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)
```
As we use the compound file format I would not expect .fnm files to be present. When replicating, we do not delete the old .cfs index files as these could still be referenced by old Searcher threads. We do overwrite the segments and deletable files on the Searcher servers.
My thoughts are: either we are occasionally overwriting a file at the exact moment a new searcher is being created, or the lock files are removed from the Writer server before the compaction process has completed, in which case we replicate a segments file that still references a ghost .fnm file.
I would greatly appreciate any ideas and suggestions to solve this annoying issue.
Migrated from LUCENE-628 by Simon Lorenz, resolved Dec 16 2009
Environment: