New ReplicatingDirectory mirrors index files to HDFS [LUCENE-4731]

asfimport commented 11 years ago

I've been working on a Directory implementation that mirrors the index files to HDFS (or other Hadoop supported FileSystem).

A ReplicatingDirectory delegates all calls to an underlying Directory (supplied in the constructor). The only hooks are the deleteFile and sync calls. We submit deletes and replications to a single scheduler thread to keep things serialized. During a sync call, if "segments.gen" is seen in the list of files, we know a commit is finishing. After calling the delegate's sync method, we initiate an asynchronous replication as follows.

Read segments.gen (before leaving ReplicatingDirectory#sync), save the values for later
Get a list of local files from ReplicatingDirectory#listAll before leaving ReplicatingDirectory#sync
Submit replication task (DirectoryReplicator) to scheduler thread
Compare local files to remote files, determine which remote files get deleted, and which need to get copied
Submit a thread to copy each file (one thread per file)
Submit a thread to delete each file (one thread per file)
Submit a "finalizer" thread. This thread waits on the previous two batches of threads to finish. Once finished, this thread generates a new "segments.gen" remotely (using the version and generation number previously read in).
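
The comparison step in the list above can be sketched as a pair of set differences. This is only an illustration, not the attached class: `ReplicationPlan` and its fields are hypothetical names, the comparison is by file name only (which assumes index files other than segments.gen are written once and never modified), and leaving segments.gen out of both sets is an assumption based on the finalizer step regenerating it remotely.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: diffs a local file listing against a remote one. */
final class ReplicationPlan {
    final Set<String> toCopy;    // present locally, missing remotely
    final Set<String> toDelete;  // present remotely, no longer present locally

    private ReplicationPlan(Set<String> toCopy, Set<String> toDelete) {
        this.toCopy = toCopy;
        this.toDelete = toDelete;
    }

    static ReplicationPlan between(Set<String> localFiles, Set<String> remoteFiles) {
        Set<String> toCopy = new TreeSet<>(localFiles);
        toCopy.removeAll(remoteFiles);

        Set<String> toDelete = new TreeSet<>(remoteFiles);
        toDelete.removeAll(localFiles);

        // Assumption: segments.gen is never copied or deleted directly,
        // because the finalizer step writes a fresh one remotely.
        toCopy.remove("segments.gen");
        toDelete.remove("segments.gen");
        return new ReplicationPlan(toCopy, toDelete);
    }
}
```

Each entry in `toCopy` and `toDelete` would then become one of the per-file tasks described above.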

I have no idea where this would belong in the Lucene project, so I'll just attach the standalone class instead of a patch. It introduces dependencies on Hadoop core (and all the deps that brings with it).

Migrated from LUCENE-4731 by David Arthur (@mumrah), updated May 09 2016 Attachments: ReplicatingDirectory.java

asfimport commented 11 years ago

David Arthur (@mumrah) (migrated from JIRA)

Attaching ReplicatingDirectory.java

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Why do you need such a Directory implementation? HDFS already does replication (unless you turn it off), so I wonder what this replication gives you that HDFS replication doesn't.

asfimport commented 11 years ago

David Arthur (@mumrah) (migrated from JIRA)

The idea is to have a mirror of your index on a remote HDFS.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Did you take a look at #3706 (TeeDirectory)? I think it's similar to what you need? Perhaps you can compare the two?

Hmmm .. but the approach you've taken here is different. While TeeDirectory mimics Unix's "tee" and forwards calls to two directories, ReplicatingDirectory implements ... replication.

I would not implement replication at the level of Directory, and rely on things like "when segments.gen is seen we know commit happened". It sounds too fragile of a protocol to me.

Perhaps instead you can think of a replication module which lets a producer publish IndexCommits whenever it calls commit(), and consumers can periodically poll the replicator for updates, giving it their current state. When an update is available, they do the replication? Or something along those lines? IndexCommits are much more "official" to rely on than the fragile algorithm you describe. For example, you can use SnapshotDeletionPolicy to hold onto IndexCommits that are currently being replicated, which will prevent the deletion of their files. Whereas in your algorithm, if two commits are called close to each other, one thread could start a replication action, while the next commit will delete the files in the middle of the copy, or just delete some of the files that haven't been copied yet.
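
To make the idea of holding onto a commit concrete, here is a toy model of pinning a commit's files while they are being replicated. It is not Lucene's SnapshotDeletionPolicy and uses none of its API: `PinnedFiles`, `snapshot`, `release` and `mayDelete` are hypothetical names, and a reference count per file name stands in for whatever bookkeeping a real deletion policy does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy model (not a Lucene class) of pinning files that belong to a commit. */
final class PinnedFiles {
    // file name -> number of in-flight replications that still need it
    private final Map<String, Integer> pins = new HashMap<>();

    /** Pin every file of a commit before its replication starts. */
    synchronized void snapshot(Set<String> commitFiles) {
        for (String file : commitFiles) {
            pins.merge(file, 1, Integer::sum);
        }
    }

    /** Unpin the commit's files once its replication has finished. */
    synchronized void release(Set<String> commitFiles) {
        for (String file : commitFiles) {
            // Returning null from the remapping function removes the entry.
            pins.computeIfPresent(file, (name, count) -> count == 1 ? null : count - 1);
        }
    }

    /** A deleter would ask this before removing a file. */
    synchronized boolean mayDelete(String file) {
        return !pins.containsKey(file);
    }
}
```

With something like this in place, a later commit that drops a merged-away segment could not remove that segment's files until every replication that pinned them has called `release`.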

I think what we need in Lucene is a Replicator module :).

asfimport commented 11 years ago

David Arthur (@mumrah) (migrated from JIRA)

TeeDirectory was actually the inspiration for this. The primary difference is that I want to asynchronously copy the index files, rather than having two synchronous underlying Directories. The motivating use case for me is that I want to run some Hadoop jobs that make use of my Lucene index, but I don't want to collocate Lucene and Hadoop (sounds like a recipe for bad performance all around). With this async strategy, commits will get to HDFS eventually, and I don't really care how far behind the replica lags, so long as I have a readable commit in HDFS.

Also, regarding push vs pull, I'd rather push from Lucene to avoid having to deal with remote agents pulling.

Whereas in your algorithm, if two commits are called close to each other, one thread could start a replication action, while the next commit will delete the files in the middle of copy, or just delete some of the files that haven't been copied yet.

"Replication actions" and "delete actions" are serialized by a single thread, so they will not be interleaved.
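
The ordering guarantee being claimed here can be illustrated with a single-thread executor, which runs submitted tasks one at a time in submission order. This is a minimal sketch and not the attached class: `SerializedActions` and its methods are hypothetical, and the log entries stand in for the real copy and delete work. Note that the guarantee only covers work done on the scheduler thread itself.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: replication and delete actions share one worker thread. */
final class SerializedActions {
    private final ExecutorService scheduler = Executors.newSingleThreadExecutor();
    final List<String> log = new CopyOnWriteArrayList<>();

    void submitReplication(String commitName) {
        scheduler.execute(() -> {
            log.add("begin replicate " + commitName);
            // Stand-in for comparing listings and copying files.
            log.add("end replicate " + commitName);
        });
    }

    void submitDelete(String fileName) {
        // Stand-in for deleting the remote copy of the file.
        scheduler.execute(() -> log.add("delete " + fileName));
    }

    /** Stops accepting tasks and waits for the queued ones to finish. */
    boolean shutdownAndWait() {
        scheduler.shutdown();
        try {
            return scheduler.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A delete submitted between two replications always lands between their "end" and "begin" entries in the log, never inside one of them.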

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I see, so you only allow one commit at a time. That's not great either ... e.g. if the replicating thread copies a large index commit (due to merges or something), all other processes are stopped until it finishes. This makes indexing on Hadoop even more horrible (if such a thing is possible :)).

You don't have to do pull requests, you can have an agent running on the Hadoop cluster (where MapReduce jobs are run) that will poll the index directory periodically and then push the files to HDFS. The difference is that it will:

Take a snapshot on the index, so those files that it copies won't get deleted for sure.
It does not block indexing operations. If it copies a large index commit, and a few commits are made in parallel by the indexing process, then when the replication process finishes, it will copy a single index commit with all recent changes. That might even make it more efficient.
You don't rely on a fragile algorithm, e.g. the detection of segments.gen.
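
The second point, that several commits made during a long copy collapse into one later replication, can be sketched as follows. This is a toy model with hypothetical names (`CoalescingPoller`, `onCommit`, `pollOnce`); commits are reduced to generation numbers, and adding to the `replicated` list stands in for snapshotting, copying and releasing a real commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of an agent that always replicates the newest commit it sees. */
final class CoalescingPoller {
    // Written by the indexing side, read by the polling side.
    private final AtomicLong latestCommit = new AtomicLong(0);
    // Touched only by the polling thread.
    private long lastReplicated = 0;
    final List<Long> replicated = new ArrayList<>();

    /** Called by the indexer after each commit; never blocks on replication. */
    void onCommit(long generation) {
        latestCommit.accumulateAndGet(generation, Math::max);
    }

    /** One poll: replicate the newest commit if it has not been replicated yet. */
    boolean pollOnce() {
        long target = latestCommit.get();
        if (target <= lastReplicated) {
            return false; // nothing new since the last poll
        }
        replicated.add(target); // stand-in for snapshot, copy, release
        lastReplicated = target;
        return true;
    }
}
```

If commits 2, 3 and 4 all arrive while commit 1 is being copied, the next poll replicates only commit 4, and the intermediate commits are skipped.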

asfimport commented 11 years ago

David Arthur (@mumrah) (migrated from JIRA)

all other processes are stopped until it finishes

Not exactly, just no other replication or delete events will happen

Take a snapshot on the index, so those files that it copies won't get deleted for sure.

Is that what the SnapshotDeletionPolicy does? This does sound more robust than watching for segments.gen - where can I see it in use? Is this what Solr uses for replication?

What would be a recommended mechanism for receiving "push requests" from a remote agent? Does Lucene have any kind of RPC server like Thrift built in? (I imagine not.)

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Not exactly, just no other replication or delete events will happen

Well, in that case you could run into trouble. I.e. imagine two threads, one doing commit() and one doing replication. The commit() thread could be much faster than the replication one. Therefore, it can do commit(#1), and the replication thread starts to replicate that index commit. In the middle, the commit thread does commit(#2), which deletes some files of the previous commit (e.g. due to segment merging), and the replication thread will be left with a corrupt commit ...

Is that what the SnapshotDeletionPolicy does

Yes. You can see how it's used in the tests. Also, here's a thread from the user list with example code: http://markmail.org/message/3novogsi6vcgarur.

I am not sure if Solr uses it, but I think it does. I mean .. it's the "safe" way to replicate/backup your index.

Lucene doesn't have an RPC server built-in .. I wrote a simple Servlet that responds to some REST API to invoke replication ...

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

New ReplicatingDirectory mirrors index files to HDFS [LUCENE-4731] #5796