apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Migrate HDFSDirectory from solr to lucene-hadoop [LUCENE-6536] #7594

Open asfimport opened 9 years ago

asfimport commented 9 years ago

I am currently working on a search engine that is throughput oriented and works entirely in Apache Spark.

As part of this, I need a Directory implementation that can operate on HDFS directly. This got me thinking: could I reuse the one that was worked on so hard for Solr's Hadoop integration?

As such I migrated the HDFS and blockcache directories out to a lucene-hadoop module.

Having done this work, I am not sure it is actually a good change; it feels a bit messy, and I don't like how the Metrics class gets extended and abused.

Thoughts, anyone?


Migrated from LUCENE-6536 by Greg Bowyer (@GregBowyer) Attachments: LUCENE-6536.patch

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Questions:

What will be done to deal with the bugginess of this thing? I see many reports of user corruption issues. By committing it, we take responsibility for this and it becomes "our problem". I don't want to see the code committed to lucene just for this reason.

What will be done about the performance? I am not really sure the entire technique is viable.

Personally, I think if someone wants to do this, a better integration point is to make it a java 7 filesystem provider. That is really how such a filesystem should work anyway.

asfimport commented 9 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Personally, I think if someone wants to do this, a better integration point is to make it a java 7 filesystem provider. That is really how such a filesystem should work anyway.

I agree. This is how it should be. Once HDFS provides a Java 7 FileSystemProvider SPI (see http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileSystemProvider.html), you would just need to plug the HDFS JAR file into the classpath; then you could create a standard FSDirectory (NIO, Simple, mmap) using Paths.get(URI) and you are done. Not a single line of Lucene code needed. I have no idea why Hadoop does not yet provide a FileSystem implementation for Java 7 (maybe because they are still on Java 6).

I would suggest that you talk with the Hadoop people about doing this (including the block cache, which could be implemented off-heap as a ByteBuffer, like MappedByteBuffer, so it would automatically work with MMapDirectory in Lucene; I don't want us to also take over responsibility for the block cache in Lucene). Or you could start your own project implementing the FileSystemProvider.
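The SPI mechanism Uwe describes can be illustrated with a provider the JDK already ships, the zip/jar filesystem, standing in for a hypothetical HDFS provider (the class and method names below are invented for this sketch):

```java
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class ProviderDemo {
    // Writes and reads a file through a provider-backed Path. With an
    // "hdfs" FileSystemProvider on the classpath, the same pattern would
    // yield a Path that FSDirectory.open(path) could consume directly,
    // with no changes to Lucene itself.
    public static String roundTrip(Path zipFile) throws Exception {
        URI uri = URI.create("jar:" + zipFile.toUri());
        // The env map asks the zip provider to create the archive if absent.
        try (FileSystem fs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            Path inside = fs.getPath("/segment.txt");
            Files.writeString(inside, "hello from a provider-backed path");
            return Files.readString(inside);
        }
    }

    public static void main(String[] args) throws Exception {
        Path zip = Files.createTempDirectory("spi-demo").resolve("index.zip");
        System.out.println(roundTrip(zip));
    }
}
```

The key point is that the calling code never names the provider; FileSystems resolves it from the URI scheme at runtime.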

asfimport commented 9 years ago

Greg Bowyer (@GregBowyer) (migrated from JIRA)

What will be done to deal with the bugginess of this thing? I see many reports of user corruption issues. By committing it, we take responsibility for this and it becomes "our problem". I don't want to see the code committed to lucene just for this reason.

Fix its bugs ;). Joking aside: is it the directory or the blockcache that is the source of most of the corruptions?

What will be done about the performance? I am not really sure the entire technique is viable.

My use case is a bit odd: I have many small (2×HDFS block) indexes that get run over by map jobs in Hadoop. The performance I got last time I did this (with a dirty-hack Directory that copied the files in and out of HDFS :S) was pretty good.

It's a throughput-oriented usage; I think if you tried to use this to back an online searcher you would see poor performance.
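For what it's worth, the copy-in hack Greg describes can be sketched with plain NIO (the `CopyInDirectory` class and its names are invented; a real version would point `remote` at an HDFS location via the HDFS client API or an NIO provider):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class CopyInDirectory {
    // Copies every file in a remote index directory to a local scratch
    // directory, so the local copy can be opened with an ordinary
    // FSDirectory/MMapDirectory. Throughput-oriented jobs pay the copy
    // cost once, then read at local-disk speed.
    public static Path copyIn(Path remote, Path localScratch) throws IOException {
        Files.createDirectories(localScratch);
        try (Stream<Path> files = Files.list(remote)) {
            for (Path f : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(f)) {
                    Files.copy(f, localScratch.resolve(f.getFileName().toString()),
                            StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return localScratch;
    }

    public static void main(String[] args) throws IOException {
        Path remote = Files.createTempDirectory("remote-index");
        Files.writeString(remote.resolve("_0.cfs"), "fake segment data");
        Path local = copyIn(remote, Files.createTempDirectory("scratch").resolve("idx"));
        System.out.println(Files.readString(local.resolve("_0.cfs")));
    }
}
```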

Personally, I think if someone wants to do this, a better integration point is to make it a java 7 filesystem provider. That is really how such a filesystem should work anyway.

That is awesome; I didn't know such an SPI existed in Java. I have found a few people who are trying to make a provider for Hadoop.

I also don't have the greatest love for this patch; the more test manipulations I did, the less it felt like a simple feature that should be in Lucene. I might try to either strip the block cache out of this patch, or use an HDFS filesystem SPI in Java 7.

asfimport commented 9 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

is it the directory or the blockcache that is the source of most of the corruptions

There are two issues:

All in all, the block cache performance is good enough for a ton of use cases, but the overall approach and management of it is not great. The Apache Blur project has made a better version that is better for even more use cases, but it requires Unsafe usage for direct memory access.
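For context, the standard (non-Unsafe) route to off-heap memory in Java is a direct ByteBuffer. A minimal sketch of carving one off-heap allocation into fixed-size cache blocks (the names and block size are illustrative, not Blur's actual design):

```java
import java.nio.ByteBuffer;

public class OffHeapBlock {
    // A direct ByteBuffer lives outside the Java heap, like the cache
    // blocks Blur manages via Unsafe. Slicing one large allocation gives
    // independent per-block views without per-block allocations.
    static final int BLOCK_SIZE = 8192;

    public static ByteBuffer[] carve(int numBlocks) {
        ByteBuffer arena = ByteBuffer.allocateDirect(numBlocks * BLOCK_SIZE);
        ByteBuffer[] blocks = new ByteBuffer[numBlocks];
        for (int i = 0; i < numBlocks; i++) {
            // Each slice covers [i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE).
            arena.position(i * BLOCK_SIZE).limit((i + 1) * BLOCK_SIZE);
            blocks[i] = arena.slice();
            arena.clear();
        }
        return blocks;
    }

    public static void main(String[] args) {
        ByteBuffer[] blocks = carve(4);
        blocks[2].put(0, (byte) 42);
        System.out.println(blocks[2].get(0)); // prints 42
    }
}
```

The trade-off versus Unsafe is less control (no explicit free; the memory is reclaimed when the buffer is garbage collected) in exchange for safety and portability.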

asfimport commented 9 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I have found a few people that are trying to make a provider for hadoop.

Can you list what you have found?

There was some discussion / interest here: HADOOP-3518 - Want NIO.2 (JSR 203) file system provider for Hadoop FileSystem

That leads you to a test impl here: https://github.com/damiencarol/jsr203-hadoop
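Whether such a provider is actually wired in can be checked via the NIO.2 provider registry; a small sketch (the `ProviderLookup` class is invented for illustration):

```java
import java.nio.file.spi.FileSystemProvider;

public class ProviderLookup {
    // NIO.2 discovers providers by URI scheme via the ServiceLoader
    // mechanism. With the jsr203-hadoop JAR on the classpath, an "hdfs"
    // entry would appear here, and Paths.get(URI.create("hdfs://..."))
    // would resolve through it.
    public static boolean hasScheme(String scheme) {
        for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
            if (p.getScheme().equalsIgnoreCase(scheme)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("file provider installed: " + hasScheme("file"));
        System.out.println("hdfs provider installed: " + hasScheme("hdfs"));
    }
}
```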

asfimport commented 9 years ago

Greg Bowyer (@GregBowyer) (migrated from JIRA)

That leads you to a test impl here: https://github.com/damiencarol/jsr203-hadoop

These are the people that I am talking about.

asfimport commented 9 years ago

Greg Bowyer (@GregBowyer) (migrated from JIRA)

Oh wow, the Blur store might be exactly what I am looking for.