apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add an S3-based directory. #13868

Open jpountz opened 1 week ago

jpountz commented 1 week ago

There are so many projects trying to read Lucene indexes from S3 (the latest one I've heard of is Nixiesearch, presented at Haystack). Should we provide an S3-based directory (and equivalents for other object stores) in Lucene directly?

shubhamvishu commented 1 week ago

Nice idea @jpountz! I'll try spending some time working on this.

atris commented 5 days ago

@jpountz Interestingly, I have been working on this and was about to open an issue. If it's OK, I will self-assign this.

jpountz commented 5 days ago

@albogdano I'm curious if you have any interest in contributing your https://github.com/albogdano/lucene-s3directory?

@shubhamvishu @atris Thanks for volunteering to help! I'd like to check first whether @albogdano is interested in contributing, but even if we go that route I'm sure we'll need help to properly support new Directory APIs like IndexInput#prefetch, and to support the GCP and Azure counterparts of S3.
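
For readers unfamiliar with why prefetching matters on an object store, here is a minimal sketch of the idea. It is not Lucene code and does not extend IndexInput: `RangeStore`, `InMemoryStore` and `PrefetchingInput` are hypothetical names, and the in-memory store stands in for ranged GET requests so the example runs without any cloud SDK. The point is only that `prefetch` can start a fetch in the background so a later `read` of the same range does not pay the round-trip latency again.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PrefetchSketch {

  /** Hypothetical stand-in for a ranged GET against an object store. */
  interface RangeStore {
    byte[] read(long offset, int length);
  }

  /** In-memory store that counts requests (assumption: replaces a real S3 client). */
  static final class InMemoryStore implements RangeStore {
    final byte[] data;
    final AtomicInteger requests = new AtomicInteger();

    InMemoryStore(byte[] data) {
      this.data = data;
    }

    @Override
    public byte[] read(long offset, int length) {
      requests.incrementAndGet();
      return Arrays.copyOfRange(data, (int) offset, (int) offset + length);
    }
  }

  /** Hypothetical input that lets callers announce ranges they will need soon. */
  static final class PrefetchingInput {
    private final RangeStore store;
    private final Map<String, CompletableFuture<byte[]>> inFlight = new ConcurrentHashMap<>();

    PrefetchingInput(RangeStore store) {
      this.store = store;
    }

    /** Starts fetching the range in the background; returns immediately. */
    void prefetch(long offset, int length) {
      inFlight.computeIfAbsent(
          offset + ":" + length,
          key -> CompletableFuture.supplyAsync(() -> store.read(offset, length)));
    }

    /** Uses a matching prefetched range if there is one, else fetches synchronously. */
    byte[] read(long offset, int length) {
      CompletableFuture<byte[]> pending = inFlight.remove(offset + ":" + length);
      return pending != null ? pending.join() : store.read(offset, length);
    }
  }

  public static void main(String[] args) {
    byte[] data = new byte[100];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) i;
    }
    InMemoryStore store = new InMemoryStore(data);
    PrefetchingInput in = new PrefetchingInput(store);

    // Announce both ranges up front so the two fetches can overlap.
    in.prefetch(10, 4);
    in.prefetch(50, 4);

    System.out.println(Arrays.toString(in.read(10, 4)));
    System.out.println(Arrays.toString(in.read(50, 4)));
    System.out.println("requests: " + store.requests.get());
  }
}
```

A real implementation would also need to handle partial overlap between prefetched and requested ranges, bound the memory held by in-flight requests, and surface I/O errors, none of which this sketch attempts.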

albogdano commented 5 days ago

@jpountz Yes! How can I help you guys? My knowledge of Lucene internals is quite limited, and the goal of lucene-s3directory was mainly to be a proof of concept.

jpountz commented 5 days ago

I'm thinking of a PR that would create a new lucene/directory/s3 module where we'd check in the code.

> proof of concept

What is your gut feeling: should we rather start with your code and iterate on it to make it production-ready, or would it be easier/better to start from scratch, just taking inspiration from your existing code?

albogdano commented 5 days ago

You will save some time if you iterate on my code as it already implements the boring parts for integrating with the AWS SDK (version 2.x is used btw). All I did was to clone Shay's JDBCDirectory from Compass and make it work with S3. There are no extra features.

jpountz commented 5 days ago

Sounds good. Would you like to work on the PR?

albogdano commented 5 days ago

Yes, of course! Are there any requirements for the PR? It would be a fairly large chunk of code for a single PR and I'm not sure if that's allowed. Should I just add the code to a new branch and push for review?

jpountz commented 5 days ago

No special requirements, you may just need to adjust formatting (running ./gradlew tidy) and make sure it conforms with other requirements that are checked by the build, like forbidden APIs.

msfroh commented 5 days ago

I've been thinking about this for a bit. In addition to an S3-based directory, I believe there could be some benefit from defining an S3 (or other object store) codec inspired by Parquet.

That is, the existing Lucene formats are "pure" column-stride (i.e. fields are contiguous). If we split things into "row groups", I believe we could reduce the need for random reads and prefetches. I've been thinking of giving that a try. It's complementary to the object store directory itself, so it can be worked on independently.
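
To make the row-group idea concrete, here is a toy sketch. It is neither a Lucene format nor Parquet's actual layout: it assumes fixed-width 8-byte values, reads issued in 4 KiB blocks, and all names (`LayoutSketch`, `columnStrideOffset`, `rowGroupOffset`) are hypothetical. It counts how many distinct blocks must be fetched to read every field of a single document under each layout.

```java
import java.util.HashSet;
import java.util.Set;

public class LayoutSketch {

  /** Column-stride: all values of field 0, then all values of field 1, and so on. */
  static long columnStrideOffset(int field, int doc, int numDocs, int width) {
    return ((long) field * numDocs + doc) * width;
  }

  /** Row groups: docs are split into groups; fields are contiguous only within a group. */
  static long rowGroupOffset(
      int field, int doc, int numDocs, int numFields, int groupSize, int width) {
    int group = doc / groupSize;
    int docInGroup = doc % groupSize;
    // The last group may hold fewer docs than groupSize.
    int docsInThisGroup = Math.min(groupSize, numDocs - group * groupSize);
    long groupStart = (long) group * groupSize * numFields * width;
    return groupStart + ((long) field * docsInThisGroup + docInGroup) * width;
  }

  /** Number of distinct blocks touched when reading every field of one doc. */
  static int columnStrideBlocks(int doc, int numDocs, int numFields, int width, int blockSize) {
    Set<Long> blocks = new HashSet<>();
    for (int f = 0; f < numFields; f++) {
      blocks.add(columnStrideOffset(f, doc, numDocs, width) / blockSize);
    }
    return blocks.size();
  }

  static int rowGroupBlocks(
      int doc, int numDocs, int numFields, int groupSize, int width, int blockSize) {
    Set<Long> blocks = new HashSet<>();
    for (int f = 0; f < numFields; f++) {
      blocks.add(rowGroupOffset(f, doc, numDocs, numFields, groupSize, width) / blockSize);
    }
    return blocks.size();
  }

  public static void main(String[] args) {
    int numDocs = 1_000_000, numFields = 4, width = 8, blockSize = 4096;
    int groupSize = 128; // 128 docs * 4 fields * 8 bytes = exactly one 4 KiB block
    int doc = 123_456;

    System.out.println(columnStrideBlocks(doc, numDocs, numFields, width, blockSize)); // prints 4
    System.out.println(
        rowGroupBlocks(doc, numDocs, numFields, groupSize, width, blockSize)); // prints 1
  }
}
```

With these numbers, the column-stride layout touches one block per field, while the row-group layout serves all four fields from a single block. The trade-off, not modeled here, is that scanning one field across all documents becomes less sequential once the data is split into groups.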