Open jpountz opened 1 week ago
Nice idea @jpountz! I'll try to spend some time working on this.
@jpountz Interestingly, I have been working on this and was about to open an issue. If it's OK, may I self-assign this?
@albogdano I'm curious if you have any interest in contributing your https://github.com/albogdano/lucene-s3directory?
@shubhamvishu @atris Thanks for volunteering to help! I'm keen on checking if @albogdano has interest in contributing first, but even if we went that route I'm sure we'll need help to properly support new Directory APIs like IndexInput#prefetch, and to support the GCP and Azure counterparts of S3.
@jpountz Yes! How can I help you guys? My knowledge of Lucene internals is quite limited and the goal of the lucene-s3directory was mainly to be a proof of concept.
I'm thinking of a PR that would create a new lucene/directory/s3 module where we'd check in the code.
Regarding "proof of concept": what is your gut feeling? Should we start with your code and iterate on it to make it production-ready, or would it be easier/better to start from scratch, just taking inspiration from your existing code?
You will save some time if you iterate on my code as it already implements the boring parts for integrating with the AWS SDK (version 2.x is used btw). All I did was to clone Shay's JDBCDirectory from Compass and make it work with S3. There are no extra features.
Sounds good. Would you like to work on the PR?
Yes, of course! Are there any requirements for the PR? It would be a fairly large chunk of code for a single PR and I'm not sure if that's allowed. Should I just add the code to a new branch and push for review?
No special requirements, you may just need to adjust formatting (running ./gradlew tidy) and make sure it conforms with other requirements that are checked by the build, like forbidden APIs.
I've been thinking about this for a bit. In addition to an S3-based directory, I believe there could be some benefit from defining an S3 (or other object store) codec inspired by Parquet.
That is, the existing Lucene formats are "pure" column-stride (i.e. fields are contiguous). If we split things into "row groups", I believe we could reduce the need for random reads and prefetches. I've been thinking of giving that a try. It's complementary to the object store directory itself, so it can be worked on independently.
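To make the row-group idea above concrete, here is a minimal sketch that counts the ranged reads needed to fetch all fields for a small doc range under each layout. It is an illustration only, not a proposed codec: the class `LayoutSketch`, its method names, and the assumption of fixed-width values laid out back to back are all hypothetical and do not reflect how Lucene actually encodes its files.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model: every value is bytesPerValue wide, no headers or compression. */
final class LayoutSketch {

  /** Column-stride: field f is one contiguous run covering all docs. */
  static List<long[]> columnStride(
      int numDocs, int numFields, int bytesPerValue, int fromDoc, int toDoc) {
    List<long[]> ranges = new ArrayList<>();
    for (int f = 0; f < numFields; f++) {
      long fieldStart = (long) f * numDocs * bytesPerValue;
      ranges.add(new long[] {
        fieldStart + (long) fromDoc * bytesPerValue,
        fieldStart + (long) toDoc * bytesPerValue
      });
    }
    return coalesce(ranges);
  }

  /** Row groups: fetch every group overlapping [fromDoc, toDoc) in full. */
  static List<long[]> rowGroups(
      int numDocs, int numFields, int bytesPerValue, int groupSize, int fromDoc, int toDoc) {
    List<long[]> ranges = new ArrayList<>();
    long fullGroupBytes = (long) groupSize * numFields * bytesPerValue;
    int firstGroup = fromDoc / groupSize;
    int lastGroup = (toDoc - 1) / groupSize;
    for (int g = firstGroup; g <= lastGroup; g++) {
      long start = g * fullGroupBytes;
      // The last group may hold fewer than groupSize docs.
      int docsInGroup = Math.min(groupSize, numDocs - g * groupSize);
      ranges.add(new long[] {start, start + (long) docsInGroup * numFields * bytesPerValue});
    }
    return coalesce(ranges);
  }

  /** Merges ranges that touch end-to-start; input must be sorted by start offset. */
  static List<long[]> coalesce(List<long[]> sorted) {
    List<long[]> out = new ArrayList<>();
    for (long[] r : sorted) {
      if (!out.isEmpty() && out.get(out.size() - 1)[1] == r[0]) {
        out.get(out.size() - 1)[1] = r[1];
      } else {
        out.add(new long[] {r[0], r[1]});
      }
    }
    return out;
  }

  /** Total number of bytes covered by the given ranges. */
  static long totalBytes(List<long[]> ranges) {
    long sum = 0;
    for (long[] r : ranges) {
      sum += r[1] - r[0];
    }
    return sum;
  }
}
```

With 1000 docs, 4 fields, 8-byte values and groups of 100 docs, reading docs [250, 260) takes 4 separate ranges (320 bytes) in the column-stride model and 1 range (3200 bytes) in the row-group model. That is the trade-off in miniature: fewer round trips to the object store in exchange for reading more bytes per request.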
There are so many projects (the latest one I've heard of is Nixiesearch, presented at Haystack) trying to read Lucene indexes from S3. Should we provide an S3-based directory (and equivalents for other object stores) directly in Lucene?