blugelabs / bluge

indexing library for Go
Apache License 2.0

Directory interface for S3: lazy loading #88

Open prabhatsharma opened 2 years ago

prabhatsharma commented 2 years ago

I have implemented the directory interface for storing index data in S3. Functionally it's working well, but it has a couple of challenges.

e.g. https://github.com/prabhatsharma/bluges3/blob/main/directory/s3.go#L105 and https://github.com/prabhatsharma/zinc/blob/s3/directory/s3.go (both are the same code)

  1. Whenever a writer is opened for a bluge index stored in S3, it executes the Load function (a writer is opened for every existing index on start in Zinc), which in turn downloads the entire index from the S3 bucket. If the data in S3 is 500 MB, the download starts immediately and blocks the application. This is a relatively minor problem, since I can parallelize the GetObject requests and make them faster (a sketch follows this list).
  2. Once bluge gets the data from S3, it keeps everything in memory, which increases the application's memory requirements. If I have a 50 GB bluge index (or a sharded set of indexes totaling 500 GB), that becomes impractical. This is the bigger challenge.
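
For illustration, here is a minimal sketch of the kind of parallel ranged GetObject download mentioned in point 1, using aws-sdk-go-v2. The bucket name, key, part size and error handling are placeholders for the example, not taken from the Zinc code:

```go
// Hypothetical helper (not part of the Zinc code): download s3://bucket/key by
// issuing many ranged GetObject calls in parallel and assembling the parts.
package main

import (
	"context"
	"fmt"
	"io"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

const partSize = int64(8 << 20) // 8 MiB per ranged GET (illustrative, not tuned)

func downloadParallel(ctx context.Context, client *s3.Client, bucket, key string) ([]byte, error) {
	head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket: aws.String(bucket), Key: aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	size := aws.ToInt64(head.ContentLength)
	buf := make([]byte, size)

	var wg sync.WaitGroup
	errCh := make(chan error, 1) // keep only the first error
	for off := int64(0); off < size; off += partSize {
		start, end := off, off+partSize-1
		if end >= size {
			end = size - 1
		}
		wg.Add(1)
		go func(start, end int64) {
			defer wg.Done()
			out, err := client.GetObject(ctx, &s3.GetObjectInput{
				Bucket: aws.String(bucket),
				Key:    aws.String(key),
				Range:  aws.String(fmt.Sprintf("bytes=%d-%d", start, end)),
			})
			if err == nil {
				defer out.Body.Close()
				_, err = io.ReadFull(out.Body, buf[start:end+1])
			}
			if err != nil {
				select {
				case errCh <- err:
				default:
				}
			}
		}(start, end)
	}
	wg.Wait()
	select {
	case err := <-errCh:
		return nil, err
	default:
		return buf, nil
	}
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	data, err := downloadParallel(ctx, s3.NewFromConfig(cfg), "my-index-bucket", "indexes/segment-000001.seg")
	if err != nil {
		panic(err)
	}
	fmt.Println("downloaded", len(data), "bytes")
}
```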

I would prefer to make GetObject requests to S3 only when there is a need to search something in the index, fetching data from S3 in real time.

What would be the best approach to handle this situation?

mschoch commented 2 years ago

There are a few minor things you can do, but unfortunately the index data being local to the Bluge code was an assumption of the original design. Most likely you'll need a new segment file format, and probably significant changes to the core index in Bluge in order for this to work well.

  1. You could probably adjust the code so that opening the Writer downloads less data, but there would be little point. As soon as you want to read or write ANY index data, we need ALL of the segments to be usable by the Bluge code (which means their entire contents are local and able to be loaded into memory). Why does a write operation require all the existing segments? Because we have to search across all of them to find any previous documents with the same _id and invalidate them. Why does a search operation require all the existing segments? Each segment is just one portion of the index, and we know nothing else about it, so we cannot skip any segments; all searches operate on all segments.
  2. I would expect you'd stream the downloaded segments to disk and then mmap them from there. Or perhaps reverse it and use the mmap to do the persistence; I'm not sure what integration points the s3 libraries will offer.
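
As a rough sketch of the stream-to-disk-then-mmap idea in point 2 (not Bluge's actual integration point), one could write something like the following, assuming aws-sdk-go-v2 and golang.org/x/exp/mmap; the bucket, key and local path are hypothetical:

```go
// Hedged sketch: stream one segment object from S3 to a local file, then mmap it.
// Bucket, key and path are hypothetical; error handling is minimal on purpose.
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"golang.org/x/exp/mmap"
)

func fetchAndMmap(ctx context.Context, client *s3.Client, bucket, key, localPath string) (*mmap.ReaderAt, error) {
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()

	f, err := os.Create(localPath)
	if err != nil {
		return nil, err
	}
	// Stream the object body to disk instead of holding it all in memory.
	if _, err := io.Copy(f, out.Body); err != nil {
		f.Close()
		return nil, err
	}
	if err := f.Close(); err != nil {
		return nil, err
	}
	// Memory-map the local copy; the OS pages data in on demand.
	return mmap.Open(localPath)
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	r, err := fetchAndMmap(ctx, s3.NewFromConfig(cfg), "my-index-bucket", "indexes/segment-000001.seg", "/tmp/segment-000001.seg")
	if err != nil {
		panic(err)
	}
	defer r.Close()
	fmt.Println("mmapped", r.Len(), "bytes")
}
```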

> I would prefer to make GetObject requests to S3 only when there is a need to search something in the index, fetching data from S3 in real time.

As I indicated above, this won't be possible in the current design, because all read/write operations indirectly need to be able to search all the segments, so no operation can proceed without them.

To me the larger issue is that you need to work with some other unit of data at the transport layer. In Bluge, reasonably sized indexes will contain segments that can get large. How large doesn't really matter; they will always be too large to assume you want to download the entire thing every time.

I would think you'd eventually need to make it so that when Bluge loads the requested segments, no real work is done. Instead, everything is set up such that subsequently, when Bluge tries to read the segment data, you load it on demand. This would allow you to use range requests to access only the parts of the segment files you really need. There might be some follow-on work to better lay out the data in the segment itself.
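
One hedged sketch of that on-demand direction: an io.ReaderAt whose ReadAt issues a ranged GetObject, so only the byte ranges that are actually read get fetched. The type below is purely illustrative and is not part of Bluge or the Zinc directory code:

```go
// Illustrative only, not part of Bluge or the Zinc directory code: an
// io.ReaderAt whose ReadAt issues a ranged GetObject against S3.
package s3lazy

import (
	"context"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

type s3ReaderAt struct {
	ctx    context.Context
	client *s3.Client
	bucket string
	key    string
	size   int64 // object size, e.g. obtained once via HeadObject
}

var _ io.ReaderAt = (*s3ReaderAt)(nil)

func (r *s3ReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if len(p) == 0 {
		return 0, nil
	}
	if off >= r.size {
		return 0, io.EOF
	}
	end := off + int64(len(p)) - 1
	if end >= r.size {
		end = r.size - 1
	}
	out, err := r.client.GetObject(r.ctx, &s3.GetObjectInput{
		Bucket: aws.String(r.bucket),
		Key:    aws.String(r.key),
		Range:  aws.String(fmt.Sprintf("bytes=%d-%d", off, end)),
	})
	if err != nil {
		return 0, err
	}
	defer out.Body.Close()
	n, err := io.ReadFull(out.Body, p[:end-off+1])
	if err != nil {
		return n, err
	}
	if n < len(p) {
		return n, io.EOF // the read ran past the end of the object
	}
	return n, nil
}
```

Whether something like this performs acceptably depends on how many small reads a search issues per segment, which is why the follow-on work on segment layout mentioned above would matter.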

I'm not trying to be negative about this, it's just that there is no easy lazy-loading of segments, because you always require all segments.

prabhatsharma commented 2 years ago

I understand the challenge.

> You could probably adjust the code so that opening the Writer downloads less data, but there would be little point. As soon as you want to read or write ANY index data, we need ALL of the segments to be usable by the Bluge code (which means their entire contents are local and able to be loaded into memory). Why does a write operation require all the existing segments? Because we have to search across all of them to find any previous documents with the same _id and invalidate them. Why does a search operation require all the existing segments? Each segment is just one portion of the index, and we know nothing else about it, so we cannot skip any segments; all searches operate on all segments.

Practically speaking, if I download the entire 25 GB of data (say ~70 segments, each ranging from 200 MB to 500 MB) and my application is running on EC2, it is still economical (and practical) for me to download the whole 50 GB of data every time in parallel using GET range requests (Amazon Athena follows this model to a certain extent).

I will dig deeper into the bluge code to see which areas may require modification. Any pointers would be great.

mschoch commented 2 years ago

Can you share how you would use a system designed this way?

mooijtech commented 2 years ago

I tested loading an Enron PST file (basically a .zip file for emails) via my go-pst library, reading from MinIO (check it out, it supports the S3 protocol), but reading took 5 minutes instead of only 5-10 seconds from disk. I expect you will encounter the same performance issues. S3 should be used for storage (backups), not live reading/writing.

mschoch commented 2 years ago

@mooijtech I understand this perspective, and I think many may share it; that is why I asked the questions I did. My understanding is a bit different: there are interesting solutions out there using object storage for live reading/writing.

Here are a few links I know about. These systems were designed by smart people I respect, so rather than dismiss them, it is better to ask questions and learn about their designs.

These are just two that directly relate to search; there are other projects with similar designs in related areas.

prabhatsharma commented 2 years ago

@mooijtech The idea that S3 is only for backup and not for live reading/writing is a fallacy. Amazon Athena and Redshift Spectrum regularly read petabytes of live data. The trick is to download data using GET range requests in parallel.

> When you say you have a node in EC2 and it downloads 25 GB of data, how long does this node run? Is this a long-running node, or is it transient?

This node would ideally be long running. However, it could also be transient, depending on whether we want to implement autoscaling.

> Is this node updating the data in S3? (The current code makes no attempt to ensure there is only one writer, so if you are writing to the index, you can only have one such node at a time.)

Totally agree with you. Locking will need some intelligent implementation and coordination among nodes to make sure that the same object is not being written to by multiple nodes.
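
One possible shape for that single-writer lock, sketched under the assumption that the bucket and SDK support S3 conditional writes (PutObject with If-None-Match), which not every S3-compatible backend offers; without that, an external coordination service would be needed instead. All names here are hypothetical:

```go
// Hypothetical sketch of a single-writer lock object in S3, relying on a
// conditional PUT so only one node can create the lock object at a time.
package s3lock

import (
	"context"
	"errors"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/smithy-go"
)

// tryAcquire attempts to create the lock object. It returns true if this node
// created it, and false if another node already holds the lock.
func tryAcquire(ctx context.Context, client *s3.Client, bucket, key, nodeID string) (bool, error) {
	_, err := client.PutObject(ctx, &s3.PutObjectInput{
		Bucket:      aws.String(bucket),
		Key:         aws.String(key), // e.g. "indexes/myindex/.write.lock" (hypothetical layout)
		Body:        strings.NewReader(nodeID),
		IfNoneMatch: aws.String("*"), // fail if the lock object already exists
	})
	if err != nil {
		var apiErr smithy.APIError
		if errors.As(err, &apiErr) && apiErr.ErrorCode() == "PreconditionFailed" {
			return false, nil // another writer holds the lock
		}
		return false, err
	}
	return true, nil
}

// release deletes the lock object so another node can become the writer.
func release(ctx context.Context, client *s3.Client, bucket, key string) error {
	_, err := client.DeleteObject(ctx, &s3.DeleteObjectInput{
		Bucket: aws.String(bucket), Key: aws.String(key),
	})
	return err
}
```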

> If you are updating the index over time, what is the point of constantly pushing data to s3? Do you have other nodes periodically re-downloading the data from s3 and using it in a read-only mode?

The use case here would be to push large batches of data as they arrive. It does not have to be real time; it could be delayed by a couple of minutes to a couple of hours, based on needs. Users can then query the data whenever they want. Amazon Athena together with Kinesis Firehose is a good parallel to think about.

Data -> Kinesis Firehose (batches data and pushes it to S3) -> Amazon Athena (users can query the data once it is available)

mooijtech commented 2 years ago

I would like to use MinIO/S3 as the storage backend, so I would also like to fix the performance issue. I asked the MinIO folks (on Slack) and was told:

> MinIO / S3 is an object store, not a file store. That means it stores and retrieves whole objects at a time and is not tuned for random I/O.

So for random I/O this isn't doable (performance wise), although I really want it to be.

> Although my inner dev is screaming at me to write some kind of middle-tier random I/O memory object that holds a cache of the remote file and can be used to access it as it's streamed in.

> You could use something like s3fs to simply mount the bucket and then use your normal file I/O to process the file.

> s3fs will do the object read and cache the file, and with your usage pattern (where it's read-only) the cache should provide the performance you need.

> But I can't speak for how much space will be consumed caching files that you don't need any more.

> Oh, before I forget, rsync has the same mount-an-S3-bucket capability and it might, if anything, be more robust than s3fs.

It's currently 8m55.289784405s vs 5-10 seconds from disk, which isn't doable.
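
For what it's worth, the "middle tier random I/O memory object" from the quoted advice could look roughly like the sketch below: an io.ReaderAt wrapper that fetches fixed-size chunks (for example via ranged GETs against S3/MinIO) and caches them in memory so repeated random reads don't go back to the object store. The names and the chunk size are made up for the example:

```go
// Illustrative sketch of a "middle tier" chunk cache in front of any io.ReaderAt.
package s3lazy

import (
	"io"
	"sync"
)

const chunkSize = 1 << 20 // 1 MiB cache granularity (illustrative)

type cachingReaderAt struct {
	src  io.ReaderAt
	size int64

	mu     sync.Mutex
	chunks map[int64][]byte // chunk index -> chunk contents
}

func newCachingReaderAt(src io.ReaderAt, size int64) *cachingReaderAt {
	return &cachingReaderAt{src: src, size: size, chunks: make(map[int64][]byte)}
}

// chunk returns the cached chunk with the given index, fetching it from the
// underlying reader on first use.
func (c *cachingReaderAt) chunk(idx int64) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.chunks[idx]; ok {
		return b, nil
	}
	off := idx * chunkSize
	n := int64(chunkSize)
	if off+n > c.size {
		n = c.size - off
	}
	b := make([]byte, n)
	m, err := c.src.ReadAt(b, off)
	if err != nil && err != io.EOF {
		return nil, err
	}
	if m < len(b) {
		return nil, io.ErrUnexpectedEOF
	}
	c.chunks[idx] = b
	return b, nil
}

func (c *cachingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off >= c.size {
		return 0, io.EOF
	}
	total := 0
	for total < len(p) && off < c.size {
		b, err := c.chunk(off / chunkSize)
		if err != nil {
			return total, err
		}
		n := copy(p[total:], b[off%chunkSize:])
		total += n
		off += int64(n)
	}
	if total < len(p) {
		return total, io.EOF // the read ran past the end of the object
	}
	return total, nil
}
```

With something like this between the reader and MinIO, the first pass is still network-bound, but subsequent random reads over the same ranges are served from memory.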

prabhatsharma commented 2 years ago

@mooijtech

The MinIO folks are right. Object stores are not optimized for random I/O or latency, but for throughput.

You should take a look at https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html to improve performance.
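
For reference, the SDK's transfer manager already implements the byte-range fetch pattern described in that whitepaper: it splits a single object into ranged GETs and downloads the parts concurrently. A minimal sketch with aws-sdk-go-v2 follows; the bucket, key, local path and tuning values are placeholders:

```go
// Hypothetical example: parallel byte-range download of one object to disk
// using the aws-sdk-go-v2 transfer manager.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	client := s3.NewFromConfig(cfg)

	// Each part is fetched with its own Range header; parts download in parallel.
	downloader := manager.NewDownloader(client, func(d *manager.Downloader) {
		d.PartSize = 16 << 20 // 16 MiB ranged GETs (illustrative)
		d.Concurrency = 8     // 8 parts in flight at once (illustrative)
	})

	f, err := os.Create("/tmp/segment-000001.seg")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	n, err := downloader.Download(ctx, f, &s3.GetObjectInput{
		Bucket: aws.String("my-index-bucket"),
		Key:    aws.String("indexes/segment-000001.seg"),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("downloaded", n, "bytes")
}
```

PartSize and Concurrency trade off request count against parallelism: larger parts mean fewer requests, smaller parts mean more ranged GETs in flight.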

mooijtech commented 2 years ago

My solution was to download the PST files to disk and then read them, which takes 1.2 seconds (including the download) for a 27 MB file. For the index segments, I have the nodes download them from MinIO and back them up periodically.

Thanks for the feedback; my problem was that I was trying to read from MinIO directly instead of downloading to disk first.