blugelabs / bluge

indexing library for Go
Apache License 2.0

Question about the use of large segment files as storage #93

Open didip opened 2 years ago

didip commented 2 years ago

Just for my own education, why did Bluge move away from a KV interface and toward large segment files?

mschoch commented 2 years ago

Sure, so to be clear, Bleve moved away from the KV interface (in the 2017/2018 time frame), and Bluge has just continued to use a similar approach to Bleve's.

So then, why did Bleve move away from the KV interface?

In short, we found that using the KV abstraction at the level of the inverted index was problematic, and it prevented us from easily using advanced data structures like the FST (finite state transducer) and the roaring bitmaps library. Using an FST to encode the term dictionary lets us efficiently find terms in the index that match a regular expression or fall within a given Levenshtein distance. With a KV store you have two obvious choices. First, you can break the dictionary data up into multiple KV pairs (this is what the upsidedown index format in Bleve does). That is a fine way to encode the dictionary, but you lose the ability to find terms by regex or fuzzy match; instead you must iterate all terms and apply that logic at a higher level. Alternatively, you can store the entire FST inside the value of a single KV entry. Here we found that many of the "embedded" KV stores did not work well with large values (often they would force us to copy the entire value, or copy it for us without our control).
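To make the FST point concrete, here is a small sketch using the vellum FST library (the FST implementation used under the hood by Bleve's zap and Bluge's ice segment formats). The terms, values, and pattern are made up for illustration; in a real segment the uint64 values would locate each term's postings. Fuzzy matching works the same way, with a Levenshtein automaton passed to Search instead of the regexp automaton.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/blevesearch/vellum"
	"github.com/blevesearch/vellum/regexp"
)

func main() {
	// Build a tiny term dictionary as an FST. Keys must be inserted in
	// sorted order; each value would normally point at that term's postings.
	var buf bytes.Buffer
	b, err := vellum.New(&buf, nil)
	if err != nil {
		log.Fatal(err)
	}
	for i, term := range []string{"water", "wattage", "winter", "wither"} {
		if err := b.Insert([]byte(term), uint64(i)); err != nil {
			log.Fatal(err)
		}
	}
	if err := b.Close(); err != nil {
		log.Fatal(err)
	}

	fst, err := vellum.Load(buf.Bytes())
	if err != nil {
		log.Fatal(err)
	}

	// Intersect a regexp automaton with the FST: only matching terms are
	// visited, with no need to iterate the whole dictionary.
	re, err := regexp.New(`w.t.*r`)
	if err != nil {
		log.Fatal(err)
	}
	itr, err := fst.Search(re, nil, nil)
	for err == nil {
		key, val := itr.Current()
		fmt.Printf("matched term %q (value %d)\n", key, val)
		err = itr.Next()
	}
	if err != nil && err != vellum.ErrIteratorDone {
		log.Fatal(err)
	}
}
```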

The bottom line is that the original design performed poorly, and it wasn't just a matter of plugging in a faster KV store; the scheme by which we encoded the index data into KV form wasn't good enough.

A second issue is that with a KV store, we are working with an abstraction that allows updating data in place. However, it is often helpful to control which data is updated in place and which data is write-once or append-only. By moving to a model where we had full control over the bytes on disk, we were able to deliver a solution that performed much closer to people's expectations.
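As a rough illustration of the write-once idea (this is not Bluge's actual on-disk format, just a sketch of the shape of the approach): segment files are written once and never modified, and only a small manifest naming the live segments is rewritten on each commit.

```go
// Sketch only, not Bluge's real format: segments are immutable files,
// and the manifest is the single piece of mutable state, replaced atomically.
package segstore

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// persistSegment writes one immutable segment file; it is never rewritten.
func persistSegment(dir string, id uint64, data []byte) error {
	name := filepath.Join(dir, fmt.Sprintf("%012d.seg", id))
	return os.WriteFile(name, data, 0o644)
}

// persistManifest records which segments are live. Writing to a temp file
// and renaming keeps readers from ever seeing a half-written manifest.
func persistManifest(dir string, liveSegments []uint64) error {
	var buf bytes.Buffer
	for _, id := range liveSegments {
		fmt.Fprintf(&buf, "%012d\n", id)
	}
	tmp := filepath.Join(dir, "manifest.tmp")
	if err := os.WriteFile(tmp, buf.Bytes(), 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "manifest"))
}
```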

To be clear, at the time this decision was made, it was entirely about using a local KV store (leveldb, rocksdb, boltdb, etc.) or working with files directly. In the rest of this post I go on a bit more about things like S3, so if that isn't what you were getting at, you can probably ignore it.

Obviously, the rise in popularity of cloud object storage has led to a different class of KV storage, and while you didn't call this out specifically, it has come up a lot recently. Many people ask how they can "store their index in s3". So far, we have only a list of substandard options:

  1. You could use Bleve with a custom KV wrapper over an s3-like API
  2. You could use Bluge with an s3 directory implementation

Neither of these is a good solution currently.

In my opinion, much of the problem isn't "large segments" directly, but that much of the surrounding code that uses the segments makes too many assumptions. For example, today during a search we operate on each segment one at a time. In our testing that led to a good balance of query latency and throughput under the workloads we cared about. But if every segment read is a remote call to s3, you'd hope to process a bunch of them at the same time to compensate for the latency.
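As a generic illustration of that last point (this is not Bluge code): if each segment read is a remote call, fanning the per-segment work out across a bounded set of goroutines lets the latencies overlap instead of adding up. The fetch and search callbacks here are placeholders.

```go
package search

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// searchSegmentsConcurrently runs the per-segment work in parallel, with a
// bound on concurrency, so remote reads (e.g. S3 GETs) overlap rather than
// being paid one at a time.
func searchSegmentsConcurrently(
	ctx context.Context,
	segmentIDs []uint64,
	fetch func(context.Context, uint64) ([]byte, error), // placeholder remote read
	search func([]byte) error, // placeholder per-segment search
) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(8) // don't issue an unbounded number of remote requests at once
	for _, id := range segmentIDs {
		id := id // capture loop variable for the goroutine
		g.Go(func() error {
			data, err := fetch(ctx, id)
			if err != nil {
				return err
			}
			return search(data)
		})
	}
	return g.Wait()
}
```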

I think it's important to consider at what level we use/apply particular abstractions. My own opinion is that using segments is fine, and probably still desirable. But, if you start with the assumption that they'll be stored in s3, you would design/tune a lot of the surrounding code differently.