IntersectMBO / lsm-tree

A Haskell library for on-disk tables based on LSM-Trees
Apache License 2.0

Allow chunks that are larger than the maximum chunk size in index construction #303

Open jorisdral opened 1 month ago

jorisdral commented 1 month ago

When incrementally constructing a compact/ordinary index and the max chunk size is exceeded, we currently might split the output into multiple chunks if the serialised form of the index has a size that is a multiple of the max chunk size. It might be worth it to just return a larger chunk in that case (maybe still a multiple of the max chunk size), since we don't really rely on chunks being a specific size anywhere in the RunBuilder/RunAcc code.
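
To illustrate the difference, here is a minimal Haskell sketch of the two behaviours; the names `Chunk`, `maxChunkSize`, `splitIntoChunks` and `emitSingleChunk` are hypothetical and do not reflect the library's actual API:

```haskell
import qualified Data.ByteString as BS

-- Hypothetical chunk of serialised index data.
newtype Chunk = Chunk BS.ByteString

-- Hypothetical maximum chunk size in bytes.
maxChunkSize :: Int
maxChunkSize = 4096

-- Current behaviour (roughly): once the buffered serialised data exceeds the
-- maximum chunk size, it is cut into several chunks of at most maxChunkSize.
splitIntoChunks :: BS.ByteString -> [Chunk]
splitIntoChunks bs
  | BS.null bs = []
  | otherwise  = let (c, rest) = BS.splitAt maxChunkSize bs
                 in Chunk c : splitIntoChunks rest

-- Proposed behaviour: return the whole buffer as one (possibly oversized)
-- chunk, since RunBuilder/RunAcc do not rely on chunks having a fixed size.
emitSingleChunk :: BS.ByteString -> [Chunk]
emitSingleChunk bs
  | BS.null bs = []
  | otherwise  = [Chunk bs]
```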

jeltsch commented 1 month ago

As we agreed in our project meeting, this indeed seems to be the way to go. Concretely, we concluded in the meeting that, whenever the size of the buffered serialized data exceeds a certain threshold, all available data should be output in the form of a single chunk. This particularly means the following:

The last point is justified because, in practice, chunks will still not become so large that the work of writing the index isn't appropriately spread over time, particularly since serialized keys aren't expected to be large.

With this new approach, the output of appending to an index shouldn't have type [Chunk] anymore but rather type Maybe Chunk, as there can be at most one chunk.
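
As a rough sketch of what such an append could look like under this scheme (the names `IndexAcc`, `appendSerialised` and `chunkThreshold` are made up for illustration and are not the library's actual API):

```haskell
import qualified Data.ByteString as BS

-- Hypothetical chunk of serialised index data.
newtype Chunk = Chunk BS.ByteString

-- Hypothetical accumulator for buffered, not-yet-output serialised data.
newtype IndexAcc = IndexAcc BS.ByteString

-- Hypothetical threshold above which buffered data is flushed.
chunkThreshold :: Int
chunkThreshold = 4096

-- Appending yields at most one chunk: if the buffered data exceeds the
-- threshold, all of it is emitted as a single (possibly oversized) chunk;
-- otherwise nothing is emitted yet.
appendSerialised :: BS.ByteString -> IndexAcc -> (Maybe Chunk, IndexAcc)
appendSerialised new (IndexAcc buf)
  | BS.length buf' > chunkThreshold = (Just (Chunk buf'), IndexAcc BS.empty)
  | otherwise                       = (Nothing, IndexAcc buf')
  where
    buf' = buf <> new
```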

I will implement this new approach to chunk generation as part of #296 and #299. For the compact index, it remains to be implemented (potentially by using the general-purpose chunk handling to be added in #296).