
Consider replacing keydir with spill to disk hash [JIRA: RIAK-2737] #204


engelsanchez commented 10 years ago

The main goal of this design is to replace the Bitcask keydir data structure with one that can handle key sets larger than memory. The requirements and passing criteria for this work: performance when the data set fits in memory should be comparable to or better than the current solution, and when the data set is larger than memory, overall Bitcask performance should be comparable to or better than LevelDB's. If that can't be achieved this way, throw it in the bit bucket.

This design allows concurrent access by many readers, writers and iterators. The current keydir data structure has a single lock that allows only one operation on the keydir at a time. Here, each page will have its own lock, and the global keydir lock will only be held when modifying the keydir itself (for example, during iterator creation). Realizing part of the performance benefit will also require Riak and the rest of Bitcask to change so that multiple operations can actually happen in parallel.
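
To make the locking scheme concrete, here is a minimal C sketch of the idea as described above. All of the names (`keydir_page`, `lock_page_for`) and the hash choice are hypothetical, not taken from the Bitcask source or the design document:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_PAGES 1024

typedef struct {
    pthread_mutex_t lock;        /* per-page lock: guards this page only */
    /* ... page payload (hash chains, entries) would live here ... */
} keydir_page;

typedef struct {
    pthread_mutex_t global_lock; /* held only to modify the keydir itself,
                                    e.g. while creating an iterator */
    keydir_page pages[NUM_PAGES];
} keydir;

/* FNV-1a, just to pick a page for the sketch. */
static uint64_t hash_key(const uint8_t *key, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) { h ^= key[i]; h *= 1099511628211ULL; }
    return h;
}

/* Lock only the page a key hashes to; operations on different pages
 * contend only on their own page locks and proceed in parallel. */
static keydir_page *lock_page_for(keydir *kd, const uint8_t *key, size_t len)
{
    keydir_page *p = &kd->pages[hash_key(key, len) % NUM_PAGES];
    pthread_mutex_lock(&p->lock);
    return p; /* caller operates on the page, then unlocks it */
}
```

The point of the split is that gets and puts touching different pages never serialize on a shared lock; only structural changes to the keydir as a whole take the global lock.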

The main components of this design are:

The gory details can be found in the design document

This is work that I will be doing in my spare time. It is stuff I find interesting to experiment with, and my brain needs it to keep its sanity. I would love to hear feedback, technical criticisms, entirely different and interesting directions that should be pursued instead, etc. Comment away.

evanmcc commented 10 years ago

Haven't gone through all of the design docs yet, but would love to see two things added:

engelsanchez commented 10 years ago

Unfortunately I don't have any reference paper for this. I would love to know if you know of any, to avoid wasting my time on bad paths. My naive search for "spill to disk hash table" didn't really turn up anything relevant. Searching for "hot data and virtual memory" got me some pages of Readings in Database Systems, which I promptly bought. But the paper in question was about executing joins, and no other paper in there seems similar enough. So I'm down $60, but no reference.

As for the time complexity: setting aside the fact that some pages are memory mapped, so operations may pay a heavy penalty once the entire memory buffer is consumed, the data structure is a very plain hash table where items with the same hash are laid out sequentially in pages instead of, for example, in a linked list. The table is never rehashed. So in theory a lookup is still O(1) plus the cost of scanning a series of pages, which is proportional to the data set size. I am not sure how to compute time complexities that are more meaningful for the interesting scenarios, where any number of the pages might generate I/O. If you assume that we cannot recommend its use when the hot data does not fit in memory, because you fall off the memory/disk cliff, you could say that it should stay in the vicinity of O(1) + k * num_keys or something.
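
A sketch of that lookup cost in C, to make the "scan a series of pages" part concrete. The entry layout and field names below are assumptions for illustration, not the actual keydir record: entries whose keys hash to the same bucket sit sequentially in fixed-size pages, a full page links to an overflow page, and the table is never resized, so the chain scan is the term that grows with the data set:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ENTRIES_PER_PAGE 64

typedef struct {
    uint8_t  key[20];    /* fixed-width key digest, for the sketch */
    uint32_t file_id;    /* bitcask data file containing the value */
    uint64_t value_pos;  /* offset of the value within that file */
    uint32_t value_sz;
} keydir_entry;

typedef struct page {
    size_t        used;                      /* entries occupied */
    keydir_entry  entries[ENTRIES_PER_PAGE];
    struct page  *next;                      /* overflow page; may be
                                                mmapped or spilled to disk */
} page;

/* O(1) to reach the first page via the hash, then a linear scan of the
 * chained pages; any page in the chain may cost an I/O if it was spilled. */
static const keydir_entry *chain_lookup(const page *p, const uint8_t key[20])
{
    for (; p != NULL; p = p->next)
        for (size_t i = 0; i < p->used; i++)
            if (memcmp(p->entries[i].key, key, 20) == 0)
                return &p->entries[i];
    return NULL;
}
```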

jsvisa commented 8 years ago

@engelsanchez Looking forward to it, but is there any news about the implementation?

I just came up with a simple implementation: replace the keydir data structure with a LevelDB instance. The implementation is easy, but I haven't worked through the time complexity or the other criteria.
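
A rough sketch of that idea using LevelDB's C API (`leveldb/c.h`). The `keydir_entry` layout is an assumption for illustration, not the actual Bitcask keydir record; compile with `-lleveldb`:

```c
#include <leveldb/c.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t file_id;   /* bitcask data file id */
    uint64_t value_pos; /* offset of the value in that file */
    uint32_t value_sz;  /* size of the stored value */
} keydir_entry;

int main(void)
{
    char *err = NULL;
    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "/tmp/keydir.ldb", &err);
    if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

    /* put: bitcask key -> packed keydir entry */
    keydir_entry e = { .file_id = 3, .value_pos = 4096, .value_sz = 512 };
    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, "mykey", 5, (const char *)&e, sizeof e, &err);
    if (err) { fprintf(stderr, "put: %s\n", err); return 1; }

    /* get: read the entry back */
    size_t len = 0;
    leveldb_readoptions_t *ropts = leveldb_readoptions_create();
    char *val = leveldb_get(db, ropts, "mykey", 5, &len, &err);
    if (val && len == sizeof e) {
        keydir_entry got;
        memcpy(&got, val, sizeof got);
        printf("file_id=%u pos=%llu sz=%u\n",
               got.file_id, (unsigned long long)got.value_pos, got.value_sz);
    }
    leveldb_free(val);

    leveldb_readoptions_destroy(ropts);
    leveldb_writeoptions_destroy(wopts);
    leveldb_close(db);
    leveldb_options_destroy(opts);
    return 0;
}
```

This delegates the spill-to-disk problem to LevelDB entirely, which is why it is easy to write, but every keydir lookup then pays LevelDB's read path instead of a hash probe, so the in-memory performance criterion from the original design would still need to be measured.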