cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

disagg: investigate supporting larger cache sizes #2544

Open RaduBerinde opened 1 year ago

RaduBerinde commented 1 year ago

The simple cache in #2541 requires all metadata to be kept in memory, which can be prohibitive for very large caches.

Investigate ways to maintain only part of the metadata in memory (@sumeerbhola suggests having a "colder" part of the cache that uses random eviction; the mapping between logical blocks and cache blocks would be held in a persistent KV store).

Jira issue: PEBBLE-188

Epic CRDB-40358

joshimhoff commented 1 year ago

the mapping between logical blocks and cache blocks would be held in a persistent KV store

The core difficulty here seems to be the need for a persistent KV store that maps <filenum, logical block index> to cache block index. Some ideas about this are below:

This is somewhat more complicated than keeping all metadata in memory. At the same time, if we do need to execute on this ticket, I think the KV store we need is rather limited in terms of its functional requirements.
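As a rough illustration of how narrow that interface could be, here is a minimal Go sketch; the type and method names are hypothetical, not taken from Pebble, and this is not the actual requirements list:

```go
package cachemeta

// LogicalBlockID identifies a block of a backing sstable: the file number
// plus the index of the block within that file.
type LogicalBlockID struct {
	FileNum    uint64
	BlockIndex uint64
}

// blockMappingStore sketches the narrow persistent KV interface the
// secondary cache would need: point lookups, inserts, and deletes keyed by
// LogicalBlockID, with values that are just cache block indexes.
type blockMappingStore interface {
	// Get returns the cache block index holding the logical block, or
	// ok=false if the block is not present in the secondary cache.
	Get(id LogicalBlockID) (cacheBlockIdx uint64, ok bool, err error)
	// Set records that the logical block is stored at cacheBlockIdx.
	Set(id LogicalBlockID, cacheBlockIdx uint64) error
	// Delete removes the mapping, e.g. when the cache block is evicted.
	Delete(id LogicalBlockID) error
}
```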

Yiling-J commented 1 year ago

I just came across this issue while reading the Pebble code. I'm currently adding a secondary cache to my cache package, Theine, and my implementation is based on CacheLib's Hybrid Cache. If keys and values are small and fixed-size, you could take a look at CacheLib's Small Object Cache, which only keeps a Bloom filter per bucket in memory.

You could also consider using a count-min sketch to evict blocks based on access frequency.
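A minimal count-min sketch in Go could look like the following; this is illustrative only, not code from Theine or CacheLib, and the parameters are arbitrary:

```go
package freqsketch

import "hash/maphash"

// CountMinSketch estimates how often a block key has been accessed using
// depth x width counters; estimates are upper bounds and never undercount.
type CountMinSketch struct {
	counts [][]uint32
	seeds  []maphash.Seed
	width  uint64
}

// New returns a sketch with the given number of rows (depth) and counters
// per row (width). Larger values trade memory for accuracy.
func New(depth, width int) *CountMinSketch {
	s := &CountMinSketch{width: uint64(width)}
	for i := 0; i < depth; i++ {
		s.counts = append(s.counts, make([]uint32, width))
		s.seeds = append(s.seeds, maphash.MakeSeed())
	}
	return s
}

// Add records one access of key by incrementing one counter per row.
func (s *CountMinSketch) Add(key []byte) {
	for i, seed := range s.seeds {
		s.counts[i][maphash.Bytes(seed, key)%s.width]++
	}
}

// Estimate returns an upper-bound estimate of how many times key was added;
// eviction could prefer blocks with the lowest estimates.
func (s *CountMinSketch) Estimate(key []byte) uint32 {
	est := ^uint32(0)
	for i, seed := range s.seeds {
		if c := s.counts[i][maphash.Bytes(seed, key)%s.width]; c < est {
			est = c
		}
	}
	return est
}
```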

joshimhoff commented 1 year ago

Thanks for linking us your package, @Yiling-J!

Persist a hash table from <filenum, logical block index> to cache block index.

One thing I realize (after talking to Radu about something else) is that the approach I suggested would have very high write amplification, unless we could delay flushing writes to the persisted hash table for a long time, so that much of a 4 KB FS page has changed by flush time. Since we have a metadata WAL, delaying flushes is fine from a consistency standpoint. But delaying flushes costs memory, and the whole reason to persist the mapping from <filenum, logical block index> to cache block index is to reduce memory usage. Sounds... bad, but I'm not sure whether this write amplification issue makes the approach untenable. The overall disk usage may be dominated by other things, e.g. writing the cache blocks themselves, in which case the write amplification would be fine to eat.
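To put a rough number on this, a back-of-the-envelope calculation (the ~24-byte entry size is an assumption, not a measured figure):

```go
package main

import "fmt"

func main() {
	// Assumed sizes, purely illustrative: one mapping entry of
	// <filenum, logical block index> -> cache block index might be ~24 bytes,
	// while the filesystem page rewritten on flush is 4 KB.
	const entryBytes = 24
	const pageBytes = 4096

	// Flushing after every update: each ~24-byte logical write costs a
	// 4 KB physical write, i.e. ~170x write amplification on the metadata.
	fmt.Printf("flush per update: ~%dx write amplification\n", pageBytes/entryBytes)

	// Delaying flushes until, say, half of each page has changed brings this
	// down to ~2x, but the unflushed updates must be held in memory.
	fmt.Printf("flush at half-dirty pages: ~%dx write amplification\n", pageBytes/(pageBytes/2))
}
```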

I guess there are reasons that the design space for KV stores backed by disk is highly constrained to be mostly LSMs or B-trees :)

joshimhoff commented 1 year ago

Persist a hash table from <filenum, logical block index> to cache block index.

New idea: I wonder if we could hash <filenum, logical block index> to produce a cache block index. At that cache block index, store a header containing <filenum, logical block index> followed by the cache block contents. A good hash function should generate every cache block index in its output range with equal probability, so perhaps eviction is "random" enough for our use case [1]? No KV store holding the mapping is needed, since the mapping is a pure function.

[1] is this true?
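A rough Go sketch of this direct-addressed scheme, assuming a single preallocated cache file divided into fixed-size slots (all sizes and names here are illustrative, not from the Pebble code):

```go
package hashcache

import (
	"encoding/binary"
	"hash/fnv"
	"os"
)

const (
	headerSize     = 16       // filenum (8 bytes) + logical block index (8 bytes)
	cacheBlockSize = 32 << 10 // payload bytes per slot; illustrative
	slotSize       = headerSize + cacheBlockSize
)

// slotIndex hashes <filenum, logical block index> to a slot in the cache
// file. FNV is used because the mapping must be stable across restarts; a
// per-process random hash seed would effectively empty the cache on restart.
func slotIndex(fileNum, blockIdx, numSlots uint64) uint64 {
	var key [16]byte
	binary.LittleEndian.PutUint64(key[:8], fileNum)
	binary.LittleEndian.PutUint64(key[8:], blockIdx)
	h := fnv.New64a()
	h.Write(key[:])
	return h.Sum64() % numSlots
}

// Get reads the slot the logical block hashes to. It returns ok=false when a
// different block currently occupies the slot, i.e. a cache miss: whichever
// block previously lived there was implicitly evicted when the slot was
// overwritten.
func Get(f *os.File, numSlots, fileNum, blockIdx uint64) (contents []byte, ok bool, err error) {
	buf := make([]byte, slotSize)
	off := int64(slotIndex(fileNum, blockIdx, numSlots)) * slotSize
	if _, err := f.ReadAt(buf, off); err != nil {
		return nil, false, err
	}
	if binary.LittleEndian.Uint64(buf[:8]) != fileNum ||
		binary.LittleEndian.Uint64(buf[8:16]) != blockIdx {
		return nil, false, nil
	}
	return buf[headerSize:], true, nil
}
```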

joshimhoff commented 1 year ago

Regarding [1], https://en.wikipedia.org/wiki/Hash_function is useful.

Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.

A hash procedure must be deterministic—meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day.

sumeerbhola commented 1 year ago

@joshimhoff can we just start with storing this metadata in a small Pebble DB? We can optimize after measuring. The persistent hash map is just a module in the context of the sketch in https://docs.google.com/document/d/1iG2ht94KHHCrqcq9yaNpN63G_O2vFtNk3z-kI1YYjJE/edit#bookmark=id.14kf3lzicew1, and it can be improved independently from the rest.
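For concreteness, a minimal sketch of what that could look like using Pebble's public API; the key/value encoding and the directory name are assumptions:

```go
package main

import (
	"encoding/binary"
	"log"

	"github.com/cockroachdb/pebble"
)

// encodeKey builds a key for the metadata DB from <filenum, logical block
// index>. The encoding is an assumption; any fixed-width scheme would do.
func encodeKey(fileNum, blockIdx uint64) []byte {
	k := make([]byte, 16)
	binary.BigEndian.PutUint64(k[:8], fileNum)
	binary.BigEndian.PutUint64(k[8:], blockIdx)
	return k
}

func main() {
	// A small, separate Pebble instance holding only the cache metadata.
	db, err := pebble.Open("secondary-cache-meta", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Record that logical block (7, 42) lives at cache block index 12345.
	var val [8]byte
	binary.BigEndian.PutUint64(val[:], 12345)
	if err := db.Set(encodeKey(7, 42), val[:], pebble.Sync); err != nil {
		log.Fatal(err)
	}

	// Look it up again.
	v, closer, err := db.Get(encodeKey(7, 42))
	if err == pebble.ErrNotFound {
		log.Println("not cached")
	} else if err != nil {
		log.Fatal(err)
	} else {
		log.Println("cache block index:", binary.BigEndian.Uint64(v))
		closer.Close()
	}
}
```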

joshimhoff commented 1 year ago

Yes, a small Pebble DB is an option. Radu & I have discussed it a bit. We are somewhat concerned about complexity, which partly motivated me to think up some other options that may be simpler.

We will be focusing on a basic secondary cache with metadata in memory plus a log of metadata changes for persistence/consistency for some time anyway. No urgency re: this ticket.