cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.72k stars 434 forks

db: consider memtable optimizations for duplicate keys #2683

Open jbowens opened 1 year ago

jbowens commented 1 year ago

With expiration-based leases enabled, CockroachDB repeatedly overwrites the same set of keys (corresponding to a lease per KV range). With our existing skiplist memtable implementation, each duplicate key written copies the entire key into the memtable arena. These frequent writes can quickly fill a memtable, forcing a flush. Triggering early flushes hurts write bandwidth efficiency, because it makes it less likely that the raft log entries contained within the memtable will be truncated before the flush.

We could consider changes to our memtable implementation that reduce the memory overhead of duplicate keys.

There's a paper about Trie Memtables in Cassandra that may prove illustrative.

Jira issue: PEBBLE-218

jbowens commented 1 year ago

There’s a very simple optimization we can pursue. Currently key trailers are encoded alongside the user key. We can lift them onto the node struct. Then during insertion, if we find an existing node with the same user key, we encode the existing key's arena offset in the new node instead of copying the key bytes again.
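A minimal sketch of the idea, not Pebble's actual arena or skiplist code (the `arena`, `node`, and `insert` names here are hypothetical, and the linear scan stands in for a real skiplist search): with the 8-byte trailer lifted onto the node, a duplicate user key costs only a new node that points at the existing key bytes.

```go
package main

import "fmt"

// arena is a stand-in for the memtable arena: a flat byte slice
// addressed by offsets.
type arena struct{ buf []byte }

func (a *arena) add(key []byte) uint32 {
	off := uint32(len(a.buf))
	a.buf = append(a.buf, key...)
	return off
}

// node stores the trailer (seqnum and kind) out of line from the user
// key, so duplicate user keys can share one key copy via keyOffset.
type node struct {
	trailer   uint64
	keyOffset uint32
	keySize   uint32
}

// insert reuses an existing node's key bytes when the user key matches;
// only a brand-new user key copies bytes into the arena.
func insert(a *arena, nodes []node, key []byte, trailer uint64) []node {
	for _, n := range nodes {
		if string(a.buf[n.keyOffset:n.keyOffset+n.keySize]) == string(key) {
			// Duplicate user key: reference the existing copy.
			return append(nodes, node{trailer: trailer, keyOffset: n.keyOffset, keySize: n.keySize})
		}
	}
	return append(nodes, node{trailer: trailer, keyOffset: a.add(key), keySize: uint32(len(key))})
}

func main() {
	var a arena
	var nodes []node
	nodes = insert(&a, nodes, []byte("lease/42"), 100<<8)
	nodes = insert(&a, nodes, []byte("lease/42"), 101<<8) // overwrite: no new key bytes
	fmt.Println(len(a.buf), len(nodes))                   // one key copy, two nodes
}
```

For a workload that rewrites the same lease key thousands of times, the arena then grows by a node per write rather than a node plus a full key copy.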

jbowens commented 1 year ago

@erikgrinaker — fyi, in case you observe high flush volume affecting workloads with expiration-based leases enabled

jbowens commented 1 year ago

An alternative: construct the skiplist over just user keys. If an inserter discovers a node already exists with the current user key, extend the node by allocating a special "extentNode" that encodes the trailer, a next pointer (to the next extentNode), and the value (varint encoded). The primary node struct would be extended with an extentNode pointer. This allows a tombstone to be encoded with just 16 additional bytes (8 byte trailer + 4 byte next extent node + 1 byte zero value len varint + 3 bytes padding). If an iterator positions on a node with an extent, it'll need to descend the extent list, buffering if necessary. If two batches modify a key simultaneously, the latter one will have already allocated an ordinary skiplist node, resulting in some wasted space. But both these cases should be rare.
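A rough sketch of the extent-list shape, using Go pointers in place of arena offsets (the `keyNode`, `extentNode`, and `addVersion` names are hypothetical, and `encodedExtentSize` just illustrates the 16-byte tombstone arithmetic above; the real layout would live in the arena):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// extentNode holds one additional version of a user key: its 8-byte
// trailer, a link to the next (older) extent, and the value.
type extentNode struct {
	trailer uint64
	next    *extentNode // older version of the same user key
	value   []byte      // varint-length-prefixed in an arena layout
}

// keyNode is a skiplist node keyed only by the user key; additional
// versions hang off the extents list instead of new skiplist nodes.
type keyNode struct {
	userKey []byte
	trailer uint64 // newest version
	value   []byte
	extents *extentNode // older versions, newest first
}

// addVersion records a new version, demoting the current newest into
// the extent list rather than copying the user key again.
func (n *keyNode) addVersion(trailer uint64, value []byte) {
	n.extents = &extentNode{trailer: n.trailer, next: n.extents, value: n.value}
	n.trailer, n.value = trailer, value
}

// encodedExtentSize computes the arena cost of one extent: 8-byte
// trailer + 4-byte next offset + varint value length + value bytes,
// rounded up to 4-byte alignment.
func encodedExtentSize(value []byte) int {
	var lenBuf [binary.MaxVarintLen64]byte
	n := 8 + 4 + binary.PutUvarint(lenBuf[:], uint64(len(value))) + len(value)
	return (n + 3) &^ 3
}

func main() {
	n := &keyNode{userKey: []byte("lease/42"), trailer: 100 << 8}
	n.addVersion(101<<8, nil)           // a tombstone-like overwrite
	fmt.Println(encodedExtentSize(nil)) // 16 bytes for a zero-length value
}
```

An iterator positioned on a `keyNode` would yield the node's own version first, then walk `extents` for older versions, which matches the "descend the extent list" behavior described above.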