petermattis opened 5 years ago
Cc @ajkr, @nvanbenschoten
One other bit to figure out is what threshold to consider a value as large (and thus a blob). Badger makes this configurable, defaulting to 32 bytes. That seems small in my opinion. My intuition is that something around 1KB is probably right.
Interesting. I always wondered how the blob index would be updated after GC. Some thoughts:
Sorting by original file number / offset doesn't seem ideal for scans since it leaves no possibility of readahead.
Which sorting are you referring to? The sorting that takes place of the blob pointers after a compaction?
BlobDB suffered from a problem where, due to appending each blob to the blob-log during write, each one had to be compressed individually. Compressing 1KB blobs achieves a less effective compression ratio than compressing 32KB data blocks. I think with TitanDB's design (first write to the WAL/memtable as usual, then later during flush, sort and write out both a blob-log and an L0 SST where values point into the blob-log) grouping blobs for compression seems possible, though AFAIK they don't do it and I haven't thought through the GC implications. Or dictionary compression could be used on 1KB blobs to mitigate (but not completely fix) the compression ratio regression.
I'm not familiar with TitanDB's design. Do you have a link to something I can read?
When blob files are rewritten as sstables, we'd automatically get the compression benefits of compressing blocks instead of individual blobs.
How do we reason about write-amplification incurred by GC vs. the write-amplification that would've happened without blob separation?
I don't know. Perhaps the WiscKey paper has thoughts about this. We can reduce GC-induced write amplification by increasing space amplification. I'll think about this.
The TitanDB design is interesting. It looks like every sstable has an associated set of blob files (which seem to have an sstable-style structure). Blobs are indexed by key in the blob file, so there are similarities to what I described above.
One advantage to writing the blobs to the WAL/memtable and then segregating them at flush is that doing so removes a sync. If the blobs are segregated at commit time, then we need to sync both the blob log and the WAL. I think that is worthwhile. That also opens the possibility of removing blob logs. Perhaps they are useful as a performance win (seeking to an offset is faster than a lookup in an sstable), but flushing could easily just create blob sstables.
Which sorting are you referring to? The sorting that takes place of the blob pointers after a compaction?
Right. As you mentioned in Titan the blobs are indexed by key so they are ordered (at least within a blob-log) which can help range scans.
I don't know. Perhaps the WiscKey paper has thoughts about this. We can reduce GC-induced write amplification by increasing space amplification. I'll think about this.
I'm still thinking about this too and don't have the intuition why GC is fundamentally better than compaction. The WiscKey paper didn't help as their analytic calculations omitted GC and their write-amp experiments were for insertion-only DBs. This was more helpful (http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html) but I'm still not sure why it'd have better effects than reducing fanout (max_bytes_for_level_multiplier) or using tiered compaction.
Right. As you mentioned in Titan the blobs are indexed by key so they are ordered (at least within a blob-log) which can help range scans.
Got it.
This was more helpful (http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html) but I'm still not sure why it'd have better effects than reducing fanout (max_bytes_for_level_multiplier) or using tiered compaction.
We can't generally use tiered compaction unless we support multiple column families. I suppose blobs could be considered as a specialized version of column families that are enabled "automatically". The GC of the blob level is essentially the same as universal compaction (I think; my understanding of the terminology is sometimes confused). I think it is interesting that the blob level allows trading off space amplification for write amplification. Level-based compaction trades off space+read amplification for write amplification. My suspicion is that segregated blob storage is a useful enabler of the design space, but I'm still searching for that better intuitive understanding of how to tune the rate of GC.
Badger makes this configurable, defaulting to 32 bytes. That seems small in my opinion. My intuition is that something around 1KB is probably right.
32 bytes does seem small to me as well. Are these blobs stored in the block cache like normal data blocks? How does that work?
Compacting blobs down back into ssts is interesting. @ajkr's question about "How do we reason about write-amplification incurred by GC vs. the write-amplification that would've happened without blob separation?" sums up my uncertainty too. Perhaps naively, I would expect that GCing blobs back into an SST would undo a lot of the benefit we get from segregated value storage.
When I first read this proposal I thought you were suggesting that we introduce an extra level of indirection through an sst-backed indexing structure pointing into the blog files. I wonder if there's something to explore there.
Also, I couldn't help but think of https://www.flickr.com/photos/k-80/4663400017 while reading through this.
I missed the issue of moving values and the LSM. Thanks for making that clear to me.
I expect that cache-amp for index+log is worse than you mention above when trying to achieve at most one IO per point query -- but maybe that constraint is less important to you. If value log GC/compaction requires index queries to determine whether a key is live then that is also likely to need a cached index or GC might be slow/inefficient.
My understanding of value log management that you propose is:
1) write value log segments - 1.blog, 2.blog, 3.blog, ...
2) remember the order in which these are created, call this the value log array (VLA)
3) over time merge adjacent entries in the VLA that are replaced by merge output
Examples:
This requires a persistent mapping to get from .blog name to current container (.blog or .sst). Assume this is a map from .blog to current file. It would be nice if there were a way to eventually remove entries from that map that are no longer needed. If a .blog file is 64 MB then there are 2^25 .blog files per PB written. Questions about how many PB might be written over the lifetime of the database are why I am curious about being able to GC that map.
You don't owe me an answer but if you are building this index+log from scratch, are you sure that an LSM is the best choice for the index given the complexity above (the inability to update the index with the new location of the value, the extra search required to find the value, the size of the map, GC for the map, the need to cache the LSM to make GC queries efficient)?
Note that the row cache works better with value log entries when the key is stable
My understanding of value log management that you propose is:
Yep, this all looks accurate.
This requires a persistent mapping to get from .blog name to current container (.blog or .sst). Assume this is a map from .blog to current file. It would be nice if there were a way to eventually remove entries from that map that are no longer needed. If a .blog file is 64 MB then there are 2^25 .blog files per PB written. Questions about how many PB might be written over the lifetime of the database are why I am curious about being able to GC that map.
The mapping you're describing is simply an array of fileMetadata akin to a level in the LSM. I believe I describe this above. There are only as many entries in this array as there is data in the value log.
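To make that concrete, here is a minimal sketch of such an array and the lookup against it. blobFileMeta, lookupBlobFile, and the field names are hypothetical, not actual Pebble types. Entries covering merged .blog files collapse into a single entry, which is why the array (the "map") shrinks as GC proceeds rather than growing without bound:

```go
package main

import (
	"fmt"
	"sort"
)

// blobFileMeta is a hypothetical entry in the blob level's file list,
// persisted to the MANIFEST like other file metadata. Each entry covers
// a contiguous range of original .blog file numbers.
type blobFileMeta struct {
	minFileNum uint64 // smallest original .blog file number covered
	maxFileNum uint64 // largest original .blog file number covered
	fileNum    uint64 // file currently holding the blobs (.blog or .sst)
	isSSTable  bool   // true once GC rewrote the blobs into an sstable
}

// lookupBlobFile binary searches the array (sorted by file number) for
// the entry covering the original file number in a blob pointer.
func lookupBlobFile(files []blobFileMeta, origFileNum uint64) (blobFileMeta, bool) {
	i := sort.Search(len(files), func(i int) bool {
		return files[i].maxFileNum >= origFileNum
	})
	if i < len(files) && files[i].minFileNum <= origFileNum {
		return files[i], true
	}
	return blobFileMeta{}, false
}

func main() {
	// After GC: 000007.blog and 000010.blog were rewritten into 000031.sst,
	// so the two original entries collapsed into one.
	files := []blobFileMeta{
		{minFileNum: 7, maxFileNum: 10, fileNum: 31, isSSTable: true},
		{minFileNum: 12, maxFileNum: 12, fileNum: 12},
	}
	if m, ok := lookupBlobFile(files, 7); ok {
		fmt.Printf("blob pointer <7,...> now lives in file %06d (sstable=%v)\n", m.fileNum, m.isSSTable)
	}
}
```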
You don't owe me an answer but if you are building this index+log from scratch, are you sure that an LSM is the best choice for the index given the complexity above (the inability to update the index with the new location of the value, the extra search required to find the value, the size of the map, GC for the map, the need to cache the LSM to make GC queries efficient)?
This isn't actively being built. I wrote the above after trying to understand how other index+log systems work. I'm not convinced it is worthwhile and the expense of GC queries is definitely a concern. The inability to update the location of the value in the LSM is quite the oddity I discovered in the WiscKey approach. But the reason to use an LSM is if the application is already using an LSM and wants blob storage for some reason. One place this could be useful in CockroachDB's usage is for Raft log entries which are written once and read almost never. On the other hand, there might be an even more bespoke structure that is better for Raft log storage (note that in CockroachDB there are 10s of thousands of Raft logs per node, so it isn't feasible to actually use a separate file per Raft log).
I was playing around with Badger today and noticed something: it doesn't perform automatic value log GC. Instead you have to call DB.RunValueLogGC periodically from your application. That certainly simplifies the "Triggering GC" question, yet it also makes Badger's write throughput and write amplification look better than it would in a full setup.
Many filesystems now provide an interface or another mechanism to punch holes into files, that is, to make byte ranges in them sparse (freeing up space). For example, on a compressed zfs, writing out zeroes means that range will automatically be sparse. Maybe you could add this as an option, instead of/alongside rewriting the logs, as it would spare a huge amount of read/write IO. Because there are multiple implementations of this, a settable strategy could be used (e.g. unset means the current behaviour of rewriting; zeroing means no rewrites and in-place zeroing out of deleted ranges; other, platform-specific syscalls (Linux, Solaris fallocate) mean punching holes programmatically).
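For reference, a minimal sketch of what the hole-punching variant could look like on Linux using golang.org/x/sys/unix; the file name and byte range are made up, and other platforms would need their own code paths:

```go
//go:build linux

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.OpenFile("000007.blog", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Punch a hole over a dead blob at offset 4096 of length 8192.
	// KEEP_SIZE leaves the file size unchanged; the byte range becomes
	// sparse and the underlying blocks are freed.
	const mode = unix.FALLOC_FL_PUNCH_HOLE | unix.FALLOC_FL_KEEP_SIZE
	if err := unix.Fallocate(int(f.Fd()), mode, 4096, 8192); err != nil {
		log.Fatal(err)
	}
}
```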
This is a bit of an abstract idea, but I wonder if there’s a point in the solution space where we combine this with #3453. Abstractly, blob storage can be thought of as dictionary compression storing the dictionary externally without breaking the value apart. I’m suspicious that some usages of storing large values involve significant duplication between values.
Background
The WiscKey paper showed the value of segregating storage for large values in order to reduce write amplification. Badger is a Go KV store implementing that design. RocksDB has support for storing blobs (binary large objects) outside of the main LSM. Unfortunately, this blob-db is incompatible with the normal RocksDB API, making it unusable for us as-is. This issue outlines how I think blob storage should be added to Pebble.
Overview
The base WiscKey idea is that instead of storing large values inline in the LSM, they are stored in a separate value-log. The LSM merely contains a value-pointer. Storing large values outside of the LSM significantly reduces the size of the LSM which improves caching and reduces the overhead of compactions.
The preliminary blob support in RocksDB has outlined the basic structure for how this could work in Pebble. A new InternalKeyKindBlobIndex is added indicating that the value for a key is stored elsewhere. The value looks like: (Note that RocksDB also provides support for a TTL on blobs, but I don't think this is usable for CockroachDB and is therefore not needed in Pebble.)
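The exact value layout isn't reproduced here. As a rough, hypothetical illustration (field names invented), the value presumably needs to carry at least the blob pointer described below, for example:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blobIndexValue is a guess at the contents of an InternalKeyKindBlobIndex
// value: enough information to locate the blob in the blob level. The
// field names are hypothetical, not Pebble's.
type blobIndexValue struct {
	fileNum uint64 // blob log file the blob was originally written to
	offset  uint64 // byte offset of the blob within that file
	size    uint64 // length of the blob, so reads know how much to fetch
}

// encode appends the blob index value using varint encoding.
func (v blobIndexValue) encode(buf []byte) []byte {
	buf = binary.AppendUvarint(buf, v.fileNum)
	buf = binary.AppendUvarint(buf, v.offset)
	buf = binary.AppendUvarint(buf, v.size)
	return buf
}

func main() {
	v := blobIndexValue{fileNum: 7, offset: 1000, size: 4096}
	fmt.Printf("%x\n", v.encode(nil))
}
```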
Blobs are stored in a series of blob logs. A background process periodically walks over the blob logs, reading the entries and checking to see if the value is still "live" in the LSM. This process is called value GC. If the value is still "live", it is written to the active blob log, and the LSM is updated.
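A sketch of that GC pass, assuming liveness is determined by looking the key up in the LSM and comparing blob pointers. The interfaces here are invented for illustration; this is the naive WiscKey/BlobDB-style approach whose LSM-update step is critiqued below:

```go
package blobgc

// blobEntry is one record read back from a blob log: the user key it
// was written under, its blob pointer, and the value bytes.
type blobEntry struct {
	key     []byte
	fileNum uint64
	offset  uint64
	value   []byte
}

// store abstracts the pieces of the DB that GC needs; purely illustrative.
type store interface {
	// readBlobLog returns all entries in a blob log file.
	readBlobLog(fileNum uint64) ([]blobEntry, error)
	// currentBlobPointer returns the blob pointer stored in the LSM for
	// key, if any.
	currentBlobPointer(key []byte) (fileNum, offset uint64, ok bool)
	// rewrite writes a still-live value to the active blob log and
	// updates the LSM to point at the new location.
	rewrite(key, value []byte) error
}

// gcBlobLog walks one blob log, keeping only values that are still live.
func gcBlobLog(s store, fileNum uint64) error {
	entries, err := s.readBlobLog(fileNum)
	if err != nil {
		return err
	}
	for _, e := range entries {
		fn, off, ok := s.currentBlobPointer(e.key)
		if !ok || fn != e.fileNum || off != e.offset {
			continue // value is dead: deleted or superseded
		}
		if err := s.rewrite(e.key, e.value); err != nil {
			return err
		}
	}
	return nil // the old blob log can now be deleted
}
```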
Critique
An LSM does not allow mutating existing entries. In order to change the value for a key, a new value needs to be written which masks the old value. In Pebble/RocksDB/Badger, each version of a key/value pair has an associated sequence number (timestamp in Badger). RocksDB/Pebble take advantage of these sequence numbers to provide immutable snapshots. But the immutability of existing values presents a problem for blob storage. How do we update the LSM when a value is moved from one blob log to another? AFAICT, RocksDB's blob-db punts on this issue and just updates the current version. Badger at least attempts to deal with the problem:
The way this logical move is implemented in Badger is to rewrite the key as !badger!move<key> (the literal string !badger!move is prepended to the key). I am not fond of this design as special logic is used on the read-side in order to find the location for a value.

The Blob Level
Blob pointers uniquely identify a blob. Blob pointers need to be immutable after they are created, because due to LSM immutability, we can't update the blob pointer without violating a core tenet of an LSM. A key observation: we don't have to update the blob pointer when a blob moves, we just need to be able to find a blob given the original blob pointer even if the blob's actual location changes. A blob pointer is composed of <file-num,offset>. File numbers are monotonically increasing and never reused, so we don't have to worry about a blob pointer ever being reused. We just need to provide an index that allows locating a blob by its blob pointer. In an LSM, the readily available indexing structure is an sstable.

Putting this together, blobs are stored in a series of logs and sstables. When first created, there is a single log that is being appended to, say 000007.blog (blog is short for blob-log). As blobs get written to this log, they are given blob pointers with file-num=7 and the offset where they are written in the log. As the log fills up, additional logs are added (e.g. 000010.blog).

A background process periodically loops over these files and rewrites them, only retaining blobs that are still "live" in the LSM. The notable difference from Badger/blob-db is that rewritten blobs are stored in an sstable. For example, if 000007.blog contains 2 "live" blobs with blob pointers <7,0> and <7,1000>, GC will rewrite these blobs into an sstable where the keys in the sstable are <7,0> and <7,1000>, and the values are the blob data. It is likely that multiple adjacent logs / sstables would be processed into a single rewritten sstable. In this example, 000007.blog and 000010.blog could be rewritten into 000031.sst.

This sstable+log structure can be viewed as a single level in an LSM. Just as for a normal level, the metadata for this blob level is []fileMetadata which would be persisted to the MANIFEST along with all other LSM metadata. Locating a blob would involve a binary search for the file containing the blob pointer, just as retrieving a key in the main LSM involves a binary search for the sstable containing the key. If the file containing the blob is a log, we can seek to the desired offset to read the blob. Otherwise we retrieve the blob from the sstable using a normal sstable lookup.
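As a small illustration of the blob-pointer-as-key idea (a sketch, not Pebble code): encoding the pointer big-endian makes keys in the rewritten sstable sort by original file number and then offset, which is the order the blobs were written in.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeBlobPointer encodes a <file-num,offset> blob pointer as a key.
// Big-endian encoding makes byte-wise key comparison match numeric
// ordering, so pointers sort by original file number, then offset.
func encodeBlobPointer(fileNum, offset uint64) []byte {
	key := make([]byte, 16)
	binary.BigEndian.PutUint64(key[:8], fileNum)
	binary.BigEndian.PutUint64(key[8:], offset)
	return key
}

func main() {
	a := encodeBlobPointer(7, 0)
	b := encodeBlobPointer(7, 1000)
	c := encodeBlobPointer(10, 16)
	fmt.Println(bytes.Compare(a, b) < 0, bytes.Compare(b, c) < 0) // true true
}
```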
Triggering GC

When should GC be triggered? Simply walking through the files in the blob level in order is one approach, but we can do better. We should GC a blob file when it has a significant fraction of "dead" data. We can keep an exact count of this dead data by tracking when blob pointers are dropped during compactions (because a tombstone covering the blob has reached the bottommost level). A compaction would keep a []blobPointer slice that it populates with dropped blob pointers. When the compaction is complete, the dropped blob pointers are sorted and the sorted blob pointers are walked in parallel with the in-memory blob file metadata, updating a fileMetadata.deadBytes field. Recall that this blob file metadata will be persisted to the MANIFEST as well, so the deadBytes field will be precise.

A background process would periodically wake up and look at the blob file metadata, looking for contiguous runs of files that will "compact" down to a much smaller size. Note that we'll be able to accurately predict both the space savings and the size of the new file.
Jira issue: PEBBLE-183
Epic CRDB-20379