petermattis opened 5 years ago
Cc @ajkr, @nvanbenschoten
One other bit to figure out is what threshold to consider a value as large (and thus a blob). Badger makes this configurable, defaulting to 32 bytes. That seems small in my opinion. My intuition is that something around 1KB is probably right.
Interesting. I always wondered how the blob index would be updated after GC. Some thoughts:
Sorting by original file number / offset doesn't seem ideal for scans since it leaves no possibility of readahead.
Which sorting are you referring to? The sorting that takes place of the blob pointers after a compaction?
BlobDB suffered from a problem where, due to appending each blob to the blob-log during write, each one had to be compressed individually. Compressing 1KB blobs achieves a less effective compression ratio than compressing 32KB data blocks. I think with TitanDB's design (first write to the WAL/memtable as usual, then later during flush, sort and write out both a blob-log and an L0 SST where values point into the blob-log) grouping blobs for compression seems possible, though AFAIK they don't do it and I haven't thought through the GC implications. Or dictionary compression could be used on 1KB blobs to mitigate (but not completely fix) the compression ratio regression.
I'm not familiar with TitanDB's design. Do you have a link to something I can read?
When blob files are rewritten as sstables, we'd automatically get the compression benefits of compressing blocks instead of individual blobs.
How do we reason about write-amplification incurred by GC vs. the write-amplification that would've happened without blob separation?
I don't know. Perhaps the WiscKey paper has thoughts about this. We can reduce GC-induced write amplification by increasing space amplification. I'll think about this.
The TitanDB design is interesting. It looks like every sstable has an associated set of blob files (which seem to have an sstable-style structure). Blobs are indexed by key in the blob file, so there are similarities to what I described above.
One advantage to writing the blobs to the WAL/memtable and then segregating them at flush is that doing so removes a sync. If the blobs are segregated at commit time, then we need to sync both the blob log and the WAL. I think that is worthwhile. That also opens the possibility of removing blob logs. Perhaps they are useful as a performance win (seeking to an offset is faster than a lookup in an sstable), but flushing could easily just create blob sstables.
Which sorting are you referring to? The sorting that takes place of the blob pointers after a compaction?
Right. As you mentioned in Titan the blobs are indexed by key so they are ordered (at least within a blob-log) which can help range scans.
I don't know. Perhaps the WiscKey paper has thoughts about this. We can reduce GC-induced write amplification by increasing space amplification. I'll think about this.
I'm still thinking about this too and don't have the intuition why GC is fundamentally better than compaction. The WiscKey paper didn't help as their analytic calculations omitted GC and their write-amp experiments were for insertion-only DBs. This was more helpful (http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html) but I'm still not sure why it'd have better effects than reducing fanout (max_bytes_for_level_multiplier) or using tiered compaction.
Right. As you mentioned in Titan the blobs are indexed by key so they are ordered (at least within a blob-log) which can help range scans.
Got it.
This was more helpful (http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html) but I'm still not sure why it'd have better effects than reducing fanout (max_bytes_for_level_multiplier) or using tiered compaction.
We can't generally use tiered compaction unless we support multiple column families. I suppose blobs could be considered as a specialized version of column families that are enabled "automatically". The GC of the blob level is essentially the same as universal compaction (I think; my understanding of the terminology is sometimes confused). I think it is interesting that the blob level allows trading off space amplification for write amplification. Level-based compaction trades off space+read amplification for write amplification. My suspicion is that segregated blob storage is a useful enabler of the design space, but I'm still searching for that better intuitive understanding of how to tune the rate of GC.
Badger makes this configurable, defaulting to 32 bytes. That seems small in my opinion. My intuition is that something around 1KB is probably right.
32 bytes does seem small to me as well. Are these blobs stored in the block cache like normal data blocks? How does that work?
Compacting blobs down back into ssts is interesting. @ajkr's question about "How do we reason about write-amplification incurred by GC vs. the write-amplification that would've happened without blob separation?" sums up my uncertainty too. Perhaps naively, I would expect that GCing blobs back into an SST would undo a lot of the benefit we get from segregated value storage.
When I first read this proposal I thought you were suggesting that we introduce an extra level of indirection through an sst-backed indexing structure pointing into the blog files. I wonder if there's something to explore there.
Also, I couldn't help but think of https://www.flickr.com/photos/k-80/4663400017 while reading through this.
I missed the issue of moving values and the LSM. Thanks for making that clear to me.
I expect that cache-amp for index+log is worse than you mention above when trying to achieve at most one IO per point query -- but maybe that constraint is less important to you. If value log GC/compaction requires index queries to determine whether a key is live then that is also likely to need a cached index or GC might be slow/inefficient.
My understanding of value log management that you propose is:
1) write value log segments - 1.blog, 2.blog, 3.blog, ...
2) remember the order in which these are created, call this the value log array (VLA)
3) over time merge adjacent entries in the VLA that are replaced by merge output
Examples:
This requires a persistent mapping to get from .blog name to current container (.blog or .sst). Assume this is a map from .blog to current file. It would be nice if there were a way to eventually remove entries from that map that are no longer needed. If a .blog file is 64 MB then there are 2^25 .blog files per PB written. Questions about how many PB might be written over the lifetime of the database are why I am curious about being able to GC that map.
You don't owe me an answer but if you are building this index+log from scratch, are you sure that an LSM is the best choice for the index given the complexity above (the inability to update the index with the new location of the value, the extra search required to find the value, the size of the map, GC for the map, the need to cache the LSM to make GC queries efficient)?
Note that the row cache works better with value log entries when the key is stable
My understanding of value log management that you propose is:
Yep, this all looks accurate.
This requires a persistent mapping to get from .blog name to current container (.blog or .sst). Assume this is a map from .blog to current file. It would be nice if there were a way to eventually remove entries from that map that are no longer needed. If a .blog file is 64 MB then there are 2^25 .blog files per PB written. Questions about how many PB might be written over the lifetime of the database are why I am curious about being able to GC that map.
The mapping you're describing is simply an array of fileMetadata akin to a level in the LSM. I believe I describe this above. There are only as many entries in this array as there is data in the value log.
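To make that concrete, here is a minimal sketch of such an array and the lookup against it. blobFileMeta, lookupBlobFile, and the field names are hypothetical, not actual Pebble types. Entries covering merged .blog files collapse into a single entry, which is why the array (the "map") shrinks as GC proceeds rather than growing without bound:

```go
package main

import (
	"fmt"
	"sort"
)

// blobFileMeta is a hypothetical entry in the blob level's file list,
// persisted to the MANIFEST like other file metadata. Each entry covers
// a contiguous range of original .blog file numbers.
type blobFileMeta struct {
	minFileNum uint64 // smallest original .blog file number covered
	maxFileNum uint64 // largest original .blog file number covered
	fileNum    uint64 // file currently holding the blobs (.blog or .sst)
	isSSTable  bool   // true once GC rewrote the blobs into an sstable
}

// lookupBlobFile binary searches the array (sorted by file number) for
// the entry covering the original file number in a blob pointer.
func lookupBlobFile(files []blobFileMeta, origFileNum uint64) (blobFileMeta, bool) {
	i := sort.Search(len(files), func(i int) bool {
		return files[i].maxFileNum >= origFileNum
	})
	if i < len(files) && files[i].minFileNum <= origFileNum {
		return files[i], true
	}
	return blobFileMeta{}, false
}

func main() {
	// After GC: 000007.blog and 000010.blog were rewritten into 000031.sst,
	// so the two original entries collapsed into one.
	files := []blobFileMeta{
		{minFileNum: 7, maxFileNum: 10, fileNum: 31, isSSTable: true},
		{minFileNum: 12, maxFileNum: 12, fileNum: 12},
	}
	if m, ok := lookupBlobFile(files, 7); ok {
		fmt.Printf("blob pointer <7,...> now lives in file %06d (sstable=%v)\n", m.fileNum, m.isSSTable)
	}
}
```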
You don't owe me an answer but if you are building this index+log from scratch, are you sure that an LSM is the best choice for the index given the complexity above (the inability to update the index with the new location of the value, the extra search required to find the value, the size of the map, GC for the map, the need to cache the LSM to make GC queries efficient)?
This isn't actively being built. I wrote the above after trying to understand how other index+log systems work. I'm not convinced it is worthwhile and the expense of GC queries is definitely a concern. The inability to update the location of the value in the LSM is quite the oddity I discovered in the WiscKey approach. But the reason to use an LSM is if the application is already using an LSM and wants blob storage for some reason. One place this could be useful in CockroachDB's usage is for Raft log entries which are written once and read almost never. On the other hand, there might be an even more bespoke structure that is better for Raft log storage (note that in CockroachDB there are 10s of thousands of Raft logs per node, so it isn't feasible to actually use a separate file per Raft log).
I was playing around with Badger today and noticed something: it doesn't perform automatic value log GC. Instead you have to call DB.RunValueLogGC periodically from your application. That certainly simplifies the "Triggering GC" question, yet it also makes Badger's write throughput and write amplification look better than it would in a full setup.
Many filesystems now provide an interface or another mechanism to punch holes into files, that is, to make byte ranges in them sparse (freeing up space). For example, on a compressed zfs, writing out zeroes means that range will automatically be sparse. Maybe you could add this as an option, instead of/alongside rewriting the logs, as it would spare a huge amount of read/write IO. Because there are multiple implementations of this, a settable strategy could be used (e.g. unset means the current behaviour of rewriting; zeroing means no rewrites and in-place zeroing out of deleted ranges; other, platform-specific syscalls (Linux, Solaris fallocate) mean punching holes programmatically).
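For reference, a minimal sketch of what the hole-punching variant could look like on Linux using golang.org/x/sys/unix; the file name and byte range are made up, and other platforms would need their own code paths:

```go
//go:build linux

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.OpenFile("000007.blog", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Punch a hole over a dead blob at offset 4096 of length 8192.
	// KEEP_SIZE leaves the file size unchanged; the byte range becomes
	// sparse and the underlying blocks are freed.
	const mode = unix.FALLOC_FL_PUNCH_HOLE | unix.FALLOC_FL_KEEP_SIZE
	if err := unix.Fallocate(int(f.Fd()), mode, 4096, 8192); err != nil {
		log.Fatal(err)
	}
}
```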
This is a bit of an abstract idea, but I wonder if there’s a point in the solution space where we combine this with #3453. Abstractly, blob storage can be thought of as dictionary compression storing the dictionary externally without breaking the value apart. I’m suspicious that some usages of storing large values involve significant duplication between values.
Background
The WiscKey paper showed the value of segregating storage for large values in order to reduce write amplification. Badger is a Go KV store implementing that design. RocksDB has support for storing blobs (binary large objects) outside of the main LSM. Unfortunately, this blob-db is incompatible with the normal RocksDB API, making it unusable for us as-is. This issue outlines how I think blob storage should be added to Pebble.
Overview
The base WiscKey idea is that instead of storing large values inline in the LSM, they are stored in a separate value-log. The LSM merely contains a value-pointer. Storing large values outside of the LSM significantly reduces the size of the LSM which improves caching and reduces the overhead of compactions.
The preliminary blob support in RocksDB has outlined the basic structure for how this could work in Pebble. A new InternalKeyKindBlobIndex is added indicating that the value for a key is stored elsewhere. The value looks like: (Note that RocksDB also provides support for a TTL on blobs, but I don't think this is usable for CockroachDB and is therefore not needed in Pebble.)
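The exact value layout isn't reproduced here. As a rough, hypothetical illustration (field names invented), the value presumably needs to carry at least the blob pointer described below, for example:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blobIndexValue is a guess at the contents of an InternalKeyKindBlobIndex
// value: enough information to locate the blob in the blob level. The
// field names are hypothetical, not Pebble's.
type blobIndexValue struct {
	fileNum uint64 // blob log file the blob was originally written to
	offset  uint64 // byte offset of the blob within that file
	size    uint64 // length of the blob, so reads know how much to fetch
}

// encode appends the blob index value using varint encoding.
func (v blobIndexValue) encode(buf []byte) []byte {
	buf = binary.AppendUvarint(buf, v.fileNum)
	buf = binary.AppendUvarint(buf, v.offset)
	buf = binary.AppendUvarint(buf, v.size)
	return buf
}

func main() {
	v := blobIndexValue{fileNum: 7, offset: 1000, size: 4096}
	fmt.Printf("%x\n", v.encode(nil))
}
```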
Blobs are stored in a series of blob logs. A background process periodically walks over the blob logs, reading the entries and checking to see if the value is still "live" in the LSM. This process is called value GC. If the value is still "live", it is written to the active blob log, and the LSM is updated.
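A sketch of that GC pass, assuming liveness is determined by looking the key up in the LSM and comparing blob pointers. The interfaces here are invented for illustration; this is the naive WiscKey/BlobDB-style approach whose LSM-update step is critiqued below:

```go
package blobgc

// blobEntry is one record read back from a blob log: the user key it
// was written under, its blob pointer, and the value bytes.
type blobEntry struct {
	key     []byte
	fileNum uint64
	offset  uint64
	value   []byte
}

// store abstracts the pieces of the DB that GC needs; purely illustrative.
type store interface {
	// readBlobLog returns all entries in a blob log file.
	readBlobLog(fileNum uint64) ([]blobEntry, error)
	// currentBlobPointer returns the blob pointer stored in the LSM for
	// key, if any.
	currentBlobPointer(key []byte) (fileNum, offset uint64, ok bool)
	// rewrite writes a still-live value to the active blob log and
	// updates the LSM to point at the new location.
	rewrite(key, value []byte) error
}

// gcBlobLog walks one blob log, keeping only values that are still live.
func gcBlobLog(s store, fileNum uint64) error {
	entries, err := s.readBlobLog(fileNum)
	if err != nil {
		return err
	}
	for _, e := range entries {
		fn, off, ok := s.currentBlobPointer(e.key)
		if !ok || fn != e.fileNum || off != e.offset {
			continue // value is dead: deleted or superseded
		}
		if err := s.rewrite(e.key, e.value); err != nil {
			return err
		}
	}
	return nil // the old blob log can now be deleted
}
```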
Critique
An LSM does not allow mutating existing entries. In order to change the value for a key, a new value needs to be written which masks the old value. In Pebble/RocksDB/Badger, each version of a key/value pair has an associated sequence number (timestamp in Badger). RocksDB/Pebble take advantage of these sequence numbers to provide immutable snapshots. But the immutability of existing values presents a problem for blob storage. How do we update the LSM when a value is moved from one blob log to another? AFAICT, RocksDB's blob-db punts on this issue and just updates the current version. Badger at least attempts to deal with the problem:
The way this logical move is implemented in Badger is to rewrite the key as !badger!move<key> (the literal string !badger!move is prepended to the key). I am not fond of this design as special logic is used on the read-side in order to find the location for a value.

The Blob Level
Blob pointers uniquely identify a blob. Blob pointers need to be immutable after they are created, because due to LSM immutability, we can't update the blob pointer without violating a core tenet of an LSM. A key observation: we don't have to update the blob pointer when a blob moves, we just need to be able to find a blob given the original blob pointer even if the blob's actual location changes. A blob pointer is composed of <file-num,offset>. File numbers are monotonically increasing and never reused, so we don't have to worry about a blob pointer ever being reused. We just need to provide an index that allows locating a blob by its blob pointer. In an LSM, the readily available indexing structure is an sstable.

Putting this together, blobs are stored in a series of logs and sstables. When first created, there is a single log that is being appended to, say 000007.blog (blog is short for blob-log). As blobs get written to this log, they are given blob pointers with file-num=7 and the offset where they are written in the log. As the log fills up, additional logs are added (e.g. 000010.blog).

A background process periodically loops over these files and rewrites them, only retaining blobs that are still "live" in the LSM. The notable difference from Badger/blob-db is that rewritten blobs are stored in an sstable. For example, if 000007.blog contains 2 "live" blobs with blob pointers <7,0> and <7,1000>, GC will rewrite these blobs into an sstable where the keys in the sstable are <7,0> and <7,1000>, and the values are the blob data. It is likely that multiple adjacent logs / sstables would be processed into a single rewritten sstable. In this example, 000007.blog and 000010.blog could be rewritten into 000031.sst.

This sstable+log structure can be viewed as a single level in an LSM. Just as for a normal level, the metadata for this blob level is []fileMetadata which would be persisted to the MANIFEST along with all other LSM metadata. Locating a blob would involve a binary search for the file containing the blob pointer, just as retrieving a key in the main LSM involves a binary search for the sstable containing the key. If the file containing the blob is a log, we can seek to the desired offset to read the blob. Otherwise we retrieve the blob from the sstable using a normal sstable lookup.
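As a small illustration of the blob-pointer-as-key idea (a sketch, not Pebble code): encoding the pointer big-endian makes keys in the rewritten sstable sort by original file number and then offset, which is the order the blobs were written in.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeBlobPointer encodes a <file-num,offset> blob pointer as a key.
// Big-endian encoding makes byte-wise key comparison match numeric
// ordering, so pointers sort by original file number, then offset.
func encodeBlobPointer(fileNum, offset uint64) []byte {
	key := make([]byte, 16)
	binary.BigEndian.PutUint64(key[:8], fileNum)
	binary.BigEndian.PutUint64(key[8:], offset)
	return key
}

func main() {
	a := encodeBlobPointer(7, 0)
	b := encodeBlobPointer(7, 1000)
	c := encodeBlobPointer(10, 16)
	fmt.Println(bytes.Compare(a, b) < 0, bytes.Compare(b, c) < 0) // true true
}
```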
Triggering GC

When should GC be triggered? Simply walking through the files in the blob level in order is one approach, but we can do better. We should GC a blob file when it has a significant fraction of "dead" data. We can keep an exact count of this dead data by tracking when blob pointers are dropped during compactions (because a tombstone covering the blob has reached the bottommost level). A compaction would keep a []blobPointer slice that it populates with dropped blob pointers. When the compaction is complete, the dropped blob pointers are sorted and the sorted blob pointers are walked in parallel with the in-memory blob file metadata, updating a fileMetadata.deadBytes field. Recall that this blob file metadata will be persisted to the MANIFEST as well, so the deadBytes field will be precise.

A background process would periodically wake up and look at the blob file metadata, looking for contiguous runs of files that will "compact" down to a much smaller size. Note that we'll be able to accurately predict both the space savings and the size of the new file.
Jira issue: PEBBLE-183
Epic CRDB-20379