cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

sstable: table-wide prefix compression #2632

Open jbowens opened 1 year ago

jbowens commented 1 year ago

Consider a new Comparer function EqualBytePrefix(a, b []byte) (prefix []byte, ok bool). This function returns a shared byte prefix p of both a and b such that for all x and y:

Compare(p + x, p + y) = Compare(x, y) 
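A minimal sketch of what this could look like for a purely bytewise comparer (e.g. pebble's DefaultComparer), where any shared byte prefix satisfies the contract; a comparer with custom suffix handling, like CockroachDB's, would likely need to trim the prefix back to a boundary its Compare treats atomically:

```go
// EqualBytePrefix returns the longest common byte prefix of a and b. For a
// bytewise comparer, stripping that prefix from both operands does not
// change the result of Compare, which is exactly the contract above.
func EqualBytePrefix(a, b []byte) (prefix []byte, ok bool) {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	for i < n && a[i] == b[i] {
		i++
	}
	if i == 0 {
		return nil, false
	}
	return a[:i], true
}
```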

In larger databases with large SQL tables, it's common for all of a compaction's inputs to fall within the same SQL table index. If there are very many rows with a shared prefix, it's reasonably likely that the current key and the largest key of an overlapping compaction input share a relatively large byte prefix, in particular if the user is storing large keys. When starting a new output sstable, we could call EqualBytePrefix(key, c.Largest.UserKey) to determine a shared byte prefix that all the remaining keys have, for which key comparisons could rely solely on the remaining byte suffix.
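A hedged sketch of where that check could happen; firstOutputKey, compactionLargest, and setTablePrefix are illustrative stand-ins, not pebble's actual compaction or sstable.Writer internals:

```go
// Illustrative only: called when a compaction starts a new output sstable.
func maybeSetTablePrefix(
	equalBytePrefix func(a, b []byte) (prefix []byte, ok bool),
	firstOutputKey, compactionLargest []byte,
	setTablePrefix func(prefix []byte),
) {
	if equalBytePrefix == nil {
		return // comparer doesn't support prefix stripping
	}
	// Every remaining key the compaction will write to this output (and to
	// any later one) sorts between firstOutputKey and compactionLargest, so
	// a byte prefix shared by the two bounds is shared by all of them.
	if prefix, ok := equalBytePrefix(firstOutputKey, compactionLargest); ok {
		setTablePrefix(prefix) // e.g. record it as a table property
	}
}
```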

When writing these sstables, we could stash the EqualBytePrefix result somewhere (a property?) and proceed to write only the byte suffix of each key to the sstable. During iteration, when decoding keys solely to serve seeks (e.g., see the special blockIter.SeekGE code path and its inlined entry decoding), we could perform comparisons solely on the key suffixes encoded in the block. When decoding keys to return up the iterator stack, the blockIter would use its fullKey buffer (already used for prefix compression) to append the block-encoded byte suffix onto the table-wide byte prefix.
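A rough sketch of that materialization path, assuming the table-wide prefix is stashed in a property; suffixIter and materialize are illustrative names, not blockIter's actual code:

```go
// Illustrative stand-in for the blockIter behavior described above.
type suffixIter struct {
	tablePrefix []byte // table-wide shared prefix read once from the sstable
	fullKey     []byte // reusable buffer, analogous to blockIter.fullKey
}

// materialize rebuilds a full user key from the suffix encoded in the block.
func (i *suffixIter) materialize(encodedSuffix []byte) []byte {
	i.fullKey = append(i.fullKey[:0], i.tablePrefix...)
	i.fullKey = append(i.fullKey, encodedSuffix...)
	return i.fullKey
}
```

Seeks would skip this entirely: strip the matching prefix from the seek key once, then compare suffixes only.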

This has a few advantages over block-level prefix compression:

  1. Block-level prefix compression does not apply to index blocks, because we need to be able to seek among all of the keys within an index block, requiring restartInterval=1.
  2. Block-level prefix compression duplicates the shared key prefix every restartInterval (16 keys in CockroachDB); see the block-entry encoding sketch after this list.
  3. Key comparisons during seeks are performed on the fully materialized key, which adds the cost of materializing the key plus a more expensive key comparison.
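To make points 1 and 2 concrete, here's a sketch of the LevelDB-style entry encoding used in pebble's row-oriented blocks (written from memory, so treat the details as approximate): each entry records how many leading bytes it shares with the previous key, and a restart point forces that count to zero, so any table/index prefix is re-written every restartInterval keys, and not shared at all in index blocks with restartInterval=1.

```go
import "encoding/binary"

// encodeEntry appends one block entry: varint lengths followed by the
// unshared key bytes and the value. At a restart point shared is 0 and the
// full key, including any table-wide prefix, is written out again.
func encodeEntry(buf []byte, shared, unshared int, unsharedKey, value []byte) []byte {
	buf = binary.AppendUvarint(buf, uint64(shared))     // bytes shared with previous key
	buf = binary.AppendUvarint(buf, uint64(unshared))   // remaining key bytes
	buf = binary.AppendUvarint(buf, uint64(len(value))) // value length
	buf = append(buf, unsharedKey...)
	return append(buf, value...)
}
```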

I'm unsure how large a shared byte prefix would need to be in practice for the overhead to be worth it, or how frequently we see such keys and sstables in CockroachDB. I think we should examine telemetry, large tpce workloads, etc. to estimate whether this is useful in practice.

Jira issue: PEBBLE-59

jbowens commented 10 months ago

@RaduBerinde @dt @sumeerbhola This is an older issue that describes a mechanic that could be repurposed for prefix rewriting.

dt commented 10 months ago

We could also do something like this per-block, i.e. just include the shared prefix once in the block footer rather than in every restart key.

Tangentially related: I was recently pondering, while looking at a profile of our absurdly expensive estimate disk usage call, whether it would make sense to push a second concept of "prefix" into pebble. Perhaps we'd define a second function, similar to Split but which extracts a prefix of the prefix that Split extracts (a RelationPrefix? CommonPrefix? I dunno). For CRDB this function would return the bytes that encode the TenantID/TableID/IndexID prefix of keys.

If a CommonPrefix splitter is set, we could then do something like keep usage info by common prefix in a stateful cache of common prefix -> total block size that is updated on flushes and compactions, and make these estimate-stats calls dirt cheap.
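A loose sketch of the shape this could take; CommonPrefix, prefixUsage, and record are hypothetical names, and a real version would also need to subtract sizes when compactions delete files:

```go
import "sync"

// CommonPrefix is the hypothetical second splitter described above; for CRDB
// it would return the TenantID/TableID/IndexID encoding at the front of key.
type CommonPrefix func(key []byte) []byte

// prefixUsage is a stateful cache of common prefix -> total block size,
// updated by flushes and compactions so usage queries become a map lookup.
type prefixUsage struct {
	mu    sync.Mutex
	bytes map[string]uint64
}

func newPrefixUsage() *prefixUsage {
	return &prefixUsage{bytes: make(map[string]uint64)}
}

// record is the hook a flush or compaction would call for each block (or
// file) it writes; a deletion hook would decrement the same entries.
func (u *prefixUsage) record(split CommonPrefix, key []byte, size uint64) {
	p := string(split(key))
	u.mu.Lock()
	u.bytes[p] += size
	u.mu.Unlock()
}
```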

jbowens commented 10 months ago

> while looking at a profile of our absurdly expensive estimate disk usage call

Is it absurdly expensive? It should be equivalent to approximately two point reads.

dt commented 10 months ago

Sorry, "absurdly expensive" was referring to the overall cost of doing any O(files) work for every request when run on a large cluster, not so much the work we do per file. At least in some of the clusters we were looking at recently, EstimateDiskUsage was taking a minute or more.

If we taught pebble how to predict the span bounds along which those requests will be made later, it could maintain aggregated span usage info in a stateful map (or have a hook called on flush/compaction that lets the store do that on the crdb side), so we could serve those requests in constant time per store, regardless of the number of files.

jbowens commented 10 months ago

Ahh, I see. I think we could also use the manifest.Annotator interface to define a file size annotator over the file metadata B-Tree nodes that make up a level. That's what we use to cheaply compute the "compensated size" of a level for compaction heuristics: https://github.com/cockroachdb/pebble/blob/58bdc725addc9b9175d9dc281c342a673c419370/compaction_picker.go#L732-L764

This would make the calculation logarithmic with respect to the number of files (which seeking already is), and it wouldn't need to be upfront aware of the prefixes/keyspans that will be queried.
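Not the actual manifest.Annotator interface, but a toy illustration of the underlying trick: cache an aggregate at each B-tree node so that size queries touch only a small number of nodes (a range-bounded query would only fully descend into the nodes straddling the query bounds):

```go
// Illustrative only; pebble's real annotators hang off the file metadata
// B-tree in internal/manifest.
type fileMeta struct{ size uint64 }

type node struct {
	files      []fileMeta // file metadata stored directly in this node
	children   []*node    // nil for a leaf
	cachedSize uint64     // aggregate over the whole subtree
	cacheValid bool       // invalidated whenever the subtree is edited
}

// subtreeSize returns the total file size under n, reusing cached values so
// that after an edit only the nodes along the modified path recompute.
func (n *node) subtreeSize() uint64 {
	if n.cacheValid {
		return n.cachedSize
	}
	var sum uint64
	for _, f := range n.files {
		sum += f.size
	}
	for _, c := range n.children {
		sum += c.subtreeSize()
	}
	n.cachedSize, n.cacheValid = sum, true
	return sum
}
```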

Do you happen to have the cpu profile still? Maybe we should make an issue.

petermattis commented 10 months ago

I was thinking about a similar idea this morning in the context of Online Restore, though with a slightly different tack: materializing the full key on demand in blockIter and not trying to use the common shared prefix to reduce the number of bytes compared (though that could still be done). (Note the description below overlaps with @jbowens' original description. I happened to develop this independently before stumbling across this issue.)

As noted above, keys have the shape [/TenantID]/TableID/IndexID. I'm pretty confident that many sstables only contain keys from a single table/index. (We could easily check this on cloud or in our test fixtures by looking at MANIFEST entries.)

The above makes replacing the shared prefix for Online Restore "easy":

I like that this approach makes sstable-level prefix compression/materialization the common path. I'm not seeing any significant hurdle to implementation and believe that there will be minimal (near-zero) performance overhead at the blockIter level. The downsides to this approach are that we'll need a new sstable format and Online Restore will only work on sstables using this new format.
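To make the Online Restore angle concrete, a hypothetical sketch (none of these names exist in pebble): if blocks hold only key suffixes and the table-wide prefix lives in a property, "rewriting" the prefix amounts to materializing keys against a replacement prefix, and seeks translate the seek key the other way:

```go
import "bytes"

// Hypothetical sketch of prefix rewriting on the read path.
type prefixRewritingIter struct {
	origPrefix []byte // table-wide prefix recorded in the sstable
	newPrefix  []byte // prefix the restored keys should appear under
	fullKey    []byte // reusable buffer, as with blockIter.fullKey
}

// rewrite materializes a returned key under the replacement prefix.
func (i *prefixRewritingIter) rewrite(encodedSuffix []byte) []byte {
	i.fullKey = append(i.fullKey[:0], i.newPrefix...)
	i.fullKey = append(i.fullKey, encodedSuffix...)
	return i.fullKey
}

// seekSuffix maps an external seek key into the block's suffix space; by the
// EqualBytePrefix contract, comparing suffixes is equivalent to comparing
// full keys under origPrefix.
func (i *prefixRewritingIter) seekSuffix(seekKey []byte) (suffix []byte, ok bool) {
	if !bytes.HasPrefix(seekKey, i.newPrefix) {
		return nil, false // caller seeks to the table's start or end instead
	}
	return seekKey[len(i.newPrefix):], true
}
```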