cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

sstable: table-wide prefix compression #2632

Open jbowens opened 1 year ago

jbowens commented 1 year ago

Consider a new Comparer function EqualBytePrefix(a, b []byte) (prefix []byte, ok bool). This function returns a shared byte prefix p of both a and b such that for all x and y:

Compare(p + x, p + y) = Compare(x, y) 
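A minimal sketch of what this could look like for a purely bytewise comparer (e.g. pebble's DefaultComparer), where any shared byte prefix satisfies the contract; a comparer with custom suffix handling, like CockroachDB's, would likely need to trim the prefix back to a boundary its Compare treats atomically:

```go
// EqualBytePrefix returns the longest common byte prefix of a and b. For a
// bytewise comparer, stripping that prefix from both operands does not
// change the result of Compare, which is exactly the contract above.
func EqualBytePrefix(a, b []byte) (prefix []byte, ok bool) {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	for i < n && a[i] == b[i] {
		i++
	}
	if i == 0 {
		return nil, false
	}
	return a[:i], true
}
```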

In larger databases with large SQL tables, it's common for all of a compaction's inputs to fall within the same SQL table index. If there are very many rows with a shared prefix, it's reasonably likely that the current key and the largest key of an overlapping compaction input share a relatively large byte prefix, in particular if the user is storing large keys. When starting a new output sstable, we could call EqualBytePrefix(key, c.Largest.UserKey) to determine a shared byte prefix that all the remaining keys have, for which key comparisons could rely solely on the remaining byte suffix.
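A hedged sketch of where that check could happen; firstOutputKey, compactionLargest, and setTablePrefix are illustrative stand-ins, not pebble's actual compaction or sstable.Writer internals:

```go
// Illustrative only: called when a compaction starts a new output sstable.
func maybeSetTablePrefix(
	equalBytePrefix func(a, b []byte) (prefix []byte, ok bool),
	firstOutputKey, compactionLargest []byte,
	setTablePrefix func(prefix []byte),
) {
	if equalBytePrefix == nil {
		return // comparer doesn't support prefix stripping
	}
	// Every remaining key the compaction will write to this output (and to
	// any later one) sorts between firstOutputKey and compactionLargest, so
	// a byte prefix shared by the two bounds is shared by all of them.
	if prefix, ok := equalBytePrefix(firstOutputKey, compactionLargest); ok {
		setTablePrefix(prefix) // e.g. record it as a table property
	}
}
```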

When writing these sstables, we could stash the EqualBytePrefix result somewhere (a property?) and proceed to write only the byte suffix of each key to the sstable. During iteration, when decoding keys solely to serve seeks (e.g., see the special blockIter.SeekGE code path and its inlined entry decoding), we could perform comparisons solely on the key suffixes encoded in the block. When decoding keys to return up the iterator stack, the blockIter would use its fullKey buffer (already used for prefix compression) to append the block-encoded byte suffix onto the table-wide byte prefix.
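A rough sketch of that materialization path, assuming the table-wide prefix is stashed in a property; suffixIter and materialize are illustrative names, not blockIter's actual code:

```go
// Illustrative stand-in for the blockIter behavior described above.
type suffixIter struct {
	tablePrefix []byte // table-wide shared prefix read once from the sstable
	fullKey     []byte // reusable buffer, analogous to blockIter.fullKey
}

// materialize rebuilds a full user key from the suffix encoded in the block.
func (i *suffixIter) materialize(encodedSuffix []byte) []byte {
	i.fullKey = append(i.fullKey[:0], i.tablePrefix...)
	i.fullKey = append(i.fullKey, encodedSuffix...)
	return i.fullKey
}
```

Seeks would skip this entirely: strip the matching prefix from the seek key once, then compare suffixes only.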

This has a few advantages over block-level prefix compression:

  1. Block-level prefix compression does not apply to index blocks, because we need to be able to seek among all of the keys within an index block, requiring restartInterval=1.
  2. Block-level prefix compression duplicates the shared key prefix every restartInterval (16 keys in CockroachDB); see the block-entry encoding sketch after this list.
  3. Key comparisons during seeks are performed on the fully materialized key, which adds the cost of materializing the key plus a more expensive key comparison.
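To make points 1 and 2 concrete, here's a sketch of the LevelDB-style entry encoding used in pebble's row-oriented blocks (written from memory, so treat the details as approximate): each entry records how many leading bytes it shares with the previous key, and a restart point forces that count to zero, so any table/index prefix is re-written every restartInterval keys, and not shared at all in index blocks with restartInterval=1.

```go
import "encoding/binary"

// encodeEntry appends one block entry: varint lengths followed by the
// unshared key bytes and the value. At a restart point shared is 0 and the
// full key, including any table-wide prefix, is written out again.
func encodeEntry(buf []byte, shared, unshared int, unsharedKey, value []byte) []byte {
	buf = binary.AppendUvarint(buf, uint64(shared))     // bytes shared with previous key
	buf = binary.AppendUvarint(buf, uint64(unshared))   // remaining key bytes
	buf = binary.AppendUvarint(buf, uint64(len(value))) // value length
	buf = append(buf, unsharedKey...)
	return append(buf, value...)
}
```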

I'm unsure how large a shared byte prefix would need to be in practice for the overhead to be worth it, or how frequently we see such keys and sstables in CockroachDB. I think we should examine telemetry, large tpce workloads, etc. to estimate whether this is useful in practice.

Jira issue: PEBBLE-59

jbowens commented 10 months ago

@RaduBerinde @dt @sumeerbhola This is an older issue that describes a mechanic that could be repurposed for prefix rewriting.

dt commented 10 months ago

We could also do something like this per-block, i.e. just include the shared prefix once in the block footer rather than in every restart key.

Tangentially related: I was recently pondering, while looking at a profile of our absurdly expensive estimate disk usage call, whether it would make sense to push a second concept of "prefix" into pebble. Perhaps we'd define a second function, similar to Split but which extracts a prefix of the prefix that Split extracts (a RelationPrefix? CommonPrefix? I dunno). For CRDB this function would return the bytes that encode the TenantID/TableID/IndexID prefix of keys.

If a CommonPrefix splitter is set, we could then do something like keep usage info by common prefix in a stateful cache of common prefix -> total block size that is updated on flushes and compactions, and make these estimate-stats calls dirt cheap.
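A loose sketch of the shape this could take; CommonPrefix, prefixUsage, and record are hypothetical names, and a real version would also need to subtract sizes when compactions delete files:

```go
import "sync"

// CommonPrefix is the hypothetical second splitter described above; for CRDB
// it would return the TenantID/TableID/IndexID encoding at the front of key.
type CommonPrefix func(key []byte) []byte

// prefixUsage is a stateful cache of common prefix -> total block size,
// updated by flushes and compactions so usage queries become a map lookup.
type prefixUsage struct {
	mu    sync.Mutex
	bytes map[string]uint64
}

func newPrefixUsage() *prefixUsage {
	return &prefixUsage{bytes: make(map[string]uint64)}
}

// record is the hook a flush or compaction would call for each block (or
// file) it writes; a deletion hook would decrement the same entries.
func (u *prefixUsage) record(split CommonPrefix, key []byte, size uint64) {
	p := string(split(key))
	u.mu.Lock()
	u.bytes[p] += size
	u.mu.Unlock()
}
```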

jbowens commented 10 months ago

> while looking at a profile of our absurdly expensive estimate disk usage call

Is it absurdly expensive? It should be equivalent to approximately two point reads.

dt commented 10 months ago

Sorry, "absurdly expensive" was referring to the overall cost of doing any O(files) work for every request when run on a large cluster, not so much the work we do per file. At least in some of the clusters we were looking at recently, EstimateDiskUsage was taking a minute or more.

If we taught pebble how to predict the span bounds along which those requests will be made later, it could maintain aggregated span usage info in a stateful map (or have a hook called on flush/compaction that lets the store do that on the crdb side), so we could serve those requests in constant time per store, regardless of the number of files.

jbowens commented 10 months ago

Ahh, I see. I think we could also use the manifest.Annotator interface to define a file size annotator over the file metadata B-Tree nodes that make up a level. That's what we use to cheaply compute the "compensated size" of a level for compaction heuristics: https://github.com/cockroachdb/pebble/blob/58bdc725addc9b9175d9dc281c342a673c419370/compaction_picker.go#L732-L764

This would make the calculation logarithmic with respect to the number of files (which seeking already is), and it wouldn't need to be upfront aware of the prefixes/keyspans that will be queried.
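Not the actual manifest.Annotator interface, but a toy illustration of the underlying trick: cache an aggregate at each B-tree node so that size queries touch only a small number of nodes (a range-bounded query would only fully descend into the nodes straddling the query bounds):

```go
// Illustrative only; pebble's real annotators hang off the file metadata
// B-tree in internal/manifest.
type fileMeta struct{ size uint64 }

type node struct {
	files      []fileMeta // file metadata stored directly in this node
	children   []*node    // nil for a leaf
	cachedSize uint64     // aggregate over the whole subtree
	cacheValid bool       // invalidated whenever the subtree is edited
}

// subtreeSize returns the total file size under n, reusing cached values so
// that after an edit only the nodes along the modified path recompute.
func (n *node) subtreeSize() uint64 {
	if n.cacheValid {
		return n.cachedSize
	}
	var sum uint64
	for _, f := range n.files {
		sum += f.size
	}
	for _, c := range n.children {
		sum += c.subtreeSize()
	}
	n.cachedSize, n.cacheValid = sum, true
	return sum
}
```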

Do you happen to have the cpu profile still? Maybe we should make an issue.

petermattis commented 10 months ago

I was thinking about a similar idea this morning in the context of Online Restore, though with a slightly different tack: materializing the full key on demand in blockIter and not trying to use the common shared prefix to reduce the number of bytes compared (though that could still be done). (Note the description below overlaps with @jbowens' original description. I happened to develop this independently before stumbling across this issue.)

As noted above, keys have the shape [/TenantID]/TableID/IndexID. I'm pretty confident that many sstables only contain keys from a single table/index. (We could easily check this on cloud or in our test fixtures by looking at MANIFEST entries.)

The above makes replacing the shared prefix for Online Restore "easy":

I like that this approach makes sstable-level prefix compression/materialization the common path. I'm not seeing any significant hurdle to implementation and believe that there will be minimal (near-zero) performance overhead at the blockIter level. The downsides to this approach are that we'll need a new sstable format and Online Restore will only work on sstables using this new format.
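To make the Online Restore angle concrete, a hypothetical sketch (none of these names exist in pebble): if blocks hold only key suffixes and the table-wide prefix lives in a property, "rewriting" the prefix amounts to materializing keys against a replacement prefix, and seeks translate the seek key the other way:

```go
import "bytes"

// Hypothetical sketch of prefix rewriting on the read path.
type prefixRewritingIter struct {
	origPrefix []byte // table-wide prefix recorded in the sstable
	newPrefix  []byte // prefix the restored keys should appear under
	fullKey    []byte // reusable buffer, as with blockIter.fullKey
}

// rewrite materializes a returned key under the replacement prefix.
func (i *prefixRewritingIter) rewrite(encodedSuffix []byte) []byte {
	i.fullKey = append(i.fullKey[:0], i.newPrefix...)
	i.fullKey = append(i.fullKey, encodedSuffix...)
	return i.fullKey
}

// seekSuffix maps an external seek key into the block's suffix space; by the
// EqualBytePrefix contract, comparing suffixes is equivalent to comparing
// full keys under origPrefix.
func (i *prefixRewritingIter) seekSuffix(seekKey []byte) (suffix []byte, ok bool) {
	if !bytes.HasPrefix(seekKey, i.newPrefix) {
		return nil, false // caller seeks to the table's start or end instead
	}
	return seekKey[len(i.newPrefix):], true
}
```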