storage: improve MVCC benchmark data

jbowens commented 1 year ago

The MVCC benchmarks in the storage package build up an initial state of a database to run against. These databases are very unrepresentative of real-world LSMs.

For example, the 100,000 key, 1 version-per-key, 64-byte value variant produced an LSM consisting of just 2 sstables in L6, resulting in a read amplification of 1. The 100,000 key, 100 versions-per-key, 64-byte value variant produced a slightly more representative LSM with 3 non-empty levels: L6, L5 and L0. L0 had two sublevels, resulting in a read amplification of 4.

Building truly representative LSMs would be prohibitively slow, but using smaller target file sizes or carefully constructing the database to force additional LSM levels would likely yield performance characteristics more in line with a realistic Cockroach LSM than the current benchmarks do.

Once we have a corpus of compaction benchmarking workloads (cockroachdb/pebble#1865), including initial LSMs, we could switch to relying more on microbenchmarks that we run manually with a mounted, pre-collected LSM.

Jira issue: CRDB-22532

blathers-crl[bot] commented 1 year ago

Hi @jbowens, please add a C-ategory label to your issue. Check out the label system docs.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

jbowens commented 1 year ago

cc @erikgrinaker, heads up that the MVCCScan benchmark (especially with smaller version counts) kinda sucks.

storage: improve MVCC benchmark data #93795