cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: expose top-N ranges by deletions #102645

Open nicktrav opened 1 year ago

nicktrav commented 1 year ago

Is your feature request related to a problem? Please describe.

Currently, it is difficult to tell whether a range that has undergone MVCC GC still has a large number of deletions waiting to be cleared out by a Pebble compaction. These lingering deletions can degrade query performance, because a scan or get has to step over the deleted keys.

Describe the solution you'd like

Introduce an internal builtin that can be used to query the top-N ranges by deletions.

For example:

crdb_internal.engine_deletions_top_n(node, store, N, [start, end])

Where N is the number of ranges to return. The start and end args accept the hex representations of span start and end keys, respectively. Only ranges that overlap the start / end bounds are considered (allowing for filtering on a table, index, etc.). If the start / end keys are omitted, the entire keyspace is considered.
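As a rough illustration of the argument handling described above, the optional hex-encoded bounds could be decoded along these lines. This is only a sketch: `parseBounds` is a hypothetical helper, not an existing CockroachDB or Pebble function, and treating an empty string as "no bound" is an assumption made for the example.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"errors"
	"fmt"
)

// parseBounds is a hypothetical helper (not part of CockroachDB or Pebble).
// It decodes the hex-encoded start and end keys. An empty string is taken
// to mean "unbounded" on that side, which is returned as a nil key.
func parseBounds(startHex, endHex string) (start, end []byte, err error) {
	if startHex != "" {
		if start, err = hex.DecodeString(startHex); err != nil {
			return nil, nil, fmt.Errorf("invalid start key: %w", err)
		}
	}
	if endHex != "" {
		if end, err = hex.DecodeString(endHex); err != nil {
			return nil, nil, fmt.Errorf("invalid end key: %w", err)
		}
	}
	// When both bounds are given, the span must be non-empty.
	if start != nil && end != nil && bytes.Compare(start, end) >= 0 {
		return nil, nil, errors.New("start key must sort before end key")
	}
	return start, end, nil
}

func main() {
	// Example keys chosen arbitrarily for illustration.
	start, end, err := parseBounds("f289", "f28a")
	fmt.Println(start, end, err)

	// Omitted bounds: the entire keyspace is considered.
	start, end, err = parseBounds("", "")
	fmt.Println(start == nil, end == nil, err)
}
```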

The output would be something like the range ID, along with columns for the point, range, and range key deletion counts across the SSTables that make up the range.

Under the hood, Pebble can use its in-memory table stats to compute the counts. If an SSTable only partially overlaps the bounds, linear interpolation can be used to estimate the counts.
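To make the interpolation idea concrete, here is a minimal sketch of how per-range estimates and the top-N selection might be computed. All types and functions (`tableStats`, `rangeDesc`, `keyPos`, `topNByDeletions`) are hypothetical and do not reflect Pebble's actual table stats structures. Mapping the first 8 bytes of a key to a number is a simplifying assumption so that the overlap fraction can be computed; a real implementation would need a better notion of key distance.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"sort"
)

// tableStats is a hypothetical stand-in for per-SSTable stats.
type tableStats struct {
	smallest, largest                  []byte
	pointDels, rangeDels, rangeKeyDels float64
}

// rangeDesc is a hypothetical range descriptor with key bounds.
type rangeDesc struct {
	id         int
	start, end []byte
}

// rangeDelCounts holds the estimated deletion counts for one range.
type rangeDelCounts struct {
	rangeID                  int
	point, rangeDel, rangeKey float64
}

// keyPos approximates a key's position using its first 8 bytes
// (zero-padded). This is an assumption made for the sketch.
func keyPos(k []byte) float64 {
	var buf [8]byte
	copy(buf[:], k)
	return float64(binary.BigEndian.Uint64(buf[:]))
}

// overlapFraction returns the fraction of the table span [tLo, tHi]
// that is covered by [lo, hi], using linear interpolation.
func overlapFraction(tLo, tHi, lo, hi float64) float64 {
	if tHi <= tLo { // Degenerate table span: all-or-nothing.
		if tLo >= lo && tLo <= hi {
			return 1
		}
		return 0
	}
	start, end := math.Max(tLo, lo), math.Min(tHi, hi)
	if end <= start {
		return 0
	}
	return (end - start) / (tHi - tLo)
}

// topNByDeletions estimates deletion counts per range and returns the
// n ranges with the largest total, in descending order.
func topNByDeletions(ranges []rangeDesc, tables []tableStats, n int) []rangeDelCounts {
	out := make([]rangeDelCounts, 0, len(ranges))
	for _, r := range ranges {
		c := rangeDelCounts{rangeID: r.id}
		for _, t := range tables {
			f := overlapFraction(keyPos(t.smallest), keyPos(t.largest),
				keyPos(r.start), keyPos(r.end))
			c.point += f * t.pointDels
			c.rangeDel += f * t.rangeDels
			c.rangeKey += f * t.rangeKeyDels
		}
		out = append(out, c)
	}
	total := func(c rangeDelCounts) float64 { return c.point + c.rangeDel + c.rangeKey }
	sort.SliceStable(out, func(i, j int) bool { return total(out[i]) > total(out[j]) })
	if n < 0 {
		n = 0
	}
	if n > len(out) {
		n = len(out)
	}
	return out[:n]
}

func main() {
	// Made-up tables and ranges, purely for illustration.
	tables := []tableStats{
		{smallest: []byte{0x00}, largest: []byte{0x40}, pointDels: 100},
		{smallest: []byte{0x40}, largest: []byte{0x80}, pointDels: 10},
	}
	ranges := []rangeDesc{
		{id: 1, start: []byte{0x00}, end: []byte{0x20}},
		{id: 2, start: []byte{0x20}, end: []byte{0x60}},
		{id: 3, start: []byte{0x60}, end: []byte{0x80}},
	}
	for _, c := range topNByDeletions(ranges, tables, 2) {
		fmt.Printf("range %d: point=%.1f range=%.1f rangeKey=%.1f\n",
			c.rangeID, c.point, c.rangeDel, c.rangeKey)
	}
}
```

The interpolation assumes deletions are spread uniformly across each SSTable's key span, so the resulting counts are estimates rather than exact values.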

Describe alternatives you've considered

This could also be achieved via something like what has been proposed in #94659 and cockroachdb/pebble#1996.

Jira issue: CRDB-27569

nicktrav commented 1 year ago

This would likely be solved if we went with something like the virtual table outlined in #102604.

Keeping this open for now, but let's revisit once #102604 is done.