Open irfansharif opened 4 years ago
@petermattis: How much of this work do you reckon falls under storage? We're thinking of this as a potential starter intern project for KV, and the 'ripping out' of the compactor queue portion is doable enough, but I'm not sure if we'll end up putting something back into Pebble. Thoughts?
I think this almost entirely lands on the storage team. @jbowens is actively doing the work inside of Pebble (which is significant in size), and I was expecting to leave the glory of ripping out the Compactor Queue to him.
@petermattis Seems like the actual removal of the compactor queue would need to wait until the removal of RocksDB? Is that true?
Correct, though we could arrange for the compactor queue to only be enabled for RocksDB. It is probably also worthwhile to get rid of the compactor queue metrics from the admin UI. Those are usually a source of confusion as people conflate them with the Pebble/RocksDB compaction metrics.
The Compactor queue is disabled for Pebble, and the graphs have been removed regardless of storage engine in 20.2. One bit left to do here is to figure out if we should add additional graphs around compactions. For example, @itsbilal's newly added metric around in-progress compaction disk usage.
@bananabrick this popped on our radar as something you might be interested in taking on / thinking about as a part of the compaction work you have in progress. happy to chat more offline
...and possibly delete the Compactor Queue as it exists today.
tl;dr: The Compaction Queue chart we expose through our UI is not a very useful chart to be looking at, and we could do better.
The Compactor is the mechanism we have in place today that allows us to suggest compactions, on demand, to the underlying storage engine. We typically make use of this when we know we are generating a lot of garbage (for instance, when a store accepts a bunch of new replicas that overlap with existing ones during a decommissioning process). The Compactor periodically goes through received suggestions and instructs RocksDB/Pebble to compact data on disk, as appropriate. Note that this is not strictly necessary: RocksDB/Pebble will carry out compactions over time as needed; the Compactor exists to proactively reclaim space when possible.
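To make the suggest-then-merge mechanism concrete, here's a minimal sketch in Go (hypothetical types and names, not the actual cockroachdb implementation) of a queue that records compaction suggestions and coalesces overlapping key spans before handing them to the engine:

```go
package main

import (
	"fmt"
	"sort"
)

// Suggestion is a hypothetical record of a key span the KV layer
// believes contains reclaimable garbage.
type Suggestion struct {
	StartKey, EndKey string
	Bytes            int64 // estimated reclaimable bytes
}

// CompactorQueue sketches the suggest/merge/compact loop described above.
type CompactorQueue struct {
	suggestions []Suggestion
}

// Suggest records a suggestion; in the real Compactor this is also
// the point at which the "queued bytes" metric would be incremented.
func (q *CompactorQueue) Suggest(s Suggestion) {
	q.suggestions = append(q.suggestions, s)
}

// Merged coalesces overlapping/adjacent suggestions into larger spans,
// on the theory that one large compaction is more actionable than many
// fractured small ones.
func (q *CompactorQueue) Merged() []Suggestion {
	if len(q.suggestions) == 0 {
		return nil
	}
	sort.Slice(q.suggestions, func(i, j int) bool {
		return q.suggestions[i].StartKey < q.suggestions[j].StartKey
	})
	out := []Suggestion{q.suggestions[0]}
	for _, s := range q.suggestions[1:] {
		last := &out[len(out)-1]
		if s.StartKey <= last.EndKey { // overlapping or adjacent span
			if s.EndKey > last.EndKey {
				last.EndKey = s.EndKey
			}
			last.Bytes += s.Bytes
		} else {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	var q CompactorQueue
	q.Suggest(Suggestion{"a", "c", 100})
	q.Suggest(Suggestion{"b", "e", 200})
	q.Suggest(Suggestion{"x", "z", 50})
	for _, s := range q.Merged() {
		fmt.Printf("compact [%s,%s): ~%d bytes\n", s.StartKey, s.EndKey, s.Bytes)
	}
}
```

The real Compactor also applies thresholds (see the `compactor.*` cluster settings below) before deciding a merged span is worth acting on; this sketch skips that.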
The Compaction Queue graph, perhaps confusingly, records the view of the world as seen by the Compactor, not as seen by RocksDB/Pebble. A suggestion to the Compactor is recorded in `queued bytes` (as seen in the UI at the time of writing), and it's only when the Compactor oversees the processing of that suggestion that it decrements the `queued bytes` metric. It does not periodically poll the underlying storage engine to reflect what RocksDB/Pebble thinks this value should be (say, "estimated reclaimable space"); it only records the state of the suggestions received thus far. This does not seem to be a useful metric to track. It also only updates on demand, when new compaction suggestions arrive, and it does not react to changing cluster settings pertaining to the Compactor (`compactor.{max_record_age,threshold_{bytes,{available,used}_fraction}}`).

In https://github.com/cockroachlabs/support/issues/385 we observed a supposedly "wedged" compaction graph that was in fact simply out of date: it wasn't updating itself because it hadn't received any compaction suggestions for some time. Because all the suggested compactions were fractured/small, and thus inactionable, the graph persistently displayed a high `queued bytes` amount.
For the reasons above, I think what we want is closer to https://github.com/cockroachdb/cockroach/issues/41265 and https://github.com/cockroachdb/cockroach/issues/43965: possibly exposing `rocksdb.estimated-pending-compaction` (and/or the Pebble equivalent) as a first-class UI citizen instead. The Compaction Queue graph, as it stands today, offers no visibility into anything we would be interested in (and is also usually out of date).

As for the removal of the Compactor Queue in its entirety: I think it was introduced as an attempt to reclaim garbage on demand and to control RocksDB compaction behavior, but I'm not sure that (a) we need such a thing, or (b) it's effective at doing said thing. It seems to me that if we have problems around garbage reclamation, we should address them at the storage layer, not at KV.
We currently persist received suggestions if we're unable to act on them immediately, in the hope that future suggestions over larger intervals can be merged with them. I'm unsure if this happens often, or if it does, when. Suggestions are also deleted after 24hrs (coming back to (b), the effectiveness of it all).
Jira issue: CRDB-5091