cockroachdb/pebble

perf: heuristics to reduce impact of point tombstones on read latency #918

Closed · jbowens closed this issue 3 months ago

jbowens commented 4 years ago

High densities of point tombstones slow down iteration, wasting IO on loading keys that are ultimately discarded and CPU time on skipping over them.

In RocksDB, sstables containing a majority of deletion entries are prioritized for compaction by inflating those tables' compensated sizes. ~In Pebble, we never adjust compaction priorities to compact point tombstones.~ (Update: in #1018 we adjusted compaction heuristics in a way similar to RocksDB to increase the priority of compacting these point tombstones.) We should investigate whether any CockroachDB workloads accumulate high densities of point tombstones and whether those densities affect the workload in practice. If there is an impact, we might try triggering compactions of files based on the frequency with which an iterator skips over point tombstones in the file.
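For illustration, here is a rough sketch of the compensated-size idea (hypothetical types and names, not RocksDB's or Pebble's actual code): inflate a deletion-heavy table's size by an estimate of the bytes its tombstones would reclaim, so the compaction picker scores it higher.

```go
// Hypothetical sketch of deletion-compensated sizing for compaction scoring.
// tableMeta, compensatedSize, and score are illustrative names only.
package compactpick

type tableMeta struct {
	Size          uint64 // physical size of the sstable, in bytes
	NumEntries    uint64 // total internal keys in the table
	NumDeletions  uint64 // point tombstones among those keys
	AvgValueBytes uint64 // rough estimate of the value size of the keys being deleted
}

// compensatedSize inflates the table's size by an estimate of the bytes its
// point tombstones would reclaim if compacted to the bottom of the LSM, so a
// deletion-heavy table looks "larger" to the compaction picker.
func compensatedSize(t tableMeta) uint64 {
	return t.Size + t.NumDeletions*t.AvgValueBytes
}

// score favors tables whose compensated size is large relative to their
// physical size, i.e. tables dense with point tombstones.
func score(t tableMeta) float64 {
	return float64(compensatedSize(t)) / float64(t.Size+1)
}
```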

Related to #29, which also triggers compactions from reads but is aimed at reducing read amplification.

Jira issue: PEBBLE-214

github-actions[bot] commented 2 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!

jbowens commented 2 years ago

I updated the issue description to capture the fact that since #1018, Pebble does factor point tombstones into compaction heuristics by incorporating them into 'compensated size.'

It's still possible that read-triggered compactions of point tombstones would be beneficial, especially for point tombstones in a part of the keyspace where the average value size is small (resulting in less contribution to compensated size). The internal iterator stats added in #1538 can help us evaluate if the existing heuristics are enough.
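For reference, a minimal sketch of pulling those stats off an iterator after a read, assuming the Stats() accessor and the BlockBytes/BlockBytesInCache fields mentioned later in this thread (exact field names may differ between Pebble versions):

```go
// Sketch: after finishing a read, inspect the internal iterator stats to see
// how much block I/O the read incurred and how much was served from cache.
package readstats

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func logReadStats(iter *pebble.Iterator) {
	stats := iter.Stats()
	internal := stats.InternalStats
	// Bytes that had to be read from disk because the blocks were not cached.
	missBytes := internal.BlockBytes - internal.BlockBytesInCache
	log.Printf("block bytes loaded: %d, of which cached: %d, cache-miss bytes: %d",
		internal.BlockBytes, internal.BlockBytesInCache, missBytes)
}
```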

sumeerbhola commented 1 year ago

This could have helped in https://github.com/cockroachlabs/support/issues/2107#issuecomment-1452660323

jbowens commented 1 year ago

This could have helped in cockroachlabs/support#1830.

jbowens commented 1 year ago

I think this came up in another issue, but a more generalized approach could use CPU attribution to determine which sstables to compact. Seems tricky but possible?

sumeerbhola commented 1 year ago

I think this is also coming up with rangedels that we are just skipping over in https://github.com/cockroachlabs/support/issues/2203#issuecomment-1498078068

andrewbaptist commented 1 year ago

Can we add "IO attribution" to ranges? This would be similar to CPU attribution in the sense that it would use lower-level calls to determine how much each individual range is reading/writing per request and the KV layer could store and expose that information both at the range and node level.

The general idea, as we attempt to keep systems balanced, is that having more information can help us make better decisions.

I think the best way to do this is to pass an "IOStats" object into all the MVCC operations, which would update it as part of the operation.
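A sketch of what that could look like (all of these names are hypothetical, none of them exist in CockroachDB today): the MVCC layer accumulates byte counts into a caller-visible stats object as part of each operation.

```go
// Hypothetical sketch of an IOStats object threaded through MVCC operations.
// The types below are illustrative only and do not exist in CockroachDB.
package mvccio

// IOStats accumulates the I/O performed on behalf of a single request.
type IOStats struct {
	BytesRead    uint64 // sstable/block bytes read for this operation
	BytesWritten uint64 // bytes written (e.g. WAL/memtable) for this operation
}

// Add merges another operation's stats into this one, so callers can sum
// stats across the MVCC operations issued for a single request.
func (s *IOStats) Add(o IOStats) {
	s.BytesRead += o.BytesRead
	s.BytesWritten += o.BytesWritten
}

// GetResult mirrors the idea of adding an IOStats field to MVCCGetResult so
// the KV layer can attribute I/O per range and per node.
type GetResult struct {
	Value   []byte
	IOStats IOStats
}
```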

nicktrav commented 1 year ago

> Can we add "IO attribution" to ranges?

Certainly. We have cockroachdb/cockroach#98377 already for this, which is planned for the 23.2 cycle.

> I think the best way to do this is to pass an "IOStats" object into all the MVCC operations, which would update it as part of the operation.

Shall we copy those suggestions over to cockroachdb/cockroach#98377 too? Perhaps we can make that issue more general.

andrewbaptist commented 1 year ago

That is a slightly different ask. What I mean specifically is that a call to MVCCGet would track how much IO it performed, measured in bytes read and/or written (this could be one or two stats). So, for instance, MVCCGetResult would gain a new IOStats field carrying this information. I would expect this to be somewhat correlated with CPU, but if systems are "IO starved" we may want to make rebalancing decisions based on "IO heavy" ranges rather than "CPU heavy" ones.

If we implement #98377 we should also add the IOStats related to compaction load.

Initially this information would just show up in the "hot ranges" page, but this is another way a range can be hot. We have seen cases where compaction doesn't happen on clusters because they are primarily write-only, yet the few ranges that do see updates end up with very expensive IO operations to complete. Also, since compaction is a "replica level" decision, not a "range level" one, different stores can have very different IO profiles for the same range. Having this visibility could be very useful.

sumeerbhola commented 1 year ago

We already have InternalIteratorStats.{BlockBytes,BlockBytesInCache} which are reported to KV (and are in traces too). One just needs to aggregate this per range. The storage package does not know about ranges, so I don't think the aggregation should happen there.
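A sketch of that aggregation living above the storage package (the aggregator and RangeID here are hypothetical; only BlockBytes/BlockBytesInCache correspond to Pebble's InternalIteratorStats):

```go
// Hypothetical per-range aggregation of Pebble's iterator stats at the KV
// layer, since the storage package does not know about ranges.
package kvstats

import "sync"

type RangeID int64

type rangeIOStats struct {
	BlockBytes        uint64 // total block bytes loaded on behalf of the range
	BlockBytesInCache uint64 // portion of those bytes served from the block cache
}

type Aggregator struct {
	mu     sync.Mutex
	ranges map[RangeID]*rangeIOStats
}

func NewAggregator() *Aggregator {
	return &Aggregator{ranges: make(map[RangeID]*rangeIOStats)}
}

// Record is called after each MVCC operation with the stats pulled from the
// iterator that served it (e.g. iter.Stats().InternalStats).
func (a *Aggregator) Record(r RangeID, blockBytes, blockBytesInCache uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	s, ok := a.ranges[r]
	if !ok {
		s = &rangeIOStats{}
		a.ranges[r] = s
	}
	s.BlockBytes += blockBytes
	s.BlockBytesInCache += blockBytesInCache
}
```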

jbowens commented 1 year ago

Uncompacted point tombstones have several effects:

The effect on I/O is visible in the telemetry cluster over the past month. Two stores have an outsized accumulation of point tombstones, and the sum of uncached block loads mirrors the graph of tombstones:

[Two screenshots from 2023-05-26: per-store point tombstone counts and uncached block loads over the past month]

Interestingly, the graph of total block loads (cached + uncached) does not show any noticeable trend. I'm not sure why, other than maybe it's a consequence of where in the keyspace these point tombstones are and the corresponding workload.

The I/O effects should be reduced by our space-amplification heuristics, especially with the improved accuracy from recent changes like #2340. These heuristics strive to keep the space amplification of point tombstones down to 10% of the overall LSM size. However, tombstones might not be evenly distributed across the keyspace; the present situation on the telemetry cluster suggests as much. So 10% amplification concentrated in one region of the keyspace may still increase tail latencies intolerably.

So in general, I think our goal here is to reduce the densities of tombstones, in particular in areas of the keyspace that are read. The I/O effects are more pronounced on the sstables containing the deleted keys, rather than the ones containing the tombstones, since the former may encode values as well. As a result, I think this would be tricky to incorporate into a heuristic. Like read-triggered compactions, we could try to reduce densities of tombstones only when they're observed by an iterator. But point tombstones can be disruptive to latencies when they're not read frequently, because those reads will often find the blocks absent from the cache (as apparently is the case in the telemetry cluster).

I'm thinking we should instead focus on reducing tombstone densities, irrespective of iteration. Still thinking.
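One possible shape for a read-independent trigger (purely a hypothetical sketch, with a made-up threshold): score files by the fraction of their entries that are point tombstones, and schedule low-priority elision compactions when that fraction is high, even if level size ratios wouldn't otherwise pick them.

```go
// Hypothetical tombstone-density heuristic: flag sstables whose entries are
// mostly point tombstones for elision compactions, independent of whether
// reads ever observe them. The type and threshold are illustrative only.
package densitycheck

type tableProps struct {
	NumEntries   uint64 // total internal keys in the table
	NumDeletions uint64 // point tombstones among them
}

// tombstoneDensity returns the fraction of the table's entries that are
// point tombstones.
func tombstoneDensity(p tableProps) float64 {
	if p.NumEntries == 0 {
		return 0
	}
	return float64(p.NumDeletions) / float64(p.NumEntries)
}

// wantsElision reports whether the table's tombstone density exceeds a
// (made-up) threshold at which we'd schedule a low-priority compaction.
func wantsElision(p tableProps) bool {
	const densityThreshold = 0.5
	return tombstoneDensity(p) > densityThreshold
}
```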

jbowens commented 1 year ago

I've repeatedly seen that new nodes seeded with replication snapshots, or new clusters seeded with an IMPORT, are susceptible to accumulating a large quantity of point tombstones because their LSMs are too well shaped: we won't pursue any compactions from L5 to L6 because L5 is way too small, and we won't elide any point tombstones until we begin to trigger L5->L6 compactions. Ignore the gap in data, but these graphs are illustrative:

[Two screenshots from 2023-06-12 illustrating the tombstone accumulation described above]

Another consideration (from this slack thread) is that tombstones can reduce block cache locality for reads, even if the tombstones are outside the scope of those reads. With expiration-based leases, we have N hot keys that are repeatedly read and written. Sandwiched between these N hot keys, we have the raft log, which is continually accumulating entries and then truncating them with point tombstones. I believe there are typically very few live keys between these N hot keys. But accumulation of point tombstones within the raft log can push these N hot keys into N separate sstable blocks, trashing the block cache with blocks that are only useful for reading these single keys. This can happen while never reading the raft log and never surfacing the point tombstones to the pebble.Iterator.

jbowens commented 1 year ago

This could have helped in cockroachlabs/support#2628.

nicktrav commented 1 year ago

Linking cockroachlabs/support#2640, relevant to prioritizing compactions of the liveness range.

anish-shanbhag commented 4 months ago

Adding some initial results from the benchmark in #2657 for future reference:

```
BenchmarkPointDeletedSwath
    iterator_test.go:2865: Populating keyspace with 12356630 keys, each with 256-byte values
...
BenchmarkPointDeletedSwath/gap=100/prefix_point_lookup-10             175282          5870 ns/op          72 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=100/non-prefix_point_seek-10           150981          7811 ns/op          72 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=100/full_scan-10                            1    1771725125 ns/op       36552 B/op        466 allocs/op
BenchmarkPointDeletedSwath/gap=1000/prefix_point_lookup-10            190581          6176 ns/op          72 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=1000/non-prefix_point_seek-10          156277          7251 ns/op          72 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=1000/full_scan-10                           1    1787702625 ns/op       35824 B/op        450 allocs/op
BenchmarkPointDeletedSwath/gap=10000/prefix_point_lookup-10           184526          6061 ns/op          72 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=10000/non-prefix_point_seek-10         139006          8142 ns/op          72 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=10000/full_scan-10                          1    1794427709 ns/op       34584 B/op        362 allocs/op
BenchmarkPointDeletedSwath/gap=400000/prefix_point_lookup-10          183644          6184 ns/op          75 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=400000/non-prefix_point_seek-10          4581       1332878 ns/op         221 B/op          5 allocs/op
BenchmarkPointDeletedSwath/gap=400000/full_scan-10                         1    1777834667 ns/op      639288 B/op       6896 allocs/op
BenchmarkPointDeletedSwath/gap=1000000/prefix_point_lookup-10         177470          6248 ns/op          75 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=1000000/non-prefix_point_seek-10          200       9114459 ns/op        3520 B/op         41 allocs/op
BenchmarkPointDeletedSwath/gap=1000000/full_scan-10                        1    1836996416 ns/op      680728 B/op       7317 allocs/op
BenchmarkPointDeletedSwath/gap=10000000/prefix_point_lookup-10        116256          9893 ns/op          82 B/op          4 allocs/op
BenchmarkPointDeletedSwath/gap=10000000/non-prefix_point_seek-10           1    1865234250 ns/op     1262752 B/op      13097 allocs/op
BenchmarkPointDeletedSwath/gap=10000000/full_scan-10                       1    2557187541 ns/op     1262584 B/op      13148 allocs/op
```

SeekPrefixGE and full scans only see a ~50-90% slowdown even with a swath of 10 million tombstones, while non-prefix SeekGE slows down drastically as the tombstone count increases. It looks like a critical point lies between 10,000 and 400,000 tombstones, where the seek cost jumps from 8142 ns/op to 1332878 ns/op.
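For anyone who wants to poke at this outside the benchmark harness, here is a rough standalone sketch of the same setup (much smaller scale than the benchmark in #2657, and not its actual code; note that NewIter's signature differs between Pebble versions):

```go
// Sketch: write a keyspace, delete a contiguous swath of keys in the middle,
// then time a non-prefix SeekGE that lands at the start of the swath and must
// skip every tombstone before surfacing the next live key.
package main

import (
	"fmt"
	"time"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	key := func(i int) []byte { return []byte(fmt.Sprintf("key-%09d", i)) }
	const total, swathStart, swathEnd = 500_000, 200_000, 300_000

	for i := 0; i < total; i++ {
		if err := db.Set(key(i), []byte("value"), pebble.NoSync); err != nil {
			panic(err)
		}
	}
	for i := swathStart; i < swathEnd; i++ {
		if err := db.Delete(key(i), pebble.NoSync); err != nil {
			panic(err)
		}
	}
	if err := db.Flush(); err != nil {
		panic(err)
	}

	iter, err := db.NewIter(nil) // NOTE: older Pebble versions return only *Iterator here.
	if err != nil {
		panic(err)
	}
	defer iter.Close()

	start := time.Now()
	valid := iter.SeekGE(key(swathStart))
	fmt.Printf("SeekGE across %d tombstones: valid=%t in %s\n",
		swathEnd-swathStart, valid, time.Since(start))

	stats := iter.Stats()
	fmt.Printf("block bytes loaded: %d (in cache: %d)\n",
		stats.InternalStats.BlockBytes, stats.InternalStats.BlockBytesInCache)
}
```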