cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

storage: dropping a large table will brick a cluster due to compactions #24029

Closed benesch closed 5 years ago

benesch commented 6 years ago
$ roachprod create USER-FOO -n10
$ roachprod run USER-FOO 'mkdir -p /mnt/data1/cockroach && gsutil -m -q cp -r gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/stores=10/$((10#$(hostname | grep -oE [0-9]+$)))/* /mnt/data1/cockroach'

Wait ~10m for stores to download. Then drop the 2TiB table:

ALTER TABLE bank.bank EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 30}';
DROP TABLE bank.bank;

The cluster explodes a few minutes later as RocksDB tombstones pile up. I can no longer execute any SQL queries that read from/write to disk.

Very closely related to #21901, but thought I'd file a separate tracking issue.

/cc @spencerkimball

benesch commented 6 years ago

Ok, I'm counting 22k range deletion tombstones in aggregate:

$ roachprod run benesch-drop-2 -- grep -a -A 10000 "'Range deletions:'" /mnt/data1/cockroach/*.txt \| grep -a HEX \| wc -l | tail -n+2 | cut -f2 -d: | paste -sd+ - | bc
22194

Broken down by node:

$ roachprod run benesch-drop-2 -- grep -a -A 10000 "'Range deletions:'" /mnt/data1/cockroach/*.txt \| grep -a HEX \| wc -l 
benesch-drop-2: grep -a -A 10000 'Range del... 10/10
   1: 61
   2: 2134
   3: 2342
   4: 2292
   5: 2252
   6: 5080
   7: 34
   8: 862
   9: 2992
  10: 4145

I'm surprised there's such a high variance—the replicas looked balanced when I dropped the table.

bdarnell commented 6 years ago

Breaking it down on a per-sst basis, most of the 4k ssts have zero tombstones. There are a handful that have hundreds:

007620_dump.txt     1
007840_dump.txt     1
008611_dump.txt     1
008641_dump.txt     1
008681_dump.txt     1
008703_dump.txt     1
008826_dump.txt     1
008822_dump.txt     106
008817_dump.txt     112
007623_dump.txt     2
008596_dump.txt     2
008820_dump.txt     241
008818_dump.txt     247
007615_dump.txt     3
008824_dump.txt     3
008728_dump.txt     4
008816_dump.txt     4
008823_dump.txt     4
008825_dump.txt     6
008814_dump.txt     632
008819_dump.txt     662
008815_dump.txt     7
008821_dump.txt     92

Digging into individual values on 008819, we see that the end key of one range and the start key of the next are adjacent in cockroach terms but different at the rocksdb level. This is the start and end key of two adjacent tombstones:

{/Table/51/1/8035458/0 0.000000000,0}
{/Table/51/1/8036092 0.000000000,0}
{/Table/51/1/8036092/0 0.000000000,0}
{/Table/51/1/8039365 0.000000000,0}

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?
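
For concreteness, here is what that question looks like at the RocksDB API level. This is only an illustrative sketch (the function name and the keys "a", "b", "c" are placeholders, not anything CockroachDB does today): two DeleteRange calls whose endpoints line up exactly could, in principle, be represented by a single [a, c) tombstone at compaction time, provided no open snapshot still needs to tell their sequence numbers apart.

#include <cassert>

#include "rocksdb/db.h"

// Illustrative only: two adjacent range deletions issued as separate writes.
void IllustrateAdjacentTombstones(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  rocksdb::WriteOptions wo;
  assert(db->DeleteRange(wo, cf, "a", "b").ok());  // tombstone #1: [a, b)
  assert(db->DeleteRange(wo, cf, "b", "c").ok());  // tombstone #2: [b, c)
  // With the current CockroachDB key encoding, the first tombstone ends at
  // .../8036092 while the second starts at .../8036092/0, leaving a tiny hole
  // that rules out any such merge.
}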

tbg commented 6 years ago

Let's make sure this becomes a variant of the drop roachtest!

benesch commented 6 years ago

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

Do you think that would help that much? A sufficiently scattered table can always produce maximally discontiguous ranges.

bdarnell commented 6 years ago

True, but empirically it would make a big difference in this case. A very large fraction of the range tombstones in 008819 are contiguous (or would be with this change).

Another upstream fix would be for rocksdb to prioritize sstables with multiple range tombstones for compaction. These are likely to be easy to compact away to the lowest level, and leaving them in higher levels is disproportionately expensive.
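
As a sketch of what that upstream fix might look like (this is not the hack referenced later in the thread; the class name and threshold are made up, and it assumes the table builder reports range-deletion entries to property collectors as kEntryRangeDeletion), a custom TablePropertiesCollector could flag tombstone-heavy SSTs for compaction via NeedCompact():

#include <string>

#include "rocksdb/table_properties.h"
#include "rocksdb/types.h"

// Counts range deletions in an SST and asks RocksDB to compact the file once the
// count crosses a threshold.
class RangeDelCountingCollector : public rocksdb::TablePropertiesCollector {
 public:
  explicit RangeDelCountingCollector(uint64_t trigger) : trigger_(trigger) {}

  rocksdb::Status AddUserKey(const rocksdb::Slice& /*key*/,
                             const rocksdb::Slice& /*value*/,
                             rocksdb::EntryType type,
                             rocksdb::SequenceNumber /*seq*/,
                             uint64_t /*file_size*/) override {
    if (type == rocksdb::kEntryRangeDeletion) {
      ++range_deletions_;
    }
    return rocksdb::Status::OK();
  }

  rocksdb::Status Finish(rocksdb::UserCollectedProperties* /*props*/) override {
    return rocksdb::Status::OK();
  }

  rocksdb::UserCollectedProperties GetReadableProperties() const override {
    return {{"range_deletions", std::to_string(range_deletions_)}};
  }

  const char* Name() const override { return "RangeDelCountingCollector"; }

  // RocksDB marks the file for compaction when this returns true.
  bool NeedCompact() const override { return range_deletions_ >= trigger_; }

 private:
  const uint64_t trigger_;
  uint64_t range_deletions_ = 0;
};

A matching TablePropertiesCollectorFactory would then be registered in Options::table_properties_collector_factories.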

bdarnell commented 6 years ago

Note that I'm speculating about this - I haven't found any code in rocksdb that would join two consecutive range deletions. I'm not sure if it's possible - each range tombstone has a sequence number as a payload, but I think it may be possible for those to be flattened away on compaction.

benesch commented 6 years ago

range_del_aggregator.cc has a collapse_deletions_ flag, but I haven't been able to trace what sets it.

bdarnell commented 6 years ago

The CPU profile showed that we're spending all our time inside the if (collapse_deletions_) block, so it's definitely getting set. But I've been looking at what that flag does and if it joins two adjacent ranges I can't see where it does so.

tbg commented 6 years ago

I experimented with a change that replaced the ClearRange with a (post-Raft, i.e. not WriteBatched) ClearIterRange. Presumably due to the high parallelism with which these queries are thrown at RocksDB by DistSender, the nodes ground to a halt, missing heartbeats and all. So this alone isn't an option. I'll run this again with limited parallelism but that will likely slow it down a lot.

benesch commented 6 years ago

If you have time on your hands, you might also want to experiment with turning on the CompactOnDeletionCollectorFactory. Not my hacked-together version for range deletion tombstones, but the official upstream one for triggering compactions whenever an SST contains too many normal deletion tombstones within a window: https://github.com/facebook/rocksdb/blob/master/utilities/table_properties_collectors/compact_on_deletion_collector.h
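
For reference, wiring up the upstream factory is just an options change; this is a hedged sketch (the window and trigger values are arbitrary placeholders, not tuned recommendations), and note that it counts normal point-deletion tombstones, not the range deletion tombstones written by ClearRange:

#include "rocksdb/options.h"
#include "rocksdb/utilities/table_properties_collectors.h"

rocksdb::Options MakeOptionsWithDeletionTriggeredCompaction() {
  rocksdb::Options options;
  // Mark an SST for compaction once it contains >= 4096 point deletions within
  // any sliding window of 32768 entries.
  options.table_properties_collector_factories.emplace_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(
          /*sliding_window_size=*/32768, /*deletion_trigger=*/4096));
  return options;
}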

benesch commented 6 years ago

Oooh, check this out: https://github.com/facebook/rocksdb/pull/3635/files

tbg commented 6 years ago

Good find! I'll try those out.

Ping https://github.com/cockroachdb/cockroach/issues/17229#issuecomment-362483650

tbg commented 6 years ago

I tried running with https://github.com/facebook/rocksdb/pull/3635/files (on top of stock master) and it doesn't help. Not sure whether it's because of a bug in the code or because of @benesch's observation that it doesn't really address our problem; either way we see goroutines stuck in RocksDB seeks for >20 minutes at a time (they eventually come back and go into the next syscall, as far as I can tell, but it's clearly not anywhere close to working).

I then added @benesch's PR on top and restarted the nodes in the hope that I would see compaction activity while the node is stuck in initialization. However, that doesn't seem to happen; we only see one single thread maxing out one CPU:

image
tbg commented 6 years ago

Back to the ClearIterRange version, even with an added CompactOnDeletionCollectorFactory it's pretty bad (which makes sense, since we've added more work):

root@localhost:26257/> create database foo;
CREATE DATABASE

Time: 1m49.87228088s
image
goroutine 52246 [syscall, 15 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBIterSeek(0x7f4e511e5100, 0xc4254a2628, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    _cgo_gotypes.go:549 +0x74
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek.func2(0x7f4e511e5100, 0xc4254a2628, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1905 +0xba
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek(0xc42a9b87e0, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1905 +0x16b
github.com/cockroachdb/cockroach/pkg/storage/engine.dbIterate(0x7f4e510990c0, 0x2657fc0, 0xc42be3fb00, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0, 0xc4254a2640, 0x7, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2391 +0x166
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Iterate(0xc42be3fb00, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0, 0xc4254a2640, 0x7, 0x8, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1548 +0xfc
github.com/cockroachdb/cockroach/pkg/storage/batcheval.ClearRange(0x26486a0, 0xc42f06f440, 0x7f4e65c12328, 0xc42be3fb00, 0x26701c0, 0xc424c7c000, 0x151e0641f6a360bb, 0x0, 0x100000001, 0x3, ...)
spencerkimball commented 6 years ago

It might not be a popular suggestion, but we could explore reinstating the original changes to simply drop SSTables for large aggregate suggested compactions. This requires that we set an extra flag on the suggestion indicating it's a range of data that will never be rewritten. This is true after table drops and truncates, but not true after rebalances.

spencerkimball commented 6 years ago

Or am I misunderstanding, and dropping the files would be independent of the range tombstone problem?

tbg commented 6 years ago

I was thinking about that too, I'm just spooked by the unintended consequences this can have as it pulls out the data even from open snapshots.

spencerkimball commented 6 years ago

Do you expect it would alleviate the problem? I have a PR that does it if you want to try.

tbg commented 6 years ago

It's likely that it would alleviate the problem, but actually introducing this is not an option that late in the 2.0 cycle, so I'm investing my energies elsewhere. That said, if you want to try this, go ahead! Would be fun to see it in action.

lingbin commented 6 years ago

As I said in the issue facebook/rocksdb#3634, that PR only addresses the performance of non-first seeks, so @benesch is right: facebook/rocksdb#3635 cannot solve your problem, because cockroachdb uses many iterators and each iterator only performs a small number of seeks.

For this issue, I have some tentative suggestions for consideration:

  1. A table can contain multiple ranges. When dropping a table, delete the entire table span in one operation rather than deleting it range by range, which reduces the number of tombstones. I think this only alleviates the problem, though, because the large tombstone may be split up when RocksDB does a level compaction, and it may be difficult to merge deletions in a rebalance scenario.
  2. If there is a mechanism to guarantee that a dropped range is only truly deleted (i.e., RocksDB's DeleteRange() is only called) once the data is guaranteed never to be accessed again, for example via a delayed deletion mechanism such as waiting out a transaction timeout, then you can set ReadOptions.ignore_range_deletions to true, which makes reads skip range tombstones entirely. Because the deleted data no longer needs to be accessed, it is safe to ignore them (see the sketch after this list).
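
A minimal sketch of what suggestion 2 amounts to at the RocksDB level (assuming the safety condition above holds; CockroachDB does not do this today, and the function name is a placeholder):

#include <memory>

#include "rocksdb/db.h"

// Returns an iterator that skips range deletion tombstones entirely. Only safe if
// no reader will ever ask for keys covered by those tombstones (e.g. a fully
// dropped table whose data is guaranteed to be unreachable).
std::unique_ptr<rocksdb::Iterator> NewIterSkippingRangeTombstones(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  ro.ignore_range_deletions = true;
  return std::unique_ptr<rocksdb::Iterator>(db->NewIterator(ro));
}
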
tbg commented 6 years ago

Thanks @lingbin. We're aware of these options, but it's tricky for us to implement them because we've so far relied on RocksDB giving us consistent snapshots, and we have various mechanisms that check that the multiple copies of the data we keep are perfectly synchronized. That said, somewhere down the road we're going to have to use some technique that involves vastly improving the performance of deletion tombstone seeks (what @benesch said upstream), DeleteFilesInRange, and/or skipping deletion tombstones.
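
For reference, the DeleteFilesInRange option mentioned here looks roughly like the following; this is only the API shape (the function name and key arguments are placeholders), not CockroachDB's integration, and note that it removes whole files regardless of open snapshots, which is exactly the consistency concern raised earlier in the thread:

#include "rocksdb/convenience.h"
#include "rocksdb/db.h"

// Immediately deletes SST files that lie entirely within [table_start, table_end)
// without writing any tombstones. Files straddling the boundaries are untouched.
rocksdb::Status DropTableFiles(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                               const rocksdb::Slice& table_start,
                               const rocksdb::Slice& table_end) {
  return rocksdb::DeleteFilesInRange(db, cf, &table_start, &table_end);
}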

tbg commented 6 years ago

@bdarnell, re the below:

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

For a hail-mary short-term fix, this seems to be a promising avenue if it's true. Were you ever able to figure out whether that should be happening? At least for testing, I can make these holes disappear, but it would be good to know whether it's any good in the first place.

I'm also curious what the compactions necessary to clean up this table would do even if seeks weren't affected at all. Running a manual compaction takes hours on this kind of dataset; we were implicitly assuming that everything would just kind of evaporate and it would be quick, but it doesn't seem to be what's happening here. Maybe this is due to the "holes" we leave in the keyspace, but it's unclear to me. We need this to be reasonably efficient or the ClearRange option isn't one at all, even without the seek problem. Something to investigate.

Ping @a-robinson just in case anything from https://github.com/cockroachdb/cockroach/issues/21528 is applicable here. I believe the dataset there was much smaller, right?

In other news, I ran the experiment with a) ClearIterRange b) tombstone-sensitive compaction and the cluster was a flaming pile of garbage for most of the time, but came out looking good:

image

I'm not advocating that this should be our strategy for 2.0, but it's one of the equally bad strategies we have so far.

benesch commented 6 years ago

Thinking more about ClearRange coalescence, it’s not clear how RocksDB would manage it when they come in as separate commands. If you have ClearRange [a, b) @ t1 and ClearRange [b, c) @ t3, you'd need to preserve both of them for proper handling of any keys @ t2.

tbg commented 6 years ago

RocksDB would keep track of the lowest sequence number still referenced by an open snapshot, and at compaction time it can join tombstones whose sequence numbers are no longer relevant.

a-robinson commented 6 years ago

Ping @a-robinson just in case anything from #21528 is applicable here. I believe the dataset there was much smaller, right?

It took about 5 days of running kv --read-percent=0 --max-rate=1200 before problems started showing up on the 6-node cluster under chaos. I'm not sure if I was using a custom block size or how much data had been generated by the time the problems started.

I don't think there's anything particularly applicable here that we learned from that issue. The main lessons were:

  1. manual compactions are slow
  2. manual compactions of key ranges that have had data written to them since the deletion are particularly slow and expensive
  3. using the exclusive_manual_compaction setting is a bad idea if you care about foreground writes (see the sketch below)
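
A sketch of lesson 3 in RocksDB terms (the function name and key arguments are placeholders): keeping exclusive_manual_compaction off lets automatic compactions, and therefore foreground writes, make progress while the manual compaction runs.

#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Status CompactSpan(rocksdb::DB* db, const rocksdb::Slice& begin,
                            const rocksdb::Slice& end) {
  rocksdb::CompactRangeOptions opts;
  // Defaults to true; leaving it on blocks automatic compactions for the duration
  // of the manual compaction, which is what starves foreground writes.
  opts.exclusive_manual_compaction = false;
  return db->CompactRange(opts, &begin, &end);
}
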
bdarnell commented 6 years ago

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

For a hail-mary short term fix, this seems to be a promising venue if it's true. Were you ever able to figure out if that should be happening?

I haven't found any code that would do this, but I can't rule it out. It would need to happen on compaction but not, I think, on all uses of the range tombstones so I'm not sure I've looked in the right places. Might be worth trying an empirical test.

In other news, I ran the experiment with a) ClearIterRange b) tombstone-sensitive compaction and the cluster was a flaming pile of garbage for most of the time, but came out looking good:

Might make sense as a cluster setting to give people some way to delete large tables without knocking the cluster out completely.

tbg commented 6 years ago

Might make sense as a cluster setting to give people some way to delete large tables without knocking the cluster out completely.

I'm on the fence about that -- with the range deletion tombstones, you can run a full compaction and you're back. That option does not exist with the ClearIterRange change; you're kinda screwed until it clears up by itself.

tbg commented 6 years ago

Digging into individual values on 008819, we see that the end key of one range and the start key of the next are adjacent in cockroach terms but different at the rocksdb level. This is the start and end key of two adjacent tombstones:

@bdarnell, do you understand why that is? For two adjacent ranges, we would first clear a range with the right boundary MVCCKey{Key: middle} and then an adjacent range starting at MVCCKey{Key: middle}. Where does the hole come from?

Also, do you or @benesch still have the magic invocation around to dump the number of range tombstones?

bdarnell commented 6 years ago

From https://github.com/cockroachdb/cockroach/issues/24029#issuecomment-374384340, it appears that the difference is that the start keys contain a /0 column family suffix, while the end keys do not. I haven't investigated why that would be the case.

tbg commented 6 years ago

Still trying to figure out how to dump the tombstones. My best so far (but it doesn't include them):

./cockroach debug rocksdb dump --path=/../000007.sst --stats
benesch commented 6 years ago

You want to compile the sst_dump program (it’s not part of ldb for whatever reason) and run that on each of the SST files. I forget exactly what options you need but it should be pretty obvious. cd c-deps/rocksdb && make sst_dump

tbg commented 6 years ago

FWIW, these are the (compaction queue) compactions I'm seeing while running this.

$ roachprod ssh tobias-drop:1-10 'grep "processing compact" logs/cockroach.log'
tobias-drop: grep "processing compact" l... 10/10
   1: I180519 06:08:35.644623 260 storage/compactor/compactor.go:300  [n1,s1] processing compaction #1-372/589 (/Table/51/1/0/0-/Table/51/1/2528853) for 9.3 GiB (reasons: size=true used=false avail=false)
   2: I180519 06:08:33.889164 174 storage/compactor/compactor.go:300  [n8,s8] processing compaction #1-370/718 (/Min-/Table/51/1/3470314) for 8.1 GiB (reasons: size=true used=false avail=false)
I180519 06:10:15.772722 174 storage/compactor/compactor.go:300  [n8,s8] processing compaction #600-628/718 (/Table/51/1/40766258/0-/Table/51/1/40860289) for 427 MiB (reasons: size=true used=false avail=false)
   3: I180519 06:09:07.943102 872 storage/compactor/compactor.go:300  [n4,s4] processing compaction #1-725/2031 (/Table/51/1/311532/0-/Table/51/1/6289196) for 18 GiB (reasons: size=true used=true avail=false)
   4: I180519 06:08:36.240354 395 storage/compactor/compactor.go:300  [n3,s3] processing compaction #1-340/524 (/System/tsd-/Table/51/1/2258625/0) for 7.4 GiB (reasons: size=true used=false avail=false)
I180519 06:12:57.768175 395 storage/compactor/compactor.go:300  [n3,s3] processing compaction #349-450/524 (/Table/51/1/2623759-/Table/51/1/3487708) for 2.1 GiB (reasons: size=true used=false avail=false)
   5: I180519 06:08:34.510402 207 storage/compactor/compactor.go:300  [n6,s6] processing compaction #1-267/562 (/Table/51/1/6546/0-/Table/51/1/2528853/0) for 6.7 GiB (reasons: size=true used=false avail=false)
   6: I180519 06:08:46.769268 898 storage/compactor/compactor.go:300  [n5,s5] processing compaction #1-672/2300 (/Table/51/1/0/0-/Table/51/1/4442685/0) for 14 GiB (reasons: size=true used=false avail=false)
   7: I180519 06:08:34.357633 168 storage/compactor/compactor.go:300  [n7,s7] processing compaction #1-303/507 (/Table/13-/Table/51/1/3034152/0) for 7.9 GiB (reasons: size=true used=false avail=false)
I180519 06:12:40.379005 168 storage/compactor/compactor.go:300  [n7,s7] processing compaction #1-976/1135 (/Table/51/1/1808697/0-/Table/51/1/12112569/0) for 26 GiB (reasons: size=true used=true avail=true)
I180519 06:14:08.812506 168 storage/compactor/compactor.go:300  [n7,s7] processing compaction #976-1023/1135 (/Table/51/1/12169617-/Table/51/1/13091303) for 1.3 GiB (reasons: size=true used=false avail=false)
   8: I180519 06:08:35.980654 236 storage/compactor/compactor.go:300  [n10,s10] processing compaction #1-326/502 (/System/""-/Table/51/1/3467041) for 8.7 GiB (reasons: size=true used=false avail=false)
   9: I180519 06:09:12.348539 799 storage/compactor/compactor.go:300  [n2,s2] processing compaction #1-760/1716 (/System/tsd-/Table/51/1/6864500) for 19 GiB (reasons: size=true used=true avail=true)
  10: I180519 06:08:35.364444 233 storage/compactor/compactor.go:300  [n9,s9] processing compaction #1-392/578 (/Table/51/1/9819/0-/Table/51/1/3399280) for 10 GiB (reasons: size=true used=false avail=false)
tbg commented 6 years ago

(This shows the fragmentation -- one of the compactions is for 427 MiB, and another for 19 GiB -- it depends on how many contiguous replicas you have.) Pushing ClearRange into the compactor should buy a constant factor, which seems to be kind of large (~10 GiB/64 MiB), but it doesn't technically solve the problem of too many tombstones.

tbg commented 6 years ago

(it actually does technically solve the problem if you compact after putting down each tombstone, as it guarantees that there's only ever one range tombstone originating from the DROP in RocksDB, at least if the compactions always "resolve" the tombstone -- which I hope they do -- and don't just carry it into higher levels).

I'll try the approach this week.

benesch commented 6 years ago

RocksDB->CompactRange, assuming that’s actually what we call, claims to compact all the way down to the lowest level, so the tombstones should definitely be dropped (modulo bugs in RDB): https://github.com/facebook/rocksdb/wiki/Manual-Compaction
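
As a sketch (whether CockroachDB sets this is an assumption on my part, not something established in this thread), forcing the bottommost level to be rewritten is what guarantees the range tombstones are actually resolved rather than merely moved down:

#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Status CompactTombstonesAway(rocksdb::DB* db, const rocksdb::Slice& begin,
                                      const rocksdb::Slice& end) {
  rocksdb::CompactRangeOptions opts;
  // kForce rewrites bottommost-level files even when there is nothing above them
  // to merge in, so covered data and the tombstones themselves get dropped.
  opts.bottommost_level_compaction = rocksdb::BottommostLevelCompaction::kForce;
  return db->CompactRange(opts, &begin, &end);
}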

tbg commented 6 years ago

The more I look at this, the more it seems that we don't even need range tombstones to make this kind of thing a mess.

I'm running a custom branch in which ClearRange doesn't do anything except make the suggestion to the compactor, and the compactor adds a range tombstone and immediately compacts it away, plus some other hacks like disabling the consistency checker and never aggregating compactions that are not adjacent because that would delete lots of data now.

The result: still a broken cluster. Initially (at night) there were compactions that took a long time; now (in the morning) there's a steady stream of small compactions, each taking around 1s. This happens because the logical bytes of the store are now ~0, so even a 32 MiB compaction triggers the heuristic that is supposed to reclaim space for very small stores (which this is not). That this effectively strangles the cluster is troubling, because it implies that a few 32 MiB compactions back to back will have a troubling effect. (@bdarnell is this perhaps related to the large MANIFEST files? The first node is on /mnt/data1/cockroach/MANIFEST-009972, which seems like a high number.)

I180522 10:17:10.335792 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6181/6205 (/Table/51/1/62105934/0-/Table/51/1/62109206) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:11.317257 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6181/6205 (/Table/51/1/62105934/0-/Table/51/1/62109206) for 32 MiB in 981.411309ms
I180522 10:17:11.317318 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6182/6205 (/Table/51/1/62109206/0-/Table/51/1/62112478) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:12.139546 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6182/6205 (/Table/51/1/62109206/0-/Table/51/1/62112478) for 32 MiB in 822.211223ms
I180522 10:17:12.139579 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6183/6205 (/Table/51/1/62112478/0-/Table/51/1/62115750) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:12.981165 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6183/6205 (/Table/51/1/62112478/0-/Table/51/1/62115750) for 32 MiB in 841.533492ms
I180522 10:17:12.981194 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6184/6205 (/Table/51/1/62115750/0-/Table/51/1/62119022) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:13.842772 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6184/6205 (/Table/51/1/62115750/0-/Table/51/1/62119022) for 32 MiB in 861.534854ms
I180522 10:17:13.842802 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6185/6205 (/Table/51/1/62119022/0-/Table/51/1/62122294) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:14.686480 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6185/6205 (/Table/51/1/62119022/0-/Table/51/1/62122294) for 32 MiB in 843.570175ms
I180522 10:17:14.686522 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6186/6205 (/Table/51/1/62122294/0-/Table/51/1/62125566) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:15.533905 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6186/6205 (/Table/51/1/62122294/0-/Table/51/1/62125566) for 32 MiB in 847.347856ms
I180522 10:17:15.533935 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6187/6205 (/Table/51/1/62125566/0-/Table/51/1/62128838) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:16.412174 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6187/6205 (/Table/51/1/62125566/0-/Table/51/1/62128838) for 32 MiB in 878.192735ms
I180522 10:17:16.412220 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6188/6205 (/Table/51/1/62128838/0-/Table/51/1/62128931) for 933 KiB (reasons: size=false used=true avail=false)
I180522 10:17:17.135783 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6188/6205 (/Table/51/1/62128838/0-/Table/51/1/62128931) for 933 KiB in 723.501353ms
I180522 10:17:17.135812 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6189/6205 (/Table/51/1/62128931/0-/Table/51/1/62132203) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:17.992488 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6189/6205 (/Table/51/1/62128931/0-/Table/51/1/62132203) for 32 MiB in 856.625562ms
I180522 10:17:17.992517 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6190/6205 (/Table/51/1/62132203/0-/Table/51/1/62135475) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:18.853890 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6190/6205 (/Table/51/1/62132203/0-/Table/51/1/62135475) for 32 MiB in 861.308991ms
I180522 10:17:18.853920 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6191/6205 (/Table/51/1/62135475/0-/Table/51/1/62137901) for 24 MiB (reasons: size=false used=true avail=false)
I180522 10:17:19.669350 270 storage/compactor/compactor.go:322  [n1,s1] processed compaction #6191/6205 (/Table/51/1/62135475/0-/Table/51/1/62137901) for 24 MiB in 815.383245ms
I180522 10:17:19.669378 270 storage/compactor/compactor.go:300  [n1,s1] processing compaction #6192/6205 (/Table/51/1/62137901/0-/Table/51/1/62141173) for 32 MiB (reasons: size=false used=true avail=false)
I180522 10:17:19.997260 355 server/status/runtime.go:219  [n1] runtime stats: 5.5 GiB RSS, 706 goroutines, 371 MiB/44 MiB/538 MiB GO alloc/idle/total, 3.9 GiB/5.0 GiB CGO alloc/total, 462.10cgo/sec, 0.94/0.11 %(u

The cluster is still unhappy. I disabled the compactor and it took ~30s to set the cluster setting. So at the end of the day, it looks like these tombstones may just be another 1000 nails in a coffin that already has lots of nails in it.

By the way, I really like the health alerts I recently introduced:

W180522 06:39:30.125437 357 server/node.go:802  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:requests.slow.distsender Value:67} {StoreID:1 Category:METRICS Description:queue.raftsnapshot.process.failure Value:1}]}
[...]
W180522 06:41:41.961410 357 server/node.go:802  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:requests.slow.distsender Value:67}]}
[...]
W180522 10:26:20.088564 357 server/node.go:802  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:requests.slow.distsender Value:94} {StoreID:1 Category:METRICS Description:queue.raftlog.process.failure Value:1} {StoreID:1 Category:METRICS Description:requests.slow.lease Value:12}]}

I was hoping that the cluster would become healthy again after disabling the compactor, but it hasn't during the last 10min.

tbg commented 6 years ago

For anyone who can read them (@petermattis?) here is a compaction stats dump that is pretty typical of others while the compactor was running:

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.2      0.2       0.0   1.0      0.0      2.5        72      1640    0.044       0      0
  L2      0/0    0.00 KB   0.0      0.3     0.2      0.1       0.2      0.1       0.0   1.4     25.1     22.5        10         5    2.091     12M    16K
  L3      0/0    0.00 KB   0.0      0.3     0.1      0.2       0.3      0.1       0.1   2.9     68.4     68.1         4         4    1.051    461K    40K
  L4      0/0    0.00 KB   0.0      1.6     0.2      1.4       1.6      0.2       0.1   6.8     66.4     65.2        25         6    4.204   1127K    50K
  L5      1/1    1.41 KB   0.0      2.8     1.6      1.1       2.7      1.6       0.0   1.7     25.2     25.0       112      1333    0.084     17M    40K
  L6    780/1   67.63 GB   0.0    412.0     2.1    409.9     339.3    -70.7       0.0 163.4     53.3     43.9      7911      1566    5.052    458M    19M
 Sum    781/2   67.63 GB   0.0    416.9     4.2    412.7     344.3    -68.4       0.2 1965.5     52.5     43.3      8136      4554    1.786    490M    20M
 Int      0/0    0.00 KB   0.0     45.6     0.0     45.6      45.6      0.0       0.0 66037.6     82.9     82.9       562      1398    0.402    140M   3667
Uptime(secs): 8437.5 total, 600.0 interval
Flush(GB): cumulative 0.175, interval 0.001
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 344.29 GB write, 41.78 MB/s write, 416.94 GB read, 50.60 MB/s read, 8135.7 seconds
Interval compaction: 45.56 GB write, 77.76 MB/s write, 45.56 GB read, 77.76 MB/s read, 562.4 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

Looks like over 600s we write ~50 GB to disk, at ~70 MB/s reads and ~70 MB/s writes. That seems pretty hefty. The original idea with the tombstones was that compacting them down would be cheap, as they delete everything. In what I'm doing, where I removed compactions, this is not true of course, because the compactions are rather small (and so you rewrite each bottom SST a few, maybe 4-6, times).

I can improve my prototype so that it aggregates over gaps as well, without covering them with tombstones, to see how that would fare.

What else should I be looking at to evaluate this performance? It feels like our toolbox for understanding these issues is not great, because we don't reason about the RocksDB layer very well.

tbg commented 6 years ago

Ah, the cluster is still recovering because the Raft snapshot queue is backed up. We might be limited by the quota pool which then causes requests to be held back.

image
tbg commented 6 years ago

... and I won't be able to investigate the quota pool because neither does it have a cluster setting nor does it export any metrics. Womp womp. Filed https://github.com/cockroachdb/cockroach/issues/25799.

Also, I just saw a flurry of these fly by:

W180522 11:20:50.602707 1626794 kv/dist_sender.go:1301  [n1] have been waiting 1m0s sending RPC to r42488 (currently pending: [(n3,s3):1]) for batch: ClearRange [/Table/51/1/64734528/0,/Table/51/1/64737800)

I think what's happening here is that the schema change didn't run to completion and so now it's running again. @bdarnell that seems like a good explanation for the MANIFEST file growth seen elsewhere -- it just retries and retries, and if the thing being dropped is just one SST, you're going to rewrite that SST over and over, over time. What I don't understand is, if that's the case, why the schema change wouldn't succeed. I can understand why it wouldn't in the experiment which I'm running.

petermattis commented 6 years ago

@tschottdorf Re: compaction stats. A W-Amp number of 1965.5 is huge. This is the write amplification number (I'm not sure precisely how it is collected) and is related to how many times a particular byte of data is rewritten. Generally you want to see this in the low double digits. This is a pretty good indication that we're doing something very wrong with our compaction suggestions.

bdarnell commented 6 years ago

That this effectively strangles the cluster is troubling, because it implies that a few 32 MiB compactions back to back will have a troubling effect.

Note that it's misleading to think of these as 32MB compactions. They're rewriting at least one 128MB sstable, and likely two.

But my guess is that the problem has less to do with the amount of IO and more to do with the level of synchronization required to do any compaction. This argues for being much more conservative in how often we trigger manual compactions. Compactions don't really become "cheap" until they can discard entire sstables (and even then unless the boundaries line up they may have to rewrite adjacent SSTs).

is this perhaps related to the large MANIFEST files? The first node is on /mnt/data1/cockroach/MANIFEST-009972 which seems like a high number

Yes, it's related, because every compaction increases the size of the manifest file (the number in the manifest filename doesn't mean much, because all rocksdb filenames draw their numbers from a shared sequence; what matters is the highest number across every file in the directory).

bdarnell commented 6 years ago

I looked at the rocksdb code to see what sort of synchronization is involved in a compaction and found that DBImpl::RunManualCompaction holds DBImpl::mutex_ for what looks like the bulk of its running time. On its face, this looks extremely expensive, although there are other indications (the exclusive_manual_compaction option, which we set to false) that suggest manual compactions are expected to run concurrently with other operations. Note that background compactions do not appear to acquire this lock, but reads and writes do.

bdarnell commented 6 years ago

On its face, this looks extremely expensive

D'oh, nevermind. It actually spends nearly all of its time in a CondVar::Wait with the lock released.

petermattis commented 6 years ago

Well, I can reproduce badness using #25837. Not sure what is going on other than a shit-ton of disk activity (both reads and writes). Presumably, that is out of control compactions, but I haven't had time to verify that.

petermattis commented 6 years ago

I'm not at all familiar with this issue or the debugging that has gone on so far or even the systems involved, so I started working on it from the beginning. First, I'm able to reproduce the cluster wedging using the clearrange roachtest from #25837 with the compactor.enabled cluster setting turned off. So whatever is happening, it isn't the compaction queue going crazy (unless there is something broken with the cluster setting). I then noticed that when the cluster wedges there are a lot of ClearRange slow RPC messages. Thousands of them. Tracing through the code, I see that dropping a table results in a single ClearRange for the entire table key range:

I180525 23:39:16.142667 257 sql/tablewriter_delete.go:184  [n9] ClearRange /Table/51 - /Table/52

DistSender is presumably splitting this single operation into thousands of requests to the 10s of thousands of ranges in this table. Interestingly, this is not the only ClearRange operation logged from that line of code in the cluster:

I180525 23:24:02.838137 314 sql/tablewriter_delete.go:184  [n3] ClearRange /Table/51 - /Table/52
I180525 23:29:07.004769 303 sql/tablewriter_delete.go:184  [n10] ClearRange /Table/51 - /Table/52
I180525 23:34:11.599280 252 sql/tablewriter_delete.go:184  [n7] ClearRange /Table/51 - /Table/52

Notice that the timestamps are ~5min apart. I believe this corresponds to the 5min SchemaChangeLeaseDuration. I haven't verified it yet, but I'd be willing to bet that the ClearRange operation is taking longer than 5min to complete, which results in the schema change lease for this table being picked up by another node, which then executes another ClearRange operation that again takes longer than 5min, and so on. Why this eventually wedges the cluster is unclear.

I'm going to try bumping SchemaChangeLeaseDuration to a much higher value to see if that allows the drop to succeed without wedging the cluster. I'm also going to turn on additional logging at the DistSender level to make sure that the ClearRange operation is making progress and not retrying RPCs for individual ranges over and over.

petermattis commented 6 years ago

After bumping SchemaChangeLeaseDuration to 100min I no longer see the multiple ClearRange operations from tablewriter_delete.go. Unfortunately, the cluster still wedged. DistSender log messages indicate that it was processing several thousand RPCs per minute. Considering there are ~40k ranges in this test, the expected time for the ClearRange operation is ~7min. I accidentally fat-fingered the cluster logs, but before doing so I saw DistSender log messages showing it was still processing the ClearRange for more than 20min.

tbg commented 6 years ago

There are likely multiple levels of badness here, with each one causing problems individually:

  1. laying down many RocksDB range tombstones, which slow down RocksDB to a crawl
  2. compactions
  3. the extremely large cross-range RPC that may not finish fast enough.

My past experiments were a) no compactor b) lay down ClearRange just before compacting (though that resulted in frequent small compactions in my prototype due to the way I hacked it together). I had also noticed that the schema changer seemed to time out, but at that point it wasn't clear why.

The ClearRange RPCs also have a code path that recomputes stats, so if that gets hit a lot (it shouldn't), some slowness may also be explained by it.

As a smoke test, with a return inserted here, the roachtest should pass trivially -- the schema-change turns into a sophisticated no-op.

As a second (more involved) smoke test, you can remove the batch.ClearRange and instead let the compactor do it just as it aggregates the suggestion into a compaction, which minimizes the amount of time the tombstone is around for. I had a bad WIP for this (which didn't deal well with the feature of the compactor that compacts across "gaps" between the suggestions) and the result was.. not good.

I think one general problem here is that the original thinking was that the compactions would be really cheap because the keyspace compacted is completely covered by RocksDB range tombstones. But that's not true because each node has a more-or-less random subset of replicas, and so the compactions we run typically touch either only a part of an SST, or span multiple SSTs (but not gapless range tombstones).

If this is indeed the problem, we could introduce a smarter mechanism that "covers up" the holes in the keyspace with fresh range deletion tombstones, so that the compaction (in the case of the DROP) really gets to see a completely range-deleted keyspace. I wasn't able to play with this because the naive way of prototyping it (running engine.ClearRange on the key span to be compacted) trips (*Replica).assertStateLocked. Not sure if that's because some replicas receive ClearRange really late, or because the span got extended beyond that of the dropped table.

spencerkimball commented 6 years ago

@tschottdorf have you guys bothered to give this a try: https://github.com/cockroachdb/cockroach/pull/24137

petermattis commented 6 years ago

@tschottdorf I see a reference above to a CPU profile showing badness in RocksDB due to large numbers of range tombstones. How was that gathered? I haven't had a chance to poke at this any more this weekend, but so far I haven't seen evidence of RocksDB badness (to be fair, I haven't really been looking). I'm mentioning this because I'm working my way down towards RocksDB and want to understand what evidence led to the conclusion that large numbers of range tombstones are problematic.

I think one general problem here is that the original thinking was that the compactions would be really cheap because the keyspace compacted is completely covered by RocksDB range tombstones. But that's not true because each node has a more-or-less random subset of replicas, and so the compactions we run typically touch either only a part of an SST, or span multiple SSTs (but not gapless range tombstones).

If this is indeed the problem, we could introduce a smarter mechanism that "covers up" the holes in the keyspace with fresh range deletion tombstones, so that the compaction (in the case of the DROP) really gets to see a completely range-deleted keyspace. I wasn't able to play with this because the naive way of prototyping it (running engine.ClearRange on the key span to be compacted) trips (*Replica).assertStateLocked. Not sure if that's because some replicas receive ClearRange really late, or because the span got extended beyond that of the dropped table.

I'd like to better understand what is going on at the RocksDB level before thinking too much about possible solutions. I worry that we could be papering over the real problem.

tbg commented 6 years ago

@petermattis When Nikhil originally encountered the problem, he tried to restart one of the affected nodes and it never booted up, because iterating over the replicas range descriptors was extremely slow. I think a CPU profile showed activity in a method related to tombstones, and upstream confirms that these are handled in a wildly inefficient manner (plus, we found hundreds of tombstones in some SSTs by manual dumping). I think this also proves that the compactor is not the single root problem, as it's not running at that point in time (though maybe it is).

I think the smoke test (in which you simply don't lay down tombstones as suggested above) can help reason about above-RocksDB problems to some extent (at least we all agree that there shouldn't be anything going wrong in that case, except not actually deleting anything).