cockroachdb / cockroach


stability: Compaction queue causes major disruptions during chaos events #21528

Closed: a-robinson closed this issue 6 years ago

a-robinson commented 6 years ago

I've been running a cluster under chaos since late last week to test https://github.com/cockroachdb/cockroach/pull/21386, and found that we were experiencing considerable periods of 0 QPS and substantially reduced QPS after a downed node is considered dead (i.e. 5 minutes after it goes down). I followed a hunch and disabled the compaction queue via the diff below and found that cockroach performs admirably under chaos without it.

diff --git a/pkg/storage/engine/rocksdb.go b/pkg/storage/engine/rocksdb.go
index 7aa4fbe4f..b577e427b 100644
--- a/pkg/storage/engine/rocksdb.go
+++ b/pkg/storage/engine/rocksdb.go
@@ -849,7 +849,8 @@ func (r *RocksDB) Compact() error {

 // CompactRange forces compaction over a specified range of keys in the database.
 func (r *RocksDB) CompactRange(start, end roachpb.Key) error {
-       return statusToError(C.DBCompactRange(r.rdb, goToCSlice(start), goToCSlice(end)))
+       return nil
 }

My cluster has gone through dozens of bad chaos events with the compaction queue enabled, which all look something like this:

[screenshots: 2018-01-17 1:41:44 PM and 1:42:04 PM]

A good run, after having disabled manual rocksdb compactions, looks like this:

[screenshot: 2018-01-17 1:43:11 PM]

At a high level, the problem is that doing manual compactions during a period of heavy rebalancing puts even more load on an already loaded storage engine / disk (i.e. defaultCompactionMinInterval = 2 * time.Minute isn't large enough, but how large should it be?). At a lower level, I wonder if we're somehow using up all of rocksdb's available compaction/flush threads, and that's what's clogging things up.

We may be able to fix this simply by delaying compaction until after we haven't sent or received a snapshot for some amount of time, or the fix may be more involved. Either way, the fix needs to be verified under heavy load and chaos.
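
To make the "delay until snapshots have quieted down" idea concrete, here's a minimal sketch in Go of what such a gate could look like. Everything here (snapshotTracker, okToCompact, the 5-minute window) is made up for illustration and is not the actual CockroachDB code; the real compactor would presumably defer, not drop, a suggestion while the gate is closed.

// Hypothetical sketch: gate manual compactions on recent snapshot activity.
// Names here (snapshotTracker, okToCompact, etc.) are illustrative only and
// do not correspond to actual CockroachDB APIs.
package main

import (
	"fmt"
	"sync"
	"time"
)

// snapshotTracker remembers the last time this store sent or received a
// Raft snapshot.
type snapshotTracker struct {
	mu           sync.Mutex
	lastSnapshot time.Time
}

// recordSnapshot is called whenever a snapshot is sent or applied.
func (t *snapshotTracker) recordSnapshot() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSnapshot = time.Now()
}

// okToCompact reports whether enough quiet time has passed since the last
// snapshot for a manual compaction to proceed.
func (t *snapshotTracker) okToCompact(quiescePeriod time.Duration) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.lastSnapshot.IsZero() || time.Since(t.lastSnapshot) > quiescePeriod
}

func main() {
	var t snapshotTracker
	t.recordSnapshot()
	// In the compactor loop, a suggested compaction would simply be deferred
	// (not dropped) while snapshots are still flowing.
	fmt.Println("compact now?", t.okToCompact(5*time.Minute))
}

Deferring rather than dropping would keep the eventual space reclamation while moving it off the critical rebalancing path.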

a-robinson commented 6 years ago

If anyone wants to see the chaos events themselves, check out the alex-chaos roachprod cluster. I'll keep it running through tomorrow to make sure that we don't have any issues with the manual compactions disabled. So far it's only gone through a couple of large chaos events, which are enough to convince me that it's far better than running with manual compactions, but not enough to rule out other problems.

a-robinson commented 6 years ago

I'll put together a prototype of delaying compactions until snapshots have quieted down. It'll almost certainly fix this particular chaos scenario, but I'm still curious about what's going on at the rocksdb level. If manual compactions can clog things up, it'd be good to fully understand why and whether we can prevent it.

spencerkimball commented 6 years ago

Why is the compaction queue active when we’re up-replicating due to a dead node? Shouldn’t that just be creating new replicas, not deleting dead ones? The node with dead replicas is AWOL, right?


a-robinson commented 6 years ago

The node with dead replicas is AWOL, right?

Completely offline, yeah.

Why is the compaction queue active when we’re up-replicating due to a dead node? Shouldn’t that just be creating new replicas, not deleting dead ones?

Either there's a bug causing it to trigger when any replica is removed from a range, not just the replica on the local node, or there's more rebalancing being triggered by the dead node than just up-replicating. It's hard to tell from the UI, because it bunches together replica removals from dead nodes with replica removals from live nodes (shown below). I'll trawl through the system.rangelog to see if there were any removals from live stores.

a-robinson commented 6 years ago
[screenshot: admin UI graph of replica removals, 2018-01-17 2:59:25 PM]
a-robinson commented 6 years ago

I checked the range log, and in the incident screenshotted above (from around 18:06 to 18:21) there were no removals from any store other than the dead one, i.e. there was no rebalancing going on other than up-replicating.

I'll turn on more verbose compaction queue logging as you requested offline to see if that illuminates anything, but let me know if you have ideas and would like to take over.

Also, as a side note, it'd be nice if we didn't have to build a custom binary to get access to the rocksdb debug log. I'm not missing any vmodule settings that allow you to do that, am I?

benesch commented 6 years ago

Does --vmodule=rocksdb=3 not do it?

tbg commented 6 years ago

@a-robinson are you restarting nodes in order? If so, then a node that went down only actually removes all of its replicas and starts compacting the keyspace once it comes back up. So perhaps there's an off-by-one: you're not seeing problems due to the currently down node, but due to your other nodes having been down in the past. If you think this might be possible, it's easy enough to check - just make the node that gets killed always the same one.

I checked the code and found no issues with how the compactions are suggested. Curious what this is.

tbg commented 6 years ago

@a-robinson it's also worth logging the caller in SuggestCompaction and making the compactor.go logging a bit more noisy (so that we at least see an info log when .CompactRange is called).
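
As a rough illustration of that kind of logging (a sketch only: it uses the standard library log package and made-up function signatures rather than the real pkg/util/log and compactor plumbing), something along these lines would show who suggested each compaction and when a manual CompactRange actually fires:

// Hypothetical sketch of the extra logging: record the caller when a
// compaction is suggested and emit an info log when CompactRange runs.
// Function names and signatures are illustrative, not CockroachDB's.
package main

import (
	"fmt"
	"log"
	"runtime"
)

// suggestCompaction logs who asked for the compaction before queueing it.
func suggestCompaction(start, end string) {
	if _, file, line, ok := runtime.Caller(1); ok {
		log.Printf("compaction of [%s,%s) suggested by %s:%d", start, end, file, line)
	}
	// ...enqueue the suggestion as usual...
}

// compactRange logs every manual compaction it is about to issue.
func compactRange(start, end string) error {
	log.Printf("issuing manual CompactRange over [%s,%s)", start, end)
	// ...call into the engine here...
	return nil
}

func main() {
	suggestCompaction("/Table/51", "/Table/52")
	if err := compactRange("/Table/51", "/Table/52"); err != nil {
		fmt.Println(err)
	}
}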

a-robinson commented 6 years ago

Does --vmodule=rocksdb=3 not do it?

Yes it does, thanks.

@a-robinson are you restarting nodes in order? If so, then a node that went down only actually removes all of its replicas and starts compacting the keyspace once it comes back up. So perhaps there's an off-by-one: you're not seeing problems due to the currently down node, but due to your other nodes having been down in the past. If you think this might be possible, it's easy enough to check - just make the node that gets killed always the same one.

So based on the verbose compactor logs, this is partially right, but I still don't understand exactly what's happening. Manual compactions are happening quite frequently, almost constantly, on all nodes that have been restarted. Some take a few minutes, some take tens of minutes. A few happen around when an incident starts, but many others aren't correlated with incidents at all, and it's not obvious to me why some appear to cause incidents and others don't. Also, weirdly, it appears to be the end of compaction processing that correlates with outages starting. Here are the relevant compaction start/end events from the verbose logs of the cluster: https://gist.github.com/a-robinson/1ae0ad0abf973e7df5d3d940a6975714

I'm still not sure why these problems only seem to happen (or are at their worst, at the very least) when a node has recently been declared dead. It may just be that the up-replicating interacts badly with the compacting, but I don't have any real evidence to back that up.

tbg commented 6 years ago

What's your chaos timing? The compactor tries to aggregate compactions, but if the replicas get GC'ed over a period of time long enough to bridge the gap to the next chaos event (targeting that node), it's conceivable that you'd just always have something to compact on pretty much all the nodes. Since the node removes roughly all the data and then gets it back, you're running a spread-out full compaction (or worse, there might be overlap on the SST level) online whenever the node is up.

I wonder whether removing kForceCompactionFull (or whatever it's called) can help (though we should understand better what the big problem is in the first place).

a-robinson commented 6 years ago

The chaos timing was just ripped off from blue's old config:

0  0-23/6 * * * /home/cockroach/chaos_runner.sh 60 1200
48 1-23/6 * * * /home/cockroach/chaos_runner.sh 60 1200
36 2-23/6 * * * /home/cockroach/chaos_runner.sh 60 1200
24 3-23/6 * * * /home/cockroach/chaos_runner.sh 60 1200
12 4-23/6 * * * /home/cockroach/chaos_runner.sh 60 1200
0  5-23/6 * * * /home/cockroach/chaos_runner.sh 60 1200

i.e. a node is taken down for between 1 and 20 minutes every 1-2 hours, and it round robins through the 6 nodes.

The compactor tries to aggregate compactions, but if the replicas get GC'ed over a period of time long enough to bridge the gap to the next chaos event (targeting that node), it's conceivable that you'd just always have something to compact on pretty much all the nodes. Since the node removes roughly all the data and then gets it back, you're running a spread-out full compaction (or worse, there might be overlap on the SST level) online whenever the node is up.

I'm still not getting why this would get so drastically worse when a node is dead, but do get your point that this sort of chaos is somewhat pathological for the compaction queue.

a-robinson commented 6 years ago

Well, in fun news from last night, the cluster is now also occasionally experiencing 0 QPS periods even when a node isn't dead. The only changes from before are that there's more data in the cluster and I'm running with verbose compactor and rocksdb logging.

a-robinson commented 6 years ago

After looking through the rocksdb logs a bit this afternoon, it's clear that the manual compactions are overloading rocksdb and causing it to stall writes. For example, on one node in particular, there was a steady rise in the number of L0 files over a few minutes, happening at the same time as a series of very large (many hundreds of megabytes to low gigabytes) manual L6 compactions. Making matters even worse, they were clearly re-writing far more data than was beneficial, judging by the before and after byte counts. For example:

I180119 21:34:32.212306 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1466] [default] [JOB 982] Compacting 8@5 + 8@6 files to L6, score -1.00
I180119 21:34:32.212320 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1470] [default] Compaction start summary: Base version 981 Base level 5, inputs: [504390(354KB) 504391(206KB) 504392(20MB) 504393(21MB) 504396(18MB) 504397(394KB) 504398(34MB) 504399(25MB)], [503493(100MB) 503496(100MB) 503497(100MB) 503498(100MB) 503499(100MB) 503500(100MB) 503501(100MB) 503502(95MB)]
I180119 21:34:32.212348 17 storage/engine/rocksdb.go:74  EVENT_LOG_v1 {"time_micros": 1516397672212331, "job": 982, "event": "compaction_started", "files_L5": [504390, 504391, 504392, 504393, 504396, 504397, 504398, 504399], "files_L6": [503493, 503496, 503497, 503498, 503499, 503500, 503501, 503502], "score": -1, "input_data_size": 963669401}
...
I180119 21:34:47.181763 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1227] [default] [JOB 982] Compacted 8@5 + 8@6 files to L6 => 858611159 bytes
...
I180119 21:34:47.276610 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1466] [default] [JOB 983] Compacting 13@5 + 12@6 files to L6, score -1.00
I180119 21:34:47.276628 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1470] [default] Compaction start summary: Base version 982 Base level 5, inputs: [504400(105KB) 504401(103KB) 504402(42MB) 504403(200KB) 504404(403KB) 504405(340KB) 504406(18MB) 504407(334KB) 504408(5784KB) 504409(12MB) 504410(215KB) 504411(405KB) 504412(20MB)], [503503(100MB) 503504(100MB) 503505(100MB) 503506(100MB) 503507(100MB) 503508(100MB) 503509(100MB) 503510(100MB) 503511(100MB) 503514(100MB) 503515(100MB) 503591(100MB)]
I180119 21:34:47.276659 17 storage/engine/rocksdb.go:74  EVENT_LOG_v1 {"time_micros": 1516397687276639, "job": 983, "event": "compaction_started", "files_L5": [504400, 504401, 504402, 504403, 504404, 504405, 504406, 504407, 504408, 504409, 504410, 504411, 504412], "files_L6": [503503, 503504, 503505, 503506, 503507, 503508, 503509, 503510, 503511, 503514, 503515, 503591], "score": -1, "input_data_size": 1366981518}
...
I180119 21:35:10.044451 17 storage/engine/rocksdb.go:74 [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1227] [default] [JOB 983] Compacted 13@5 + 12@6 files to L6 => 1280350099 bytes

After a few minutes of this going on while new replicas were also being added to the node (due to the up-replicating of ranges from the dead node), the number of L0 files hit the configured limit (Options.level0_slowdown_writes_trigger) and rocksdb started stalling writes:

I180119 21:35:50.249950 17 storage/engine/rocksdb.go:74  [WARN] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:704] [default] Stalling writes because we have 20 level-0 files rate 16777216

I expect that the manual compaction made recovery from this slower than it would normally be -- left to its own devices, I'd imagine rocksdb would avoid unnecessary background compactions and focus on clearing out L0. In this case, though, rocksdb kept doing the work of the manual compactions even though foreground writes were stalled.

Rocksdb kept lowering the rate limit as the number of L0 files failed to drop -- in fact, it went up by one more, to 21:

I180119 21:35:59.413285 17 storage/engine/rocksdb.go:74  [WARN] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:704] [default] Stalling writes because we have 21 level-0 files rate 4398044

The rate limit appears to bottom out at merely 16KB per second:

I180119 21:37:00.451764 17 storage/engine/rocksdb.go:74  [WARN] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/column_family.cc:704] [default] Stalling writes because we have 21 level-0 files rate 16384

The rate limit ground things to a halt. Pretty much all write batches were taking more than a second to commit at this point, and the manual compactions were still happening.

Curiously, there were a lot of compactions of just a single file. If it's actually doing the reading/writing/deleting of files for these, which it looks like it is, it's a complete waste of resources given my understanding of rocksdb, and may be an issue with the manual compaction code. Here's one such compaction, but there were tons of these:

I180119 21:38:32.259160 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1466] [default] [JOB 1089] Compacting 1@6 files to L6, score -1.00
I180119 21:38:32.259171 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1470] [default] Compaction start summary: Base version 1088 Base level 6, inputs: [504551(100MB)]
I180119 21:38:32.259195 17 storage/engine/rocksdb.go:74  EVENT_LOG_v1 {"time_micros": 1516397912259181, "job": 1089, "event": "compaction_started", "files_L6": [504551], "score": -1, "input_data_size": 104981818}
I180119 21:38:33.976005 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1153] [default] [JOB 1089] Generated table #504682: 858157 keys, 104981818 bytes
I180119 21:38:33.976040 17 storage/engine/rocksdb.go:74  EVENT_LOG_v1 {"time_micros": 1516397913976023, "cf_name": "default", "job": 1089, "event": "table_file_creation", "file_number": 504682, "file_size": 104981818, "table_properties": {"data_size": 104883143, "index_size": 141137, "filter_size": 0, "raw_key_size": 25745066, "raw_average_key_size": 30, "raw_value_size": 87965487, "raw_average_value_size": 102, "num_data_blocks": 3209, "num_entries": 858157, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}
I180119 21:38:33.988387 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1227] [default] [JOB 1089] Compacted 1@6 files to L6 => 104981818 bytes
I180119 21:38:33.996218 17 storage/engine/rocksdb.go:74  (Original Log Time 2018/01/19-21:38:33.996083) EVENT_LOG_v1 {"time_micros": 1516397913996077, "job": 1089, "event": "compaction_finished", "compaction_time_micros": 1729159, "output_level": 6, "num_output_files": 1, "total_output_size": 104981818, "num_input_records": 858157, "num_output_records": 858157, "num_subcompactions": 1, "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [21, 0, 0, 1, 0, 9, 129]}
I180119 21:38:34.005968 17 storage/engine/rocksdb.go:74  EVENT_LOG_v1 {"time_micros": 1516397914005934, "job": 1089, "event": "table_file_deletion", "file_number": 504551}

All these symptoms went on for minutes, until rocksdb was finally able to get around to compacting the L0 files:

I180119 21:38:51.384973 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1466] [default] [JOB 1100] Compacting 21@0 + 1@3 files to L3, score 17.80
I180119 21:38:51.384993 17 storage/engine/rocksdb.go:74  [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_job.cc:1470] [default] Compaction start summary: Base version 1099 Base level 0, inputs: [504591(61MB) 504585(27MB) 504579(61MB) 504573(63MB) 504566(32MB) 504560(32MB) 504493(60MB) 504484(64MB) 504478(66MB) 504472(65MB) 504465(47MB) 504460(44MB) 504454(59MB) 504447(62MB) 504441(59MB) 504435(65MB) 504429(59MB) 504423(59MB) 504395(54MB) 504340(43MB) 504290(45MB)], [504252(16MB)]

After this L0 compaction, everything on both the node and the cluster came unblocked.


There are a number of potential avenues to continue with here.

  1. Why are manual compactions compacting 1 L6 file to 1 L6 file?
  2. Why are manual compactions freeing up so little space, as in the couple of examples above that only went from 963669401 to 858611159 bytes and from 1366981518 to 1280350099 bytes (reductions of only about 11% and 6%, respectively)?
  3. Given the questions above, are manual compactions being properly bounded to parts of the key space that we know will be freed up?
  4. Is there a way to prioritize manual compactions behind actual write batches and the compactions needed to allow them to continue? (A rough sketch of one possible gate follows this list.)
  5. Should we disable manual compactions during rebalancing? I'd naively think so, even given what @spencerkimball and @tschottdorf said above, because clearly the two combined can cause problems and we shouldn't need to rush compaction. If the answers to the above questions reveal other major problems, though, perhaps this isn't necessary.
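
To make point 4 (and the gating in point 5) concrete, here's a minimal sketch of deferring a suggested compaction while L0 is already backed up. The threshold, the numL0Files parameter, and the retry delay are all assumptions for illustration, not actual CockroachDB or RocksDB APIs; the real decision would presumably be driven by the engine's own stats.

// Hypothetical sketch: skip (and later retry) a manual compaction while the
// engine already has a large L0 backlog, i.e. while foreground writes are at
// risk of stalling. All names and constants are illustrative.
package main

import (
	"fmt"
	"time"
)

const (
	maxL0FilesForManualCompaction = 4
	retryDelay                    = 30 * time.Second
)

// maybeCompact decides whether a suggested compaction should run now or be
// retried later, based on a snapshot of the engine's L0 file count.
func maybeCompact(numL0Files int, compact func() error) (ranNow bool, err error) {
	if numL0Files > maxL0FilesForManualCompaction {
		// Foreground traffic is already generating compaction debt; let the
		// background compactions drain L0 first and try again later.
		return false, nil
	}
	return true, compact()
}

func main() {
	ran, err := maybeCompact(20, func() error { return nil })
	fmt.Printf("ran=%t err=%v (would retry in %s)\n", ran, err, retryDelay)
}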

I assume that @spencerkimball doesn't have much time to deal with this, despite having had time to introduce it :) @tschottdorf: you reviewed this -- would you like to take over, or should I continue with it? It's interrupted my work on https://github.com/cockroachdb/cockroach/issues/19985, but clearly we need to do something about it.

petermattis commented 6 years ago

Why are manual compactions compacting 1 L6 file to 1 L6 file?

I would guess the compaction of L6 to L6 is due to options.bottommost_level_compaction = rocksdb::BottommostLevelCompaction::kForce. We're forcing compactions of the bottommost level. I recall being concerned about this when the compaction queue was added, but don't recall the reason for leaving that flag set.

Is there a way to prioritize manual compactions behind actual write batches and the compactions needed to allow them to continue?

Not that I know of. There is SuggestCompactRange, which might be a little friendlier to foreground operations. We might also investigate CompactRangeOptions::exclusive_manual_compaction. That defaults to true and there might be benefit to setting it to false.

spencerkimball commented 6 years ago

Agree with @petermattis on the force to bottommost layer. Also, it would be good to introduce a mechanism where the compaction queue only runs when the system is largely quiesced. No sense in having it kick in when there is any significant activity, especially active rebalancing. I also think we can make the compaction queue less trigger happy than it is.

I still have some time to look at it, because I want to get some other compaction stuff finished.

a-robinson commented 6 years ago

We might also investigate CompactRangeOptions::exclusive_manual_compaction.

Yeah, turning that off would probably avoid the backlog of L0 files that build up until the manual compaction finishes.

I still have some time to look at it, because I want to get some other compaction stuff finished.

Alright, I'll back off and let you take over until I hear otherwise. Want me to leave the alex-chaos roachprod cluster around for you to use? It took a few days of running for the cluster to get enough data in it for these events to start happening.

a-robinson commented 6 years ago

It's far from conclusive, but I started using a new binary a couple of hours ago with CompactRangeOptions::exclusive_manual_compaction set to false, and the two chaos events since haven't caused any QPS dips.

Even if this turns out to be a sufficient fix, we still need someone to understand the manual compaction feature better and to make sure we avoid rewriting files as we're clearly doing today (per the logs pasted above).

a-robinson commented 6 years ago

There were no incidents all night with CompactRangeOptions::exclusive_manual_compaction set to false, so I'm feeling pretty good about that fixing the prioritization of foreground vs manual compactions. @spencerkimball is still going to take a closer look at some of the related problems, though.

spencerkimball commented 6 years ago

@a-robinson and I brainstormed yesterday and came up with an explanation for this. The chaos load will cause every node in the cluster to eventually get a replica for every range. After successive chaos events, each node will then end up with many of those replicas dead. When the data for a dead replica is cleared, a suggestion is added to the compactor's queue of suggested compactions. When a node comes back up, many of those replicas are rebalanced back onto it. Previously, we did nothing to take a suggestion out of the queue, which meant that over time the entire key range would become eligible for compaction, even though most of the data was actually present.
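
To sketch the direction this explanation points in (purely illustrative types and names, not the real compactor API): keep the suggested spans around, and drop any suggestion that overlaps a span whose data has been re-added, so that only spans whose data is actually gone remain eligible for compaction.

// Rough sketch of removing compaction suggestions whose key spans get their
// data back (e.g. via a rebalance onto this node). The types and method names
// are illustrative only.
package main

import "fmt"

type span struct{ start, end string }

type suggestionQueue struct {
	suggestions []span
}

// suggest records a span whose data was just cleared.
func (q *suggestionQueue) suggest(s span) {
	q.suggestions = append(q.suggestions, s)
}

// clearOverlapping removes suggestions that overlap a span whose data has been
// re-added, so the compactor won't compact key ranges that are full again.
func (q *suggestionQueue) clearOverlapping(readded span) {
	kept := q.suggestions[:0]
	for _, s := range q.suggestions {
		if s.end <= readded.start || readded.end <= s.start {
			kept = append(kept, s) // no overlap; keep it
		}
	}
	q.suggestions = kept
}

func main() {
	var q suggestionQueue
	q.suggest(span{"/Table/51", "/Table/52"})
	q.suggest(span{"/Table/60", "/Table/61"})
	q.clearOverlapping(span{"/Table/51", "/Table/52"})
	fmt.Println(q.suggestions) // only the /Table/60 suggestion remains
}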