cockroachdb / cockroach


storage: dropping a large table will brick a cluster due to compactions #24029

Closed: benesch closed this issue 6 years ago

benesch commented 6 years ago
$ roachprod create USER-FOO -n10
$ roachprod run USER-FOO 'mkdir -p /mnt/data1/cockroach && gsutil -m -q cp -r gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/stores=10/$((10#$(hostname | grep -oE [0-9]+$)))/* /mnt/data1/cockroach'

Wait ~10m for stores to download. Then drop the 2TiB table:

ALTER TABLE bank.bank EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 30}';
DROP TABLE bank.bank;

The cluster explodes a few minutes later as RocksDB tombstones pile up. I can no longer execute any SQL queries that read from/write to disk.

Very closely related to #21901, but thought I'd file a separate tracking issue.

/cc @spencerkimball

tbg commented 6 years ago

DistSender is presumably splitting this single operation into thousands of requests to the 10s of thousands of ranges in this table.

DistSender limits the parallelism, but I think it's to ~500. That's a lot if each of these requests does work. Not out of the question that there are cascading failures here.

petermattis commented 6 years ago

I think a CPU profile showed activity in a method related to tombstones, and upstream confirms that these are handled in a wildly inefficient manner (plus, we found hundreds of tombstones in some SSTs by manual dumping).

We're pretty much only creating range tombstones on the cluster in the clearrange roachtest. Seems reasonable that behavior would result in sstables with hundreds of range tombstones.

I think this also proves that the compactor is not the single root problem, as it's not running at that point in time (though maybe it is).

By compactor, do you mean RocksDB compactions, or the compaction queue?

I think the smoke test (in which you simply don't lay down tombstones as suggested above) can help reason about above-RocksDB problems to some extent (at least we all agree that there shouldn't be anything going wrong in that case, except not actually deleting anything).

I'll definitely try that test soon.

petermattis commented 6 years ago

I think the smoke test (in which you simply don't lay down tombstones as suggested above) can help reason about above-RocksDB problems to some extent (at least we all agree that there shouldn't be anything going wrong in that case, except not actually deleting anything).

Ok, I just tried this out and the clearrange roachtest passed without problem and the cluster looked good the whole time. The ClearRange operation was broken into 43777 RPCs which took 4m9s to complete. One thing I had noticed in past runs is that the consistency queue processing time was very high (10s of seconds), but during this run the consistency queue processing time was in the low hundreds of milliseconds. Not sure if this is relevant or not. Perhaps the consistency queue scans are hitting the large number of range tombstones and consuming excessive CPU.

petermattis commented 6 years ago

@tschottdorf I also tried this more minimal diff to your suggestion:

diff --git a/pkg/storage/batcheval/cmd_clear_range.go b/pkg/storage/batcheval/cmd_clear_range.go
index d2a95cc73..3ff8797d7 100644
--- a/pkg/storage/batcheval/cmd_clear_range.go
+++ b/pkg/storage/batcheval/cmd_clear_range.go
@@ -104,9 +104,9 @@ func ClearRange(
                        },
                },
        }
-       if err := batch.ClearRange(from, to); err != nil {
-               return result.Result{}, err
-       }
+       // if err := batch.ClearRange(from, to); err != nil {
+       //      return result.Result{}, err
+       // }
        return pd, nil
 }

That is, we simply omit the ClearRange RocksDB operation, while doing everything else the same. This too allows the clearrange roachtest to succeed. Some graphs:

[screenshots: 2018-05-28 3:15-3:16 pm]

Logical bytes per store steadily decreases on the cluster, but because we're not actually deleting the data the disk used bytes doesn't change.

petermattis commented 6 years ago

Perhaps the consistency queue scans are hitting the large number of range tombstones and consuming excessive CPU.

I whacked the consistency queue code to force it off and the cluster still wedged. I also took a CPU profile on one of the nodes before the cluster wedged and I can see an excessive amount of time being spent in the RocksDB AddTombstones call. My new theory, which I'm going to try and validate: we're adding hundreds of range tombstones per second which are winding up in the memtable and L0 tables. But each ClearRange RPC is also doing a stats computation which creates an iterator and needs to process those tombstones (via a call to AddTombstones). We might be better served by delaying the RocksDB ClearRange operation until immediately before performing the compaction, though that would need to interact properly with the consistency queue.

tbg commented 6 years ago

Yep, I came to a similar conclusion, though my prototype for this was too simplistic. Getting the interaction with the gap spanning compactions right is nontrivial, but might be time to do it. When I naively cleared the span that was going to be compacted, it caused assertions to fail.


petermattis commented 6 years ago

Yep, I came to a similar conclusion, though my prototype for this was too simplistic. Getting the interaction with the gap spanning compactions right is nontrivial, but might be time to do it. When I naively cleared the span that was going to be compacted, it caused assertions to fail.

Can you describe your prototype in more detail? (Or point to an existing explanation which I might have missed). Trying to reduce the number of range tombstones feels hacky and fragile. I'd rather put effort into fixing RocksDB to support range tombstones more efficiently (if that is the problem). Or figure out a way to optimistically skip writing range tombstones into the memtable (if range tombstones in L0 is the problem). Now that I'm up to speed on this issue, I think it would be worthwhile to brainstorm next steps tomorrow.

benesch commented 6 years ago

I’ll voice my support for patching RocksDB. I think it will be only approximately as much effort as trying to work around the issue in Cockroach.


petermattis commented 6 years ago

I'm running into something strange in my latest experiment. I've added a SuggestedCompaction.Clear field which indicates to the compactor that the span should be cleared before being compacted. And when that field is set, we always perform the compaction. Strangely, the compaction queue doesn't seem to be running in a timely fashion. It's definitely not disabled. I'm adding more logging to figure out what is going on.

PS The idea behind this experiment is to add the range tombstones and immediately flush them from the memtable and compact the associated range with the hope being that they don't affect concurrent operations that access L0.

petermattis commented 6 years ago

The strangeness mentioned in my previous comment was a bug. Making the compaction queue perform the RocksDB ClearRange immediately before the compaction seems to help as the concurrent scan operation in the clearrange roachtest is not getting wedged. On the other hand, the cluster doesn't seem to be reclaiming disk space. That could be indicative of some other badness going on. Or perhaps due to the compaction queue not forcing a bottommost-level compaction.

tbg commented 6 years ago

Can you describe your prototype in more detail? (Or point to an existing explanation which I might have missed). Trying to reduce the number of range tombstones feels hacky and fragile.

I wasn't suggesting implementing that properly, but was trying to figure out whether it would solve the problem. I agree that fixing this upstream is the most desirable option, but it's miles away from something I (and I assume, you) can conceivably prototype. To get an idea of whether it can fix this, you can configure our RocksDB reads to skip range tombstones (there's an option somewhere).

I've added a SuggestedCompaction.Clear field which indicates to the compactor that the span should be cleared before being compacted.

That's pretty much what I referred to as my "prototype" before, except I did it in a worse way.

Strangely, the compaction queue doesn't seem to be running in a timely fashion. It's definitely not disabled.

It will always wait for at least two minutes from the first suggestion, by design. Were you seeing longer pauses?

My experiment branch is here: https://github.com/tschottdorf/cockroach/tree/roachtest/drop2tb-wip

tbg commented 6 years ago

That is, we simply omit the ClearRange RocksDB operation, while doing everything else the same. This too allows the clearrange roachtest to succeed. Some graphs:

Wait, and this is with the compactor enabled? That's surprising, since the compactions should at least somewhat impair the cluster; they're now pretty expensive. You're basically compacting 2 TB of the full keyspace (though it's already all at L6, so maybe that makes it close to a no-op?). I think the compactions, when doing actual work, would show up. Maybe taking out ClearRange is accidentally masking an additional concern coming from the compactor.

tbg commented 6 years ago

PS in my experiments it sometimes took >10 minutes after the ClearRange until things started falling apart. That is, I would generally like to see graphs on somewhat longer timescales -- that's when the compaction pressure really shows -- just to make sure it doesn't go downhill after we stopped looking.

petermattis commented 6 years ago

Wait, and this is with the compactor enabled? That's surprising, for the compactions should at least somewhat impair the cluster, as they're now pretty expensive. You're basically compacting 2tb of full keyspace (though it's already all at L6, maybe that makes it close to a noop?). I think the compactions, when doing actual work, would show up. Maybe taking out ClearRange is accidentally masking additional concern coming from the compactor.

Yes, this was with the compactor enabled. It is also possible I did something wrong in the experiment.

PS in my experiments it sometimes took >10 minutes after the ClearRange until things started falling apart. That is, I would generally like to see graphs on somewhat longer timescales -- that's when the compaction pressure really shows -- just to make sure it doesn't go downhill after we stopped looking.

I've seen the same behavior where sometimes the cluster wedges quickly and other times it takes many minutes for it to fall apart.

petermattis commented 6 years ago

It will always wait for at least two minutes from the first suggestion, by design. Were you seeing longer pauses?

No, I had a simple bug. I wasn't seeing any of my "clear range" compactions because I had added SuggestedCompaction.Clear which isn't persisted. I needed to move that to SuggestedCompaction.Compaction.Clear.

petermattis commented 6 years ago

My latest experiment was an extension of the SuggestedCompaction.Compaction.Clear approach. The ClearRange KV operation is changed to not perform a RocksDB ClearRange operation, but to instead create a suggested compaction with the Clear bit set. The compaction queue is modified to not aggregate compactions with the Clear bit set (this part was important to avoid multi-GB compactions). Before performing a "clearing" compaction, a RocksDB ClearRange operation is performed and then the memtable is flushed. I extended the clearrange roachtest to run for an hour and the concurrent COUNT(*) operations never encountered problems.
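
For readers following the thread, here's a rough sketch of the shape of that experiment. These are not the real pkg/storage/compactor types; Engine and Suggestion are simplified stand-ins, and the real change also has to persist suggestions and coordinate with the consistency queue:

package compactorsketch

// Engine is a simplified stand-in for the RocksDB-backed storage engine.
type Engine interface {
	ClearRange(start, end []byte) error   // write a RocksDB range deletion tombstone
	Flush() error                         // flush the memtable to L0
	CompactRange(start, end []byte) error // manual compaction over [start, end)
}

// Suggestion is a simplified stand-in for a suggested compaction. Clear marks
// a suggestion produced by the ClearRange KV command, which no longer writes
// the RocksDB range tombstone itself.
type Suggestion struct {
	Start, End []byte
	Clear      bool
}

// processSuggestion mirrors the experiment described above: for a "clearing"
// suggestion, write the range tombstone only now, immediately flush it out of
// the memtable, and compact the span so the tombstone is resolved before it
// can slow down iterator creation elsewhere. Clearing suggestions are never
// aggregated with their neighbors, keeping each compaction small.
func processSuggestion(eng Engine, s Suggestion) error {
	if s.Clear {
		if err := eng.ClearRange(s.Start, s.End); err != nil {
			return err
		}
		if err := eng.Flush(); err != nil {
			return err
		}
	}
	return eng.CompactRange(s.Start, s.End)
}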

[screenshot: 2018-05-29 11:50 am]

Live bytes on the cluster very quickly dropped to 0.

[screenshot: 2018-05-29 11:50 am]

Strangely, used bytes (which is computed based on the actual disk usage) increased. A more recent snapshot of this graph shows that used space is now decreasing rapidly.

[screenshots: 2018-05-29 11:57-11:58 am]

Looks like RocksDB decided to start compacting, though I'm not sure what triggered that. Perhaps the compaction queue. I need to poke around more.

[screenshot: 2018-05-29 11:50 am]

RocksDB sstables were decreasing, though I believe this metric counts the number of "live" sstables.

[screenshot: 2018-05-29 11:50 am]

The number of open file descriptors didn't decrease with the number of sstables. This suggests some sort of sstable leak.

petermattis commented 6 years ago

I tried flushing the RocksDB memtable whenever a replica proposal contains suggested compactions. The idea being that the suggested compactions correspond to a batch containing a RocksDB ClearRange operation and it is beneficial to flush that operation. The cluster wedged quickly with this experiment. I'm going to add some debug logs to RocksDB to validate my understanding of what is going on at the RocksDB level.

petermattis commented 6 years ago

Investigation of RocksDB shows that RangeDelAggregator causes significant performance overhead to iterator creation when there are lots of tombstones in the memtable. The benchmark numbers below show how iterator creation/destruction slows down as the memtable goes from 0-10000 tombstones:

name                                              time/op
RocksDBDeleteRangeNewIterator/tombstones=0-8      1.26µs ± 4%
RocksDBDeleteRangeNewIterator/tombstones=1-8      2.54µs ± 2%
RocksDBDeleteRangeNewIterator/tombstones=10-8     5.57µs ± 2%
RocksDBDeleteRangeNewIterator/tombstones=100-8    50.8µs ± 3%
RocksDBDeleteRangeNewIterator/tombstones=1000-8    673µs ± 2%
RocksDBDeleteRangeNewIterator/tombstones=10000-8  9.19ms ± 2%
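
To make the shape of this benchmark concrete, here is a simplified sketch of what it measures. This is not the actual benchmark code; the engine interface and the open constructor are stand-ins for the RocksDB-backed engine in pkg/storage/engine:

package enginesketch

import (
	"fmt"
	"testing"
)

// tombstoneEngine lists only the operations the sketch needs; it stands in
// for the real RocksDB-backed engine.
type tombstoneEngine interface {
	ClearRange(start, end []byte) error                   // writes a range deletion tombstone
	NewIterator(lower, upper []byte) interface{ Close() } // builds an iterator over [lower, upper)
	Close()
}

// benchmarkNewIterator writes n range tombstones into the memtable and then
// times iterator creation/destruction over an unrelated span. The slowdown in
// the table above comes from RangeDelAggregator::AddTombstones being invoked
// for every tombstone in the memtable and L0 on each iterator construction,
// even though the span being read contains none of them.
func benchmarkNewIterator(b *testing.B, open func(testing.TB) tombstoneEngine, n int) {
	eng := open(b)
	defer eng.Close()
	for i := 0; i < n; i++ {
		start := []byte(fmt.Sprintf("tomb-%06d-a", i))
		end := []byte(fmt.Sprintf("tomb-%06d-b", i))
		if err := eng.ClearRange(start, end); err != nil {
			b.Fatal(err)
		}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		eng.NewIterator([]byte("unrelated-a"), []byte("unrelated-b")).Close()
	}
}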

Note that in the clearrange test there are 43k ranges spread across 10 nodes, so ~13k replicas per node. Since there is no other traffic on the system, we can expect that the memtables on each of these nodes will contain ~13k tombstones (once the ClearRange operation finishes).

Note that the above slowdown is all CPU spinning. I ran another experiment, the kv95/encrypt=false/nodes=1 roachtest, with a cpu spin-loop in DBNewIter that varied from no-delay to a 10ms delay to see what this did to performance:

0us      21745 ops/sec
1us      21980 ops/sec
10us     21097 ops/sec
100us    16823 ops/sec
1ms      5853  ops/sec
10ms     799   ops/sec

When doing this experiment, CPU was absolutely pegged on my machine which matches up with what I've seen when running the clearrange test. I'm pretty confident at this point that usage of a large number of range tombstones is the core of this problem. And there is an acute problem with the existence of large numbers of range tombstones in the memtable and L0 because such tombstones result in significant cpu usage during iterator creation even when accessing unrelated parts of the key space. We can either attempt to fix RocksDB (see #26323), or workaround this limitation by being much more careful in how we create tombstones.

petermattis commented 6 years ago

I'm not sure what to make of this: I'm seeing DBCompactRange calls never return. I spotted one goroutine that had been blocked for 19min in a DBCompactRange call for a span of keys that was supposedly covering 32MiB. A CPU profile showed that RocksDB was churning CPU in snappy::internal::CompressFragment. It is possible that DBCompactRange has some sort of logic bug and is either looping or is trying to compact a range of keys significantly larger than requested.

petermattis commented 6 years ago

DBCompactRange breaks up a large compaction request into a series of smaller requests. It appears to have a bug in that splitting logic, in how it handles the beginning and ending keys of the span to compact. This is likely having a very bad effect on 2.0 performance whenever the compaction queue requests a compaction.
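
The actual splitting lives in C++ in libroach's DBCompactRange; as a rough Go model of what it is supposed to do (not the real code), the tricky part is that the first and last chunks must be clamped to the requested start and end keys rather than extended to neighboring sstable boundaries or to the ends of the keyspace:

package compactsketch

import "bytes"

// splitCompaction models breaking a manual compaction over [start, end) into
// smaller pieces at sstable boundaries. Only boundaries strictly inside the
// requested span are used; the first chunk begins at the requested start key
// and the last chunk ends at the requested end key, so no chunk rewrites data
// outside the span that was asked for.
func splitCompaction(start, end []byte, boundaries [][]byte) [][2][]byte {
	var inner [][]byte
	for _, b := range boundaries {
		if bytes.Compare(b, start) > 0 && bytes.Compare(b, end) < 0 {
			inner = append(inner, b)
		}
	}
	var chunks [][2][]byte
	prev := start
	for _, b := range inner {
		chunks = append(chunks, [2][]byte{prev, b})
		prev = b
	}
	return append(chunks, [2][]byte{prev, end})
}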

petermattis commented 6 years ago

Progress. The graphs below are from a run of the clearrange roachtest with the DBCompactRange bug fix and ClearRange performed by the compaction queue instead of by the ClearRange operation (this avoids a build up of thousands of range deletion tombstones).

[screenshots: 2018-06-03 9:57 pm]

This is the first time I've seen the Used bytes metric decreasing. Logging shows that the compaction queue is steadily churning through the ClearRange compactions.

[screenshots: 2018-06-03 9:57-9:59 pm]

Note that the compaction queued bytes metric is only updated at the end of each cycle of processing suggested compactions. That is easy to fix and would give a more accurate picture of the progress the compaction queue is making.

tbg commented 6 years ago

That looks promising! I knew something was up in CompactRange. Glad you've found a bug there.

For packing this up into something we can comfortably ship in 2.0.x, a challenge is dealing with the consistency violations created by deferring ClearRange to right before the compactions. I have the following suggestion there, though I don't know if it performs well enough - we'd have to try:

  1. reduce the compaction queue min idle time from two minutes to something like ten seconds
  2. perform engine.ClearRange from the ClearRange RPC, but tie it a little closer to the compaction queue. Instead of simply returning after the suggestion has been made, we block (below Raft per replica) until the compaction queue has processed the suggestion, and if the suggestion was discarded, force the compaction (or put a bit on the suggestion that avoids it being discarded; we 100% want it to happen in this case to get rid of the ranged tombstone).
  3. this should still make steady progress: DistSender can have up to ~100 requests in flight concurrently, and it also dispatches these requests in ascending key order.
  4. to combat the case in which the schema changer 5 minute timeout is exceeded, make the ClearRange RPC a no-op if there's nothing to clear (we should do that anyway).

This is definitely not ideal, but perhaps it's good enough for a patch release. Working around the consistency issues is likely trickier and implies marking ranges as nuked, which is problematic since we're also trying to start merging ranges.

petermattis commented 6 years ago

perform engine.ClearRange from the ClearRange RPC, but tie it a little closer to the compaction queue. Instead of simply returning after the suggestion has been made, we block (below Raft per replica) until the compaction queue has processed the suggestion, and if the suggestion was discarded, force the compaction (or put a bit on the suggestion that avoids it being discarded; we 100% want it to happen in this case to get rid of the ranged tombstone).

Blocking below Raft makes me anxious. There shouldn't be any other activity on these ranges, but I wonder what unexpected activity we'll see that could cause this to be problematic.

This is definitely not ideal, but perhaps it's good enough for a patch release. Working around the consistency issues is likely trickier and implies marking ranges as nuked, which is problematic since we're also trying to start merging ranges.

Cc @benesch for his insight here. I was thinking we'd have a marker key that indicated a Range had been cleared and could not be written to again. I haven't thought through this very far. Perhaps this will be its own can of worms.

vivekmenezes commented 6 years ago

An option here is to revert the use of ClearRange() for 2.0 and get this in better shape for 2.1

petermattis commented 6 years ago

An option here is to revert the use of ClearRange() for 2.0 and get this in better shape for 2.1

I believe @tschottdorf tried this and not using ClearRange still left the cluster in a bad state. We should double check this.

benesch commented 6 years ago

Yes, he did. See: https://github.com/cockroachdb/cockroach/issues/24029#issuecomment-374989269

benesch commented 6 years ago

Notice that the timestamps are ~5min apart. I believe this corresponds to the 5min SchemaChangeLeaseDuration. I haven't verified it yet, but I'd be willing to bet that the ClearRange operation is taking longer than 5min to complete which results in the schema change lease for this table being picked up by another node which then executes another ClearRange operation which will take longer than 5min and on and on.

We have schema changes that take much, much longer than five minutes. We just need to call sc.ExtendLease periodically while the ClearRange commands are occurring. We'll presumably want to plumb some notion of "progress" into the schema changer so that we don't extend the lease indefinitely if a ClearRange command really does wedge.

Now that the compaction queue is fixed, can we solve this with just an aggressive rate limit? If the schema changer did its own splitting of the ClearRange command, instead of relying on dist sender, it could send only n ClearRange commands per node per minute.

bdarnell commented 6 years ago

Blocking below raft (where exactly?) seems like a bad idea. And reducing the idle time of the compaction queue might increase wasted work (the reason it's so high is to allow multiple compactions to be coalesced to minimize the amount of data rewritten; remember that even though these ranges are empty, they're smaller than an sstable, so there may be non-deleted data next to them).

Now that the compaction queue is fixed, can we solve this with just an aggressive rate limit? If the schema changer did its own splitting of the ClearRange command, instead of relying on dist sender, it could send only n ClearRange commands per node per minute.

This seems like the most viable option to me.

vivekmenezes commented 6 years ago

I'll work on a change to read all the ranges that are affected by ClearRange and then issue ClearRange on chunks of ranges (perhaps 1 to start with), with calls to sc.ExtendLease in between. This is similar to how distsql does physical planning today.
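
A minimal sketch of that plan, under the assumption that the schema changer can enumerate the table's ranges; sendClearRange and extendLease are hypothetical hooks standing in for the real KV request and sc.ExtendLease plumbing:

package dropsketch

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// clearTableInChunks issues ClearRange over small chunks of the table's
// ranges, extending the schema change lease and pausing between chunks so the
// cluster is never asked to absorb tens of thousands of range tombstones at
// once.
func clearTableInChunks(
	ctx context.Context,
	ranges []roachpb.Span, // spans of the ranges covering the dropped table
	chunkSize int, // e.g. 1 range per chunk to start with
	sendClearRange func(context.Context, roachpb.Span) error, // hypothetical
	extendLease func(context.Context) error, // wraps sc.ExtendLease
	pause time.Duration, // pacing between chunks
) error {
	for i := 0; i < len(ranges); i += chunkSize {
		end := i + chunkSize
		if end > len(ranges) {
			end = len(ranges)
		}
		for _, sp := range ranges[i:end] {
			if err := sendClearRange(ctx, sp); err != nil {
				return err
			}
		}
		if err := extendLease(ctx); err != nil {
			return err
		}
		select {
		case <-time.After(pause):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}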

petermattis commented 6 years ago

Just chatted with @benesch. He's going to work on the 2.0.x mitigation and has a few experiments to run, such as directly invoking DBCompactRange after committing a batch containing a ClearRange, flushing the memtable after committing a batch containing a ClearRange, and prioritizing compaction of sstables with too many range deletion tombstones (#24057), which didn't previously show any benefit, but that benefit might have been obscured by the DBCompactRange bug. We're at the point now that we understand the outlines of what needs to be done, but need some more experimentation to determine which approach both works and is not too intrusive.

In parallel, I'm going to work on fixing range tombstones upstream in RocksDB. I doubt this work will be appropriate for back-porting to 2.0.x, but we'll know more in a week or two.

benesch commented 6 years ago

Ok, there's something very broken with the clearrange store dirs, and perhaps more broadly with whatever was used to generate them. On a hunch I applied this diff to the test to remove the DROP TABLE:

diff --git a/pkg/cmd/roachtest/clearrange.go b/pkg/cmd/roachtest/clearrange.go
index 70ace8262..106f48283 100644
--- a/pkg/cmd/roachtest/clearrange.go
+++ b/pkg/cmd/roachtest/clearrange.go
@@ -69,16 +69,6 @@ func registerClearRange(r *registry) {
                    return err
                }

-               // Set a low TTL so that the ClearRange-based cleanup mechanism can kick in earlier.
-               // This could also be done after dropping the table.
-               if _, err := conn.ExecContext(ctx, `ALTER TABLE bank.bank EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 30}'`); err != nil {
-                   return err
-               }
-
-               if _, err := conn.ExecContext(ctx, `DROP TABLE bank.bank`); err != nil {
-                   return err
-               }
-
                // Spend a few minutes reading data with a timeout to make sure the DROP
                // above didn't brick the cluster.
                //

Shouldn't be any compactions happening in that cluster, right?

$ roachprod run benesch-clearrange -- 'grep compactor.go logs/cockroach.log'
benesch-clearrange: grep compactor.go logs/cock... 10/10
   1: 
Process exited with status 1
   2: I180605 05:38:18.680504 169 storage/compactor/compactor.go:326  [n8,s8,compactor] processing compaction #20-92/463 (/Table/51/1/467222/0-/Table/51/1/2495965/0) for 427 MiB (reasons: size=true used=false avail=false)
I180605 05:38:19.210780 169 storage/compactor/compactor.go:342  [n8,s8,compactor] processed compaction #20-92/463 (/Table/51/1/467222/0-/Table/51/1/2495965/0) for 427 MiB in 530.241647ms
I180605 05:38:20.107331 169 storage/compactor/compactor.go:326  [n8,s8,compactor] processing compaction #345-373/463 (/Table/51/1/40766258/0-/Table/51/1/40860289) for 427 MiB (reasons: size=true used=false avail=false)
I180605 05:40:04.513392 169 storage/compactor/compactor.go:342  [n8,s8,compactor] processed compaction #345-373/463 (/Table/51/1/40766258/0-/Table/51/1/40860289) for 427 MiB in 1m44.405947506s
   3: I180605 05:38:54.126078 1849 storage/compactor/compactor.go:326  [n4,s4,compactor] processing compaction #419-1441/1442 (/Table/51/1/18669166-/Max) for 4.4 GiB (reasons: size=true used=false avail=false)
   4: 
Process exited with status 1
   5: 
Process exited with status 1
   6: I180605 05:38:32.403214 859 storage/compactor/compactor.go:326  [n5,s5,compactor] processing compaction #55-156/1843 (/Table/51/1/1364893/0-/Table/51/1/3487708) for 651 MiB (reasons: size=true used=false avail=false)
I180605 05:46:38.469327 859 storage/compactor/compactor.go:342  [n5,s5,compactor] processed compaction #55-156/1843 (/Table/51/1/1364893/0-/Table/51/1/3487708) for 651 MiB in 8m6.066106127s
I180605 05:46:40.070311 859 storage/compactor/compactor.go:326  [n5,s5,compactor] processing compaction #221-1842/1843 (/Table/51/1/4610515-/Max) for 5.8 GiB (reasons: size=true used=false avail=false)
   7: 
Process exited with status 1
   8: 
Process exited with status 1
   9: I180605 05:38:52.913420 915 storage/compactor/compactor.go:326  [n2,s2,compactor] processing compaction #39-75/1095 (/Table/51/1/2839711-/Table/51/1/3859338/0) for 257 MiB (reasons: size=true used=false avail=false)
I180605 05:41:04.507447 915 storage/compactor/compactor.go:342  [n2,s2,compactor] processed compaction #39-75/1095 (/Table/51/1/2839711-/Table/51/1/3859338/0) for 257 MiB in 2m11.594000298s
I180605 05:41:05.053543 915 storage/compactor/compactor.go:326  [n2,s2,compactor] processing compaction #239-1094/1095 (/Table/51/1/10569964-/Max) for 3.9 GiB (reasons: size=true used=false avail=false)
  10: I180605 05:38:20.757556 232 storage/compactor/compactor.go:326  [n9,s9,compactor] processing compaction #87-117/218 (/Table/51/1/8422088-/Table/51/1/10253285/0) for 288 MiB (reasons: size=true used=false avail=false)
I180605 05:38:20.870035 232 storage/compactor/compactor.go:342  [n9,s9,compactor] processed compaction #87-117/218 (/Table/51/1/8422088-/Table/51/1/10253285/0) for 288 MiB in 112.445892ms
Error:  Process exited with status 1

Pay special attention to the compactions that extend to /Max:

  • /Table/51/1/18669166-/Max
  • /Table/51/1/4610515-/Max

Sure enough, these pin read/write I/O at approximately ~75MB/s on the affected nodes, since one RocksDB thread is rewriting basically all the data in the store, constantly. This doesn't impact performance enough on its own to fail the roachtest, but you can bet that it's not helping matters. DROP TABLE can't rely on the compaction queue to purge range deletions on the affected nodes because the compaction queue is busy rewriting all the data in the darn store.

So where are these tombstones coming from? I have a hunch (https://github.com/cockroachdb/cockroach/blob/93da1e6622fd1d943fa93822a38510ce7143226e/pkg/storage/replica.go#L789-L796). I suspect that during the RESTORE to generate this data the last range at some point in time, say [/Table/51/1/4610515-/Max), got rebalanced from one node to another, leaving a deletion tombstone and compaction suggestion in its place. Unfortunately RESTORE then split that range approximately a billion times, installed hundreds of gigabytes of data in that area of the keyspace, and 3/10 of those replicas ended up on the node with the compaction suggestion. Installing these compaction suggestions when rebalancing seems fraught with peril.

I'll bet this is part of the reason that none of our strategies, until Peter's, made any dent in the problem. Anything that relies on the compaction queue to compact anything in a timely fashion is hopeless. (~One of the affected nodes is still compacting after 30m.~ It took 31m1s.) Peter (inadvertently?) turned off the compaction queue for preexisting suggestions and skirted this nastiness.

Another point of confusion: aren't these compaction suggestions supposed to expire after 24h? How are these suggested compactions still relevant given that the store dirs were generated months ago?

benesch commented 6 years ago

Another point of confusion: aren't these compaction suggestions supposed to expire after 24h? How are these suggested compactions still relevant given that the store dirs were generated months ago?

Oh. Suggestions only age out if they're unactionable for 24h. A two month old suggestion is still valid as long as you can act on it as soon as you see it.

tbg commented 6 years ago

Yowza. Good find @benesch. We should discard compactions immediately when they're above 2x the max age (between 1x and 2x the max age might still be good to compact because of the way the compactor times itself -- once it skips something once, it potentially sleeps for the max age).

tbg commented 6 years ago

Or rather, forget the 2x thing. Just make the main compaction queue loop timer maxAge/2 instead of maxAge as it is today. Then we can categorically not process suggestions older than maxAge.
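
A minimal sketch of that rule, assuming the compactor's existing 24h max age:

package compactorsketch

import "time"

// With the processing loop waking every maxAge/2, any suggestion still
// unprocessed after maxAge has had at least two chances to run, so the queue
// can categorically drop it instead of acting on a months-old suggestion.
const maxSuggestionAge = 24 * time.Hour

func loopInterval() time.Duration { return maxSuggestionAge / 2 }

func shouldDiscard(now, suggestedAt time.Time) bool {
	return now.Sub(suggestedAt) > maxSuggestionAge
}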

petermattis commented 6 years ago

Nice find. I had noticed these existing suggested compactions (that’s why I disabled non-Clear compactions in the compaction queue) but wanted to focus on the ClearRange problem. I meant to revisit but this slipped my mind.

How many bugs will this issue uncover?

PS I'm tracking down another one: the tiny gap between suggested compactions. Definitely not in the compactor code.


tbg commented 6 years ago

Definitely not in the compactor code.

Must be DistSender?

knz commented 6 years ago

Discussed on Slack. An alternative is to have the SQL code rename the table upon DROP into some "trash" database, then issue a background job to delete the data. The background job could use smaller ClearRange requests (over subsets of the table) and alternate ClearRange with explicit compaction requests.

Also the work could be paced with some intermediate delay so that it is spread over a long period of time, to alleviate the performance impact on concurrent queries.

petermattis commented 6 years ago

DistSender seemed to be doing the right thing when I tried to reproduce on a smaller test scenario. Might be something specific to the clearrange store dirs or how they were generated (i.e., RESTORE/IMPORT).


benesch commented 6 years ago

PS I'm tracking down another one: the tiny gap between suggested compactions. Definitely not in the compactor code. DistSender seemed to be doing the right thing when I tried to reproduce on a smaller test scenario. Might be something specific to the clearrange store dirs or how they were generated (i.e., RESTORE/IMPORT).

This is ringing a distant bell about a RESTORE/IMPORT bug. I can't track it down, though.

Any hunches on how RESTORE/IMPORT could mis-generate data to cause this behavior? The range descriptors line up, right?

An alternative is to have the SQL code rename the table upon DROP into some "trash" database, then issue a background job to delete the data.

This happens already! Except that the background job (the schema changer) issues all the ClearRange requests at once. I have a WIP PR to throttle the ClearRange requests.

benesch commented 6 years ago

Erm, seems like this data set might just have a bunch of garbage ranges that span from, e.g., /Table/51/1/948949{-/0}.

This data set is pretty old, right? Not the store dirs, but the backup that they're generated from. Perhaps it's time to throw it away and generate a new one from scratch.

petermattis commented 6 years ago

Any hunches on how RESTORE/IMPORT could mis-generate data to cause this behavior? The range descriptors line up, right?

No hunches. I am going to check the range descriptors this morning.

petermattis commented 6 years ago

Erm, seems like this data set might just have a bunch of garbage ranges that span from, e.g., /Table/51/1/948949{-/0}.

I think we're seeing this on tpcc fixtures too. See https://github.com/cockroachdb/cockroach/issues/26375.

petermattis commented 6 years ago

The range descriptors line up. The problem is the empty ranges (see #26375). The clearrange data set (store dirs created from a bank fixture) shows:

root@localhost:26257/defaultdb> select * from [show testing_ranges from table bank.bank] limit 10;
+-----------+----------+----------+----------+--------------+
| Start Key | End Key  | Range ID | Replicas | Lease Holder |
+-----------+----------+----------+----------+--------------+
| NULL      | /0/0     |       22 | {1,3,5}  |            1 |
| /0/0      | /3273    |       37 | {1,3,5}  |            1 |
| /3273     | /3273/0  |       38 | {4,7,9}  |            4 |
| /3273/0   | /6546    |       23 | {2,5,7}  |            5 |
| /6546     | /6546/0  |       25 | {6,7,10} |            6 |
| /6546/0   | /9819    |       27 | {2,6,7}  |            6 |
| /9819     | /9819/0  |       29 | {6,7,10} |            6 |
| /9819/0   | /13092   |       61 | {5,7,9}  |            9 |
| /13092    | /13092/0 |       63 | {3,4,9}  |            4 |
| /13092/0  | /16365   |       65 | {5,6,7}  |            6 |
+-----------+----------+----------+----------+--------------+

See all of those empty ranges, such as /3273-/3273/0? We're correctly sending ClearRange operations to each of those ranges, but because they are empty they don't result in suggested compactions, due to this code. That code exists to avoid creating a range tombstone when only a small amount of data is being deleted. The result is that we see suggested compactions for /0/0-/3273, then /3273/0-/6546, then /6546/0-/9819, etc. So this answers the mystery of the gap between suggested compactions. Let's move the discussion of the cause of the empty ranges to #26375.

bdarnell commented 6 years ago

ClearRange should probably suggest a compaction even if it decides not to use a range tombstone. The compactor will ignore the suggestion if it's too small, but this allows it to stitch up the consecutive gaps.

petermattis commented 6 years ago

ClearRange should probably suggest a compaction even if it decides not to use a range tombstone. The compactor will ignore the suggestion if it's too small, but this allows it to stitch up the consecutive gaps.

The compactor already aggregates compactions together which don't overlap. See https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/compactor/compactor.go#L389-L400. But I don't allow aggregation for "clearing" compactions (in my PR). I could certainly have some additional logic to fix this. I'm running an experiment right now to see if it would be worthwhile as we should be able to have a far smaller number of range tombstones and perform far fewer compactions.
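
For reference, the aggregation being discussed is essentially a merge of touching or overlapping suggestion spans, so even tiny per-range suggestions over a contiguous dropped table collapse into a few larger compactions. A simplified sketch (suggestion is a stand-in for the real proto, and the real compactor also bounds how much it aggregates):

package compactorsketch

import "bytes"

type suggestion struct {
	Start, End []byte
	Bytes      int64 // estimated bytes that the compaction would reclaim
}

// mergeAdjacent collapses suggestions whose spans touch or overlap into a
// single suggestion, summing their byte estimates. The input must be sorted
// by start key. Suggesting a compaction even for empty ranges lets this pass
// stitch up the gaps described above.
func mergeAdjacent(sorted []suggestion) []suggestion {
	var out []suggestion
	for _, s := range sorted {
		if n := len(out); n > 0 && bytes.Compare(s.Start, out[n-1].End) <= 0 {
			if bytes.Compare(s.End, out[n-1].End) > 0 {
				out[n-1].End = s.End
			}
			out[n-1].Bytes += s.Bytes
			continue
		}
		out = append(out, s)
	}
	return out
}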

bdarnell commented 6 years ago

To return to @tschottdorf 's "blocking below raft" idea, what if we did something similar but blocked above raft? I'm thinking of a new semaphore tracking the number of range tombstones that have been written but not compacted away. We'd acquire the semaphore early when processing ClearRange (in beginCmds, around the time of the command queue), then release it either when ClearRange decides not to write a range tombstone or when the compaction it suggests has been processed (with some tuning of the compactor to not let clearing compactions sit around for too long).

This would limit the number of range tombstones that could be written within a single incarnation of the process. Would we need to persist something across restarts? Maybe, but hopefully not if we set the limit low enough.
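
A minimal sketch of such a limiter, using a buffered channel as the semaphore; where exactly Acquire and Release would hook into beginCmds and the compactor is the open question above:

package tombstonelimit

import "context"

// tombstoneSem caps the number of range tombstones that have been written but
// not yet compacted away. Acquire would run while a ClearRange is being
// processed above raft; Release would run either when the command decides not
// to write a range tombstone or when the compactor finishes the corresponding
// clearing compaction.
type tombstoneSem chan struct{}

func newTombstoneSem(limit int) tombstoneSem { return make(tombstoneSem, limit) }

func (s tombstoneSem) Acquire(ctx context.Context) error {
	select {
	case s <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (s tombstoneSem) Release() { <-s }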

benesch commented 6 years ago

Check it out:

[screenshot: 2018-06-06 00:14]

That's with nothing but some ClearRange rate limiting. See #26449.

benesch commented 6 years ago

Shoot. I just ran an experiment on top of the compactor fix (#26488) and the clear range rate limiter (#26449) that crashed and burned:

[screenshot: 2018-06-06 9:17 pm]

Only two of the nodes seemed to be truly suffering, which I suppose is an improvement. It was enough to grind the schema change to a halt, though.

I didn't have enough logging on to tell exactly what caused the explosion, but the end result was 40 goroutines stuck in syscalls for multiple minutes: https://gist.github.com/benesch/cbe7fff9e1005056ca4355f3840a8c80

My money is on the consistency queue. If it's not the root cause it's certainly exacerbating the situation. There are 31 goroutines in computeChecksumPostApply.func1 in this stack trace, and I captured one earlier with 65 (!) such goroutines.

It seems that the consistency queue is missing a backpressure mechanism. If a leaseholder doesn't hear back from a follower within 5s, it assumes the follower is dead and moves on to the next replica; if the follower is not, in fact, dead, but severely overloaded, these consistency queue checks just pile up and pile up.

@petermattis @tschottdorf can either of you take a look at the stack trace and lend your insight?

tbg commented 6 years ago

Just to confirm your observation of the cons checker as the root cause, could you go through a few runs with the consistency checker turned off? And if that fixes the problem, how about a version where the cons checker runs, but with a very high timeout (instead of the 5s)? Once we have that data, I think we can make an informed decision:

  1. either a single consistency check at a time is enough to bring the cluster to its knees, which would indicate that we need to tackle the range tombstones more directly (if a consistency check becomes really slow, real user traffic would, too -- though we should test that anyway)
  2. if it's really the piling up that causes the problem, we can either avoid the piling up via context cancellation (I assume this just doesn't happen right now - should implement that either way) and/or rate limiting the number of active consistency check post apply triggers.
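
A sketch of what option 2 could look like: the follower-side checksum computation honors cancellation of the caller's context, and a small token pool bounds how many computations run at once instead of letting them pile up. The names here are illustrative, not the actual computeChecksumPostApply code:

package consistencysketch

import "context"

// checkTokens bounds the number of concurrent checksum computations per node.
// The capacity is arbitrary for the sketch.
var checkTokens = make(chan struct{}, 2)

// runConsistencyCheck waits for a token, but gives up if the caller's context
// is canceled, so checks don't accumulate indefinitely on an overloaded
// follower once the leaseholder has moved on.
func runConsistencyCheck(ctx context.Context, compute func(context.Context) error) error {
	select {
	case checkTokens <- struct{}{}:
		defer func() { <-checkTokens }()
	case <-ctx.Done():
		return ctx.Err()
	}
	return compute(ctx)
}
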
petermattis commented 6 years ago

I didn't have enough logging on to tell exactly what caused the explosion, but the end result was 40 goroutines stuck in syscalls for multiple minutes: https://gist.github.com/benesch/cbe7fff9e1005056ca4355f3840a8c80

The interesting part to me is that these goroutines are stuck in engine._Cfunc_DBIterSeek. Excessive range tombstones slow down iterator creation, but once the RangeDelAggregator is populated the overhead for seeking should be minimal (tombstones still have to be added for each sstable, but below L0 there shouldn't be many).

Perhaps we've stumbled upon another RocksDB bug. I'm struggling to imagine what could cause a Seek to take multiple minutes.