cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: clearrange/zfs/checks=true failed #68303

Closed: cockroach-teamcity closed this issue 2 years ago

cockroach-teamcity commented 3 years ago

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 701b177d8f4b81d8654dfb4090a2cd3cf82e63a7:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/roachtest)

/cc @cockroachdb/storage


jbowens commented 2 years ago

Can we start there? Doesn't pebble have a mode where it crashes on open iterators & prints a stack trace on Close()?

During (*pebble.DB).Close, Pebble will return an error with the count of open iterators, if there are any. Unfortunately, it looks like that error is currently unchecked: https://github.com/cockroachdb/cockroach/blob/e27baa02ee9abc50d4445876e88c07181be77b55/pkg/storage/pebble.go#L891

I think we should be checking that error, but I suspect there will be many unit tests to fix up to get there.

I'm not sure how to track down this leaked iterator using (*pebble.DB).Close though.
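
For illustration, here is a minimal, hypothetical sketch of what checking that error could look like; this is not the actual pkg/storage code, and closeEngine is a made-up helper, but the (*pebble.DB).Close behavior it relies on is the one described above:

package main

import (
    "log"

    "github.com/cockroachdb/pebble"
    "github.com/cockroachdb/pebble/vfs"
)

// closeEngine is a hypothetical helper (not the actual pkg/storage code): it
// surfaces the error from (*pebble.DB).Close instead of discarding it, so any
// leaked iterators are reported at shutdown.
func closeEngine(db *pebble.DB) {
    if err := db.Close(); err != nil {
        // Pebble's Close returns an error if iterators are still open,
        // including a count of the leaked iterators.
        log.Fatalf("error closing pebble engine: %v", err)
    }
}

func main() {
    // Open a throwaway in-memory store just to exercise the helper.
    db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
    if err != nil {
        log.Fatal(err)
    }
    closeEngine(db)
}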

jbowens commented 2 years ago

Forgot about this: #71481

petermattis commented 2 years ago

In race builds, Pebble keeps track of the stack trace for live iterators (see table_cache.go). We could patch Pebble to always keep track of the stacks for live iterators, and add a call to periodically dump this info to the logs. Since we have a way to reproduce this leakage, I'm imagining that this could be used to find iterators that are unexpectedly sticking around. I'd also add a creation timestamp to pebble.Iterator so that we could easily identify iterators that have been around for a while.
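
A rough sketch of the kind of instrumentation described above (all names here are illustrative, not Pebble's actual internals): record a stack trace and creation timestamp per live iterator, and periodically dump the ones that have been open for a long time, much like the "XXX slow iter" lines printed further down in this thread:

package main

import (
    "fmt"
    "runtime/debug"
    "sync"
    "time"
)

// iterDebug records where and when an iterator was created.
type iterDebug struct {
    created time.Time
    stack   []byte
}

var (
    liveMu    sync.Mutex
    liveIters = map[*iterDebug]struct{}{}
)

// trackIter would be called wherever an iterator is constructed.
func trackIter() *iterDebug {
    d := &iterDebug{created: time.Now(), stack: debug.Stack()}
    liveMu.Lock()
    defer liveMu.Unlock()
    liveIters[d] = struct{}{}
    return d
}

// untrackIter would be called from the iterator's Close.
func untrackIter(d *iterDebug) {
    liveMu.Lock()
    defer liveMu.Unlock()
    delete(liveIters, d)
}

// dumpSlowIters logs every tracked iterator older than the threshold.
func dumpSlowIters(threshold time.Duration) {
    liveMu.Lock()
    defer liveMu.Unlock()
    for d := range liveIters {
        if age := time.Since(d.created); age > threshold {
            fmt.Printf("XXX slow iter: %s:\n%s\n", age, d.stack)
        }
    }
}

func main() {
    d := trackIter()
    defer untrackIter(d)
    dumpSlowIters(0) // reports the iterator we just "opened"
}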

tbg commented 2 years ago

Thanks for the pointers. Hacked something together, on top of the known bad sha (that has the cancel problem): https://github.com/tbg/cockroach/compare/cb-bad...tbg:cb-bad-monitored?expand=1

If this works as I expect (I probably screwed something up and expect to have to do it over a few times...), we should see the stack traces for the iterators pretty easily.

tbg commented 2 years ago

Are these compaction iterators supposed to be open for that long? The iterator is created in (*DB).runCompaction and passed to newCompactionIter. There is a defer iter.Close(), so I don't think the code leaks it. I didn't think a compaction would hang around for 16 minutes, so I thought I'd share.

XXX slow iter: 16m16.407468278s:
goroutine 9172 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/cockroachdb/pebble.(*tableCacheShard).newIters(0xc00123c500, 0x0, 0x0, 0xc016bd7710, 0xc000e72a08)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:385 +0x4db
github.com/cockroachdb/pebble.(*tableCacheContainer).newIters(0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:119 +0x45
github.com/cockroachdb/pebble.(*compaction).newInputIter(0xc016bd7680, 0xc0006b14f0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:1135 +0xe78
github.com/cockroachdb/pebble.(*DB).runCompaction(0xc00092fb00, 0x13, 0xc016bd7680, {0x8205440, 0xb2fa3c0})
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:2063 +0xd79
github.com/cockroachdb/pebble.(*DB).compact1(0xc00092fb00, 0xc016bd7680, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:1932 +0x2de
github.com/cockroachdb/pebble.(*DB).compact.func1({0x82a0528, 0xc001b023c0})
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:1893 +0xad
runtime/pprof.Do({0x82a04b8, 0xc000074110}, {{0xc0004bd840, 0x1027400, 0xc000c6b560}}, 0xc002155f88)
        /usr/local/go/src/runtime/pprof/runtime.go:40 +0xa3
github.com/cockroachdb/pebble.(*DB).compact(0xc002155f60, 0xc002155f70, 0xc002155f90)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:1890 +0x6b
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:1686 +0x5c5

edit: they're still hanging around:

XXX slow iter: 56m17.910040674s:
goroutine 14294 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/cockroachdb/pebble.(*tableCacheShard).newIters(0xc00123c5a0, 0xb0, 0x0, 0xc0024a5490, 0xc000e72a08)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:385 +0x4db
github.com/cockroachdb/pebble.(*tableCacheContainer).newIters(0x47f78c0, 0x4a98bc0, 0x47f7800, 0xc05f8cac00)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:119 +0x45
github.com/cockroachdb/pebble.(*compaction).newInputIter(0xc0024a5400, 0xc0006b14f0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:1135 +0xe78
github.com/cockroachdb/pebble.(*DB).runCompaction(0xc00092fb00, 0x1da, 0xc0024a5400, {0x8205440, 0xb2fa3c0})
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:2063 +0xd79
github.com/cockroachdb/pebble.(*DB).compact1(0xc00092fb00, 0xc0024a5400, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:1932 +0x2de
github.com/cockroachdb/pebble.(*DB).compact.func1({0x82a0528, 0xc0a3e88a20})
        /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/compaction.go:1893 +0xad
runtime/pprof.Do({0x82a04b8, 0xc000074110}, {{0xc0004bd840, 0x100000001, 0xc0026e5380}}, 0xc001189788)
        /usr/local/go/src/runtime/pprof/runtime.go:40 +0xa3
tbg commented 2 years ago

I'm not seeing any long-open iterators originating from a cockroachdb/pkg stack (they occasionally show up, but then fade away again). All I have at the end of the import (on n1) are 402 stacks like the above (i.e. 402 open iters at the time), though most of them are in the 2-15 minute range.

petermattis commented 2 years ago

I believe the Pebble logs should have some indication of how long compactions are taking. Is there evidence in those logs of compactions taking 10s of minutes?

If I recall correctly, a compaction iterator doesn't have the same pinning behavior as a pebble.Iterator. With a pebble.Iterator there is an implicit snapshot of the sstable state that is pinned until the iterator is closed. With a compaction iterator I think we only pin the sstables that are part of the compaction. And such sstables don't show up as "zombie sstables" (i.e. ztbls) because they are in the process of being compacted. So I'm not sure if very long running compactions explains the high ztbl value.

I may have led you astray by pointing you to table_cache.go. That structure maintains a set of open sstable.Iterators, but a pebble.Iterator can be open without a corresponding sstable.Iterator being open. And an open pebble.Iterator should be the cause of high zombie sstable values.

tbg commented 2 years ago

Is there evidence in those logs of compactions taking 10s of minutes?

No. The log files also didn't seem to have rotated out (file name timestamp = cluster start).

$ grep 'compacted' cockroach-pebble.* | grep -Eo 'in [^ ]+' | sort | uniq
in 0.0s
in 0.1s
in 0.2s
in 0.3s
in 0.4s
in 0.5s
in 0.6s
in 0.7s
in 0.8s
in 0.9s
in 1.0s
in 1.1s
in 1.2s
in 1.3s
in 1.4s
in 1.6s
in 1.9s
in 2.1s
in 2.3s
in 2.4s
in 3.5s
in 4.0s
in 6.4s
petermattis commented 2 years ago

Is there evidence in those logs of compactions taking 10s of minutes?

No. The log files also didn't seem to have rotated out (file name timestamp = cluster start).

Huh? That's weird. I'm not going to have time today to look at this further. Let's tag in @jbowens or someone else from the storage team.

nicktrav commented 2 years ago

I will pick up the trail on the long running compactions piece. @tbg - if there are any newer artifacts for your specific runs I can look at, can you drop us a link?

tbg commented 2 years ago

Apologies, don't have them any more (running too many experiments these days), but here is the code:

https://github.com/tbg/cockroach/compare/cb-bad...tbg:cb-bad-monitored?expand=1

Here's how I invoked it:

$ cat repro.sh 
#!/bin/bash
set -euxo pipefail

export GCE_PROJECT=andrei-jepsen

git fetch tbg

git checkout ${1}
git reset --hard @{upstream}
git submodule update --init
./pkg/cmd/roachtest/roachstress.sh -c 1 -u clearrange/checks=false -- --debug --cpu-quota 9999
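
For reference, the script takes the branch to test as its first argument, so (using the branch from the compare link above) an invocation would look roughly like:

$ ./repro.sh cb-bad-monitored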
tbg commented 2 years ago

(You'll want to audit my pebble changes!)

tbg commented 2 years ago

In other news, I just ran the test on master and it still failed. At least one node ran out of disk again:

Jan 26 14:55:58 tobias-1643207278-01-n10cpu16-0005 bash[8244]: cockroach exited with code 10: Wed Jan 26 14:55:58 UTC 2022

I had previously verified that cb-good plus https://github.com/cockroachdb/cockroach/pull/75448 reliably passes. So maybe there is something else wrong here. I will investigate tomorrow.

Edit: I ran master four more times, and in all cases nodes ran out of disk. So this reliably fails. (Sometimes, it manifests as a test timeout because clearrange doesn't reliably use c.Monitor everywhere). And the ztbls look the same:

cockroach.tobias-1643212784-04-n10cpu16-0007.ubuntu.2022-01-26T17_24_58Z.008076.log:I220126 17:32:39.179296 393 kv/kvserver/store.go:3199 ⋮ [n7,s7] 222594 + ztbl 5739 98 G

cockroach-teamcity commented 2 years ago

roachtest.clearrange/checks=true failed with artifacts on master @ e5d1c374c31dc0e80a596c570da8dc45d73f80b8:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #70306 roachtest: clearrange/zfs/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.2]


nicktrav commented 2 years ago

I have confirmed that there were no long running compactions (>1 minute) across any of the nodes in this failed run from 1/25. The longest I could find was ~40 seconds.

I also looked at n10 (the one that ran out of disk) and tracked the pebble LSM printouts. It looks like the iterator leak is there from the onset (the node started at 09:36:14, and was restarted as part of the test at 10:16):

teamcity-4198403-1643096133-32-n10cpu16-0010> I220125 09:36:34.412168 390 kv/kvserver/store.go:3183 ⋮ [n10,s10] 186
zmemtbl         1   256 K
   ztbl        33   404 M
teamcity-4198403-1643096133-32-n10cpu16-0010> I220125 09:46:34.324019 390 kv/kvserver/store.go:3183 ⋮ [n10,s10] 1579
zmemtbl        28    27 M
   ztbl      2648    35 G
teamcity-4198403-1643096133-32-n10cpu16-0010> I220125 09:56:34.311856 390 kv/kvserver/store.go:3183 ⋮ [n10,s10] 2472
zmemtbl        49    48 M
   ztbl      3147    40 G
teamcity-4198403-1643096133-32-n10cpu16-0010> I220125 10:06:34.301719 390 kv/kvserver/store.go:3183 ⋮ [n10,s10] 3150
zmemtbl        70    73 M
   ztbl      3882    51 G
teamcity-4198403-1643096133-32-n10cpu16-0010> I220125 10:17:07.361337 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 111
zmemtbl         1   256 K
   ztbl       670    14 G
teamcity-4198403-1643096133-32-n10cpu16-0010> I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733
zmemtbl        28    27 M
   ztbl      2368    90 G

This probably isn't new information, but we can at least rule out any issue caused by the node restarting. It's interesting that within 20 seconds of the test starting we already have 400M of zombie tables.

Going to look into patching Pebble some more to track these leaks at the pebble.Iterator level.

petermattis commented 2 years ago

Btw, I think this is the first time I've seen the ztbl metric significantly non-zero. Keep in mind the possibility that there is a bug around tracking it, as the code to track that metric is non-trivial.

tbg commented 2 years ago

I hope @nicktrav will just pinpoint the (supposed) leak soon, but as an additional datapoint, if I take master (at f918e4bae4c24fc2cad3a65292a0f347c9785232) and hard-code the consistency checker to disabled, >30 minutes in I don't see the ztbl build-up that Nick sees above (where it goes to 35G within 10min); I see anywhere between 0 and 1.5G but also perhaps more importantly the last value isn't always the largest value.

With the consistency checker on, this fails 100% of the time. I don't think these current runs will fail.

nicktrav commented 2 years ago

There's a confounding factor in that there seems to be some raciness with how the LSM metrics are updated (see cockroachdb/pebble#1478). The race could just be in the test, but it warrants some investigation.

I'm going to investigate that first, so we can at least have some confidence in the ztbl metric.

nicktrav commented 2 years ago

I unfortunately wasn't able to track this down ... yet. I spent most of my time reading through how the zombie tables / memtables are tracked and reported - found a few little bugs / improvements along the way that needed fixing (see above, for example).

Note to self for tomorrow (unless someone beats me to it): instrument pebble.readState to track when a pebble.Iterator calls ref / unref on it. Tracking something like a map of active Iterators, where the value contains information about where the Iterator was opened (or cloned), would be useful for narrowing in on this. Then we can periodically report on readStates that are hanging around due to a non-zero ref count and are therefore keeping a table in a zombie state.
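
As a rough sketch of that instrumentation idea (all names here are assumptions, not Pebble's actual readState internals): keep each referencing iterator's open site and open time on the readState, so lingering references can be reported together with the call sites keeping them alive:

package main

import (
    "fmt"
    "runtime/debug"
    "sync"
    "time"
)

// iterToken records where and when a referencing iterator was opened.
type iterToken struct {
    opened time.Time
    stack  []byte
}

// readState here is a stand-in for pebble's readState: instead of a bare
// refcount it remembers each referencing iterator.
type readState struct {
    mu   sync.Mutex
    refs map[*iterToken]struct{}
}

// ref would be called when an iterator takes a reference; the token is kept
// on the iterator and handed back to unref from its Close.
func (rs *readState) ref() *iterToken {
    t := &iterToken{opened: time.Now(), stack: debug.Stack()}
    rs.mu.Lock()
    defer rs.mu.Unlock()
    if rs.refs == nil {
        rs.refs = map[*iterToken]struct{}{}
    }
    rs.refs[t] = struct{}{}
    return t
}

func (rs *readState) unref(t *iterToken) {
    rs.mu.Lock()
    defer rs.mu.Unlock()
    delete(rs.refs, t)
}

// reportHeld prints any reference older than the threshold, i.e. an iterator
// that is keeping this readState (and its zombie tables) alive.
func (rs *readState) reportHeld(threshold time.Duration) {
    rs.mu.Lock()
    defer rs.mu.Unlock()
    for t := range rs.refs {
        if age := time.Since(t.opened); age > threshold {
            fmt.Printf("readState ref held for %s, opened at:\n%s\n", age, t.stack)
        }
    }
}

func main() {
    rs := &readState{}
    t := rs.ref()
    rs.reportHeld(0) // reports the reference we just took
    rs.unref(t)
}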

tbg commented 2 years ago

I hope @nicktrav will just pinpoint the (supposed) leak soon, but as an additional datapoint, if I take master (at f918e4b) and hard-code the consistency checker to disabled, >30 minutes in I don't see the ztbl build-up that Nick sees above (where it goes to 35G within 10min); I see anywhere between 0 and 1.5G but also perhaps more importantly the last value isn't always the largest value.

With the consistency checker on, this fails 100% of the time. I don't think these current runs will fail.

I spoke too early. Two out of four did fail, with nodes running out of disk. However, no large ztbls. Remember, this is master with the consistency checker disabled. So at least on that master SHA (which we ultimately care about) we see out-of-disk, but it can't possibly be connected to the consistency checker. Here are the ztbl printouts of the out-of-disk nodes:

$ grep -Eo 'ztbl.*' logs/9.unredacted/cockroach.log
ztbl        21   208 M
ztbl         0     0 B
ztbl         0     0 B
ztbl         0     0 B
ztbl         0     0 B
ztbl         0     0 B
ztbl         0     0 B

Here's the last stats printout for the above node:

I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908  
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +    WAL         1    49 K       -    38 M       -       -       -       -    40 M       -       -       -     1.1
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +      0         0     0 B    0.00    40 M   1.3 G     106     0 B       0   5.1 M     220     0 B       0     0.1
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +      2         0     0 B    0.00     0 B     0 B       0   3.3 M     157     0 B       0     0 B       0     0.0
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +      3         0     0 B    0.00   1.3 G   237 M      15    19 M     126   1.4 G     477   1.5 G       0     1.1
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +      4        10   1.4 M    0.41   238 M   4.5 G     298   1.2 G     340   1.3 G   1.0 K   1.4 G       1     5.7
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +      5      2357    37 G    1.00   4.7 G    61 G   3.9 K   748 M     196   9.8 G   1.1 K   9.8 G       1     2.1
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +      6     18872   308 G       -    20 G   372 G    24 K   186 M      68    51 G   2.1 K    51 G       1     2.5
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +  total     21239   345 G       -   439 G   439 G    28 K   2.2 G     887   502 G   4.9 K    64 G       3     1.1
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +  flush       203
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +compact      3131     0 B     0 B       0          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +  ctype      1997     228      19     887       0  (default, delete, elision, move, read)
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 + memtbl         1   8.0 M
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +zmemtbl         0     0 B
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 +   ztbl         0     0 B
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 + bcache     116 K   1.5 G   37.3%  (score == hit-rate)
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 + tcache      19 K    12 M   98.8%  (score == hit-rate)
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 + titers         0
I220126 21:30:45.027623 405 kv/kvserver/store.go:3199 ⋮ [n9,s9] 908 + filter         -       -   93.5%  (score == utility)

On another run (n9 died too, but it's not the same as above):

$ grep -Eo 'ztbl.*' logs/9.unredacted/cockroach.log
ztbl        58   742 M
ztbl         6   513 M
ztbl        79   1.5 G
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270  
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +    WAL         1   227 K       -    46 M       -       -       -       -    48 M       -       -       -     1.0
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +      0         0     0 B    0.00    47 M   617 M     107     0 B       0   6.8 M      78     0 B       0     0.1
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +      2         0     0 B    0.00   1.4 M    62 K      53   724 K      12   1.6 M      46   1.7 M       0     1.1
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +      3        14    49 M    0.10   149 M   698 M      99    17 M      33   247 M     135   249 M       1     1.7
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +      4        55   579 M    0.93   1.0 G   3.9 G     236   198 M      79   2.5 G     478   2.5 G       1     2.4
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +      5      1968    29 G    1.00   4.2 G    39 G   2.6 K   383 M     132   8.9 G   1.1 K   8.9 G       1     2.1
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +      6     13833   225 G       -    26 G   236 G    14 K   885 M      69    45 G   2.1 K    50 G       1     1.7
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +  total     15870   254 G       -   280 G   280 G    18 K   1.4 G     325   337 G   3.9 K    62 G       4     1.2
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +  flush        72
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +compact      2044     0 B     0 B       0          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +  ctype      1436      59     220     325       4  (default, delete, elision, move, read)
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 + memtbl         1   8.0 M
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +zmemtbl         1   8.0 M
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 +   ztbl        79   1.5 G
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 + bcache     248 K   3.5 G   27.5%  (score == hit-rate)
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 + tcache      16 K    10 M   98.1%  (score == hit-rate)
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 + titers         5
I220126 20:50:34.967932 378 kv/kvserver/store.go:3199 ⋮ [n9,s9] 2270 + filter         -       -   96.2%  (score == utility)

However, I looked through the metrics for these runs, and it seems increasingly clear that these nodes ran out of disk due to an imbalance, where n9 just takes on way more data than other nodes.

[image: per-node metrics illustrating the imbalance described above]

Comparing the live bytes on n9 to those on n1 (I know, lots of estimates in there, but more means more I'm pretty sure) shows that n9 just stores a lot more data:

[images: live bytes on n9 vs. n1]

Something similar can be seen on the write mb/sec graphs.

So we see two things: the consistency checker is responsible for the ztbl build-up, and even without the consistency checker the test still fails (though less reliably) with nodes running out of disk due to replica imbalance.

cockroach-teamcity commented 2 years ago

roachtest.clearrange/checks=true failed with artifacts on master @ 83e2df701688745fe7e7d8a0d0e5e7a4ba8633c8:

          | I220127 10:00:41.318878 359 workload/pgx_helpers.go:79  [-] 13  pgx logger [error]: Exec logParams=map[args:[-3575643074335026364 d3] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.318898 343 workload/pgx_helpers.go:79  [-] 14  pgx logger [error]: Exec logParams=map[args:[222039589374760686 fb] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.318902 56 workload/pgx_helpers.go:79  [-] 15  pgx logger [error]: Exec logParams=map[args:[2625843109135472271 1a] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.318909 354 workload/pgx_helpers.go:79  [-] 16  pgx logger [error]: Exec logParams=map[args:[-6655112103693333074 71] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.318661 349 workload/pgx_helpers.go:79  [-] 17  pgx logger [error]: Exec logParams=map[args:[-4936849627709523729 66] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.318913 352 workload/pgx_helpers.go:79  [-] 18  pgx logger [error]: Exec logParams=map[args:[7349738824889806049 41] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.321241 356 workload/pgx_helpers.go:79  [-] 19  pgx logger [error]: Exec logParams=map[args:[-8268600697621342809 c9] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.321268 345 workload/pgx_helpers.go:79  [-] 20  pgx logger [error]: Exec logParams=map[args:[-8995084842725838189 4e] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.321279 57 workload/pgx_helpers.go:79  [-] 21  pgx logger [error]: Exec logParams=map[args:[-4526010663911828276 4d] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.321293 347 workload/pgx_helpers.go:79  [-] 22  pgx logger [error]: Exec logParams=map[args:[-5865337864101957698 a8] err:unexpected EOF sql:kv-2]
          | I220127 10:00:41.321310 357 workload/pgx_helpers.go:79  [-] 23  pgx logger [error]: Exec logParams=map[args:[2238598817082220739 2f] err:unexpected EOF sql:kv-2]
          | Error: unexpected EOF
          | I220127 10:00:41.321327 346 workload/pgx_helpers.go:79  [-] 24  pgx logger [error]: Exec logParams=map[args:[-61153047300492400 94] err:unexpected EOF sql:kv-2]
          | COMMAND_PROBLEM: exit status 1
          |   10: 
          | UNCLASSIFIED_PROBLEM: context canceled
        Wraps: (4) secondary error attachment
          | COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 9. Command with error:
          |   | ``````
          |   | ./cockroach workload run kv --concurrency=32 --duration=1h
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

    monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 9: dead (exit status 10)
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (4) monitor command failure
        Wraps: (5) unexpected node event: 9: dead (exit status 10)
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #70306 roachtest: clearrange/zfs/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.2]


tbg commented 2 years ago

Next experiment: master but we're not making the Replica.Send context cancelable:

https://github.com/cockroachdb/cockroach/compare/master...tbg:cb-bad-master-no-cancel?expand=1

Three out of four failed, nodes running out of disk. The replicas per node are quite imbalanced, the ztbls stay bounded/small. Also, this happens during the IMPORT phase; we never even drop the table.

Starting to look increasingly likely that there are two failure modes. On the initial bad sha, it's related to some SST leakage (perhaps), somehow connected to ctx cancellation in the consistency checksum computation.

On the master, even taking ctx cancellation completely out of the picture, we're running out of disk as well, but it looks to be due to a replica imbalance.

nicktrav commented 2 years ago

On the initial bad sha, it's related to some SST leakage (perhaps), somehow connected to ctx cancellation in the consistency checksum computation.

I'm unfamiliar with this part of the database, but reasoning through it now, I'm guessing it's expected to see ztbl increase during consistency checks? Doesn't the check need to iterate over the LSM? Assuming we open a pebble.Iterator for this, then if the tables in the LSM are churning (lots going on due to the presence of RANGEDELs for the clearrange), the iterator keeps those tables in a zombie state. Then the context cancellation somehow leaves the iterator open?

This is my hypothesis to test today (hopefully) by instrumenting the readState.

petermattis commented 2 years ago

The consistency checker checks a single range at a time, so it is scanning a small fraction of the LSM. The scanning is throttled somehow, though I'm not familiar with the details. I'd expect an iterator for the consistency checker to only be open for a few seconds. Worth checking whether that is the case.

tbg commented 2 years ago

That's the theory I have spent most of the time on (see, for example, https://github.com/cockroachdb/cockroach/issues/68303#issuecomment-1022205794). I am fairly certain we don't leak snapshots, and most of the time we have no snapshots open (according to my hacky pebble instrumentation, which you have seen). The consistency check creates a single iterator, which it also closes no matter what. I would be delighted for your instrumentation to prove me wrong; there must be something about the consistency checks, since they conclusively correlate with the high ztbl count. But it could be that the consistency check is simply the only thing reading the data, and that this alone triggers a problem.

nicktrav commented 2 years ago

Starting to get a trickle of data in from a stress test that I started, and seeing some iterators hang around for ~6 mins (and it's been roughly 6 mins since the DB restarted as part of the clearrange test).

Here's an example (I will continue to update as I see different call stacks). It's clearly showing the consistency checker:

iter: merging, age: 5m59.353136117s (allocated @ 2022-01-27 19:35:37.58971254 +0000 UTC m=+13.010272218), stack:
goroutine 2508 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/cockroachdb/pebble.newIterDebug(0xc0038d6000)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/read_state.go:146 +0x59
github.com/cockroachdb/pebble.(*readState).addDebug(0xc003259080, 0x1000)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/read_state.go:171 +0x4c
github.com/cockroachdb/pebble.(*DB).newIterInternal(0xc000040480, 0x0, 0xc0026d1380, 0xc004cf6c88)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:912 +0x295
github.com/cockroachdb/pebble.(*Snapshot).NewIter(0xc0026d1380, 0xc0048248c8)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/snapshot.go:48 +0x2c
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).init(0xc004cf6c80, {0x82fae10, 0xc0026d1380}, {0x0, 0x0}, {0x0, {0xaf81cee, 0x1, 0x1}, {0xc0008f9318, ...}, ...})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:179 +0x7b6
github.com/cockroachdb/cockroach/pkg/storage.newPebbleIterator({0x82fae10, 0xc0026d1380}, {0x0, 0x0}, {0x0, {0xaf81cee, 0x1, 0x1}, {0xc0008f9318, 0x6, ...}, ...})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:91 +0xbb
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleSnapshot).NewMVCCIterator(0xc0045e0df8, 0xc004784030, {0x0, {0xaf81cee, 0x1, 0x1}, {0xc0008f9318, 0x6, 0x8}, 0x0, ...})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble.go:2018 +0x185
github.com/cockroachdb/cockroach/pkg/storage.newIntentInterleavingIterator({0x83f5e08, 0xc0045e0df8}, {0x0, {0xaf81cee, 0x1, 0x1}, {0xc0008f9318, 0x6, 0x8}, 0x0, ...})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:234 +0x542
github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingReader).NewMVCCIterator(0x0, 0x0, {0x0, {0x0, 0x0, 0x0}, {0xc0008f9318, 0x6, 0x8}, 0x0, ...})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/storage/intent_reader_writer.go:166 +0xf8
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleSnapshot).NewMVCCIterator(0xc0045e0df8, 0x0, {0x0, {0x0, 0x0, 0x0}, {0xc0008f9318, 0x6, 0x8}, 0x0, ...})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble.go:2011 +0xcc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sha512(0xc002e07500, {0x835b888, 0xc0026d13b0}, {0x104, {0xc0008f9310, 0x6, 0x8}, {0xc0008f9318, 0x6, 0x8}, ...}, ...)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_consistency.go:634 +0x817
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).computeChecksumPostApply.func1.1({0x83f5e08, 0xc0045e0df8}, {{0xff, 0xb8, 0xba, 0x3f, 0x82, 0xba, 0x4e, 0x5c, ...}, ...}, ...)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:249 +0x16a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).computeChecksumPostApply.func1({0x835b888, 0xc0026d13b0})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:255 +0xa5
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x16f
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:483 +0x445
nicktrav commented 2 years ago

For these long-lived iterators, I looked at their goroutines, and it seems like they are all stuck waiting to acquire something from the quota pool?

goroutine 2508 [select]:
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(0xc0004520b0, {0x835b888, 0xc0026d13b0}, {0x82dff40, 0xc00a5c1a90})
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:281 +0x77c
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*RateLimiter).WaitN(0xc0013dce10, {0x835b888, 0xc0026d13b0}, 0x2811)
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/quotapool/int_rate.go:59 +0xec
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sha512.func1({{0xc001efa200, 0x7, 0x20}, {0x16ce35e02b2ff787, 0x0, 0x0}}, {0x7fe1574731df, 0x280a, 0x280a})
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_consistency.go:581 +0xe6
github.com/cockroachdb/cockroach/pkg/storage.ComputeStatsForRange({0x83c69b0, 0xc002c1b300}, {0xc0008f9310, 0x6, 0x8}, {0xc0008f9318, 0x6, 0x8}, 0x0, {0xc002bb3a38, ...})
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:3914 +0xd43
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sha512(0xc002e07500, {0x835b888, 0xc0026d13b0}, {0x104, {0xc0008f9310, 0x6, 0x8}, {0xc0008f9318, 0x6, 0x8}, ...}, ...)
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_consistency.go:636 +0x8b1
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).computeChecksumPostApply.func1.1({0x83f5e08, 0xc0045e0df8}, {{0xff, 0xb8, 0xba, 0x3f, 0x82, 0xba, 0x4e, 0x5c, ...}, ...}, ...)
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:249 +0x16a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).computeChecksumPostApply.func1({0x835b888, 0xc0026d13b0})
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:255 +0xa5
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 +0x16f
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
    /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:483 +0x445

I see a ton of goroutines with similar states (over 60 on this node, and ztbls = 158 at the time I took the dump).

nvanbenschoten commented 2 years ago

Nice job extracting all of that Nick! That goroutine is interesting both because it is not stuck ([select]) and because there are over 60 of these computeChecksumPostApply functions firing at once on a single node. That indicates that these calls to computeChecksumPostApply are not stuck, but are very slow.

Could this slowness be explained by unexpectedly high consistency checker concurrency? All consistency checks on a single node will share the same consistencyLimiter rate limiter, which defaults to a rate of 8MB/s. Split across 60 ranges, that's 140KB/s per range. So the time to scan a single range will be 512MB / 140KB/s = 3840s = 64m.
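
To make that arithmetic concrete, here's a small back-of-the-envelope sketch; the 8MB/s default, the 60 concurrent checks, and the 512MB range size are the numbers from the paragraph above, and the code just reproduces the division rather than using the actual quotapool API:

package main

import (
    "fmt"
    "time"
)

func main() {
    const (
        sharedRate = 8 << 20   // 8 MB/s consistency-check rate limit, shared per node
        checks     = 60        // concurrent checksum computations observed on the node
        rangeSize  = 512 << 20 // one 512 MB range
    )
    perCheck := float64(sharedRate) / checks // bytes/sec available to each scan
    scanTime := time.Duration(float64(rangeSize) / perCheck * float64(time.Second))
    fmt.Printf("per-check throughput: ~%.0f KB/s\n", perCheck/1024)
    fmt.Printf("time to scan one range: ~%s\n", scanTime.Round(time.Minute))
    // Prints roughly: per-check throughput: ~137 KB/s, time to scan one range: ~64m0s
}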

@tbg would any of the replication circuit breaker work have led to the consistency checker queue detaching its context from an ongoing consistency check and moving on without the consistency check being canceled? If so, could this explain why we have more consistency checks running concurrently than the individual consistencyQueues should allow? And then the shared rate limiter would explain why these checks are getting slower and slower as more consistency checks leak.

nicktrav commented 2 years ago

Should have also mentioned that the stacks from above were taken from latest master (branched from fa93c68218af8094f24f421d233ee9d344b7303c).

Here's a look at what we're calling the "bad" sha (i.e. 6664d0c34df0fea61de4fff1e97987b7de609b9e). Same problem, just much more pronounced:

LSM state (on worst node):

Store 6:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   6.5 M       -   220 M       -       -       -       -   221 M       -       -       -     1.0
      0         0     0 B    0.00   215 M   685 M     137     0 B       0    52 M      72     0 B       0     0.2
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         7    25 M    0.57    39 M     0 B       0     0 B       0    96 M      28    97 M       1     2.4
      3        21   103 M    0.33   252 M   1.2 G      96    31 M      12   309 M     113   329 M       1     1.2
      4       355   4.3 G    1.00   1.2 G    21 G   1.7 K   638 M     100   1.6 G     318   1.6 G       1     1.4
      5      2171    29 G    1.00   7.2 G   108 G   7.1 K    11 G   1.0 K    11 G   1.0 K    11 G       1     1.5
      6      2803    98 G       -    97 G   1.6 G     125    36 M       8   169 G   5.9 K   169 G       1     1.7
  total      5357   131 G       -   132 G   132 G   9.1 K    12 G   1.1 K   314 G   7.4 K   182 G       5     2.4
  flush        36
compact      4326    66 G   4.6 M       1          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype      3159       5      29    1129       4  (default, delete, elision, move, read)
 memtbl         1    64 M
zmemtbl        13   280 M
   ztbl      6294   105 G
 bcache     220 K   3.2 G   20.3%  (score == hit-rate)
 tcache      10 K   6.5 M   97.8%  (score == hit-rate)
 titers      2643
 filter         -       -   98.2%  (score == utility)

530 goroutines computing the consistency checks:

0 | quotapool | quotapool.go:281 | (*AbstractPool).Acquire(#1309, {#4, *}, {#134, *})
1 | quotapool | int_rate.go:59 | (*RateLimiter).WaitN(#307, {#4, *}, *)
2 | kvserver | replica_consistency.go:581 | (*Replica).sha512.func1({{*, *, *}, {*, 0, 0}}, {*, *, *})
3 | storage | mvcc.go:3902 | ComputeStatsForRange({#147, *}, {*, *, *}, {*, *, *}, 0, {*, ...})
4 | kvserver | replica_consistency.go:636 | (*Replica).sha512(*, {#4, *}, {*, {*, *, 8}, {*, *, 8}, ...}, …)
5 | kvserver | replica_proposal.go:247 | (*Replica).computeChecksumPostApply.func1.1({#156, *}, {{*, *, *, *, *, *, *, *, ...}, ...}, …)
6 | kvserver | replica_proposal.go:253 | (*Replica).computeChecksumPostApply.func1({#4, *})
7 | stop | stopper.go:488 | (*Stopper).RunAsyncTaskEx.func2()
nicktrav commented 2 years ago

Going to keep digging on master, as I'm fairly certain that 71f0b3466dbbc6f8fc751e846ca85a87da7410da alleviates much of the problem we were seeing. Though, as PR #75448 mentions, there's likely an alternative failure mode.

nicktrav commented 2 years ago

Last update for the evening. I spent the remainder of today looking less at the ztbl leak (from what I'm seeing, after 71f0b34 we tend to see brief periods of elevation, but never anything near as bad as before that commit), and more at the replica imbalance problem that @tbg mentioned, which is preventing the test from even getting to the actual "clearrange" step.

On the master, even taking ctx cancellation completely out of the picture, we're running out of disk as well, but it looks to be due to a replica imbalance.

Sampling some commits, I'm noticing the following "good" vs. "bad" behavior:

Good (fair allocation of replicas across all nodes):

Screen Shot 2022-01-27 at 8 17 07 PM

Bad (some nodes run out of disk and stall the import):

Screen Shot 2022-01-27 at 7 59 43 PM

I started a bisect, but it was taking some time. I'll pick this up again tomorrow.

tbg commented 2 years ago

@tbg would any of the replication circuit breaker work have led to the consistency checker queue detaching its context from an ongoing consistency check and moving on without the consistency check being canceled? If so, could this explain why we have more consistency checks running concurrently than the individual consistencyQueues should allow? And then the shared rate limiter would explain why these checks are getting slower and slower as more consistency checks leak.

I think I see what the problem is. I had actually thought about it before, but erroneously convinced myself that it wasn't an issue. Here's what it looks like on the leaseholder on the bad sha (i.e. ctx cancels):

so no concurrency. But step 2 also happens on each follower, and there it will not have a cancelable context associated to it. So there:

So basically the problem is that if a consistency check fails fast on the leader, this doesn't cancel the in-flight computation on the follower. Since each node is a follower for lots of ranges, we had tons of consistency checks running on each node.

What's curious is that when I ran this experiment I should have seen lots of open snapshots, but I didn't. Maybe my instrumentation was wrong, or the test never got to the point where it exhibited this problem (the graceful shutdowns I introduced after the import hung, I think).

With the cancellation fix, we're close to the previous behavior. The only difference is that previously, the computation on the leaseholder was canceled when the consistency checker queue gave up. But like before this wouldn't affect the followers if they still had the computation ongoing.

I think this might put a pin in the high ztbl count, right? Thanks for all of the work getting us here @nicktrav!

tbg commented 2 years ago

How are you bisecting, btw? Are you going through all 319 commits https://github.com/cockroachdb/cockroach/compare/cd1093d5f7...8eaf8d20ea? It sounds as though the strategy for each bisection attempt would be to cherry-pick https://github.com/cockroachdb/cockroach/commit/71f0b3466dbbc6f8fc751e846ca85a87da7410da on top, but are we even sure this is "good" for any of the commits in that range?

nvanbenschoten commented 2 years ago

So basically the problem is that if a consistency check fails fast on the leader, this doesn't cancel the in-flight computation on the follower. Since each node is a follower for lots of ranges, we had tons of consistency checks running on each node.

This makes a lot of sense. I'll still suggest that we should think carefully about whether the client ctx cancellation is the root of the problem, or whether it's actually https://github.com/cockroachdb/cockroach/commit/d06405992f63ec553f804706e69ec1848bd77efb. The ability for a client ctx cancellation to propagate to Raft log application on the proposer replica seems like a serious problem to me. It breaks determinism, the cornerstone of the whole "replicated state machine" idea. I'm actually surprised this hasn't caused worse issues, like a short-circuited split on the proposer. We must just not currently check for context cancellation in many places below Raft.

erikgrinaker commented 2 years ago

I fully agree with Nathan here. That commit was motivated by propagating tracing information through command application, but it should not propagate cancellation signals.

cockroach-teamcity commented 2 years ago

roachtest.clearrange/checks=true failed with artifacts on master @ 71becf337d9d2731298dc092f3ce9cf0f0eedb2c:

          | I220128 10:37:27.336682 337 workload/pgx_helpers.go:79  [-] 23  pgx logger [error]: Exec logParams=map[args:[-6450913955317917568 f1] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.340792 326 workload/pgx_helpers.go:79  [-] 24  pgx logger [error]: Exec logParams=map[args:[8829663467242086327 62] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.336700 343 workload/pgx_helpers.go:79  [-] 25  pgx logger [error]: Exec logParams=map[args:[-3644533257171351169 0b] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.336710 323 workload/pgx_helpers.go:79  [-] 26  pgx logger [error]: Exec logParams=map[args:[3192999095032280912 da] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.336696 346 workload/pgx_helpers.go:79  [-] 27  pgx logger [error]: Exec logParams=map[args:[6493141783003117667 97] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.340831 327 workload/pgx_helpers.go:79  [-] 28  pgx logger [error]: Exec logParams=map[args:[1555708056282946553 c5] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.340861 65 workload/pgx_helpers.go:79  [-] 29  pgx logger [error]: Exec logParams=map[args:[1826535142466176772 5b] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.340876 64 workload/pgx_helpers.go:79  [-] 30  pgx logger [error]: Exec logParams=map[args:[1318876305802062279 3a] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.340893 344 workload/pgx_helpers.go:79  [-] 31  pgx logger [error]: Exec logParams=map[args:[1728666596595591428 ca] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.340911 59 workload/pgx_helpers.go:79  [-] 32  pgx logger [error]: Exec logParams=map[args:[-6613865651839355368 ca] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.340927 61 workload/pgx_helpers.go:79  [-] 33  pgx logger [error]: Exec logParams=map[args:[-3523718973629480045 3a] err:unexpected EOF sql:kv-2]
          | I220128 10:37:27.340941 62 workload/pgx_helpers.go:79  [-] 34  pgx logger [error]: Exec logParams=map[args:[-8232659246879096639 05] err:unexpected EOF sql:kv-2]
          | Error: unexpected EOF
          | COMMAND_PROBLEM: exit status 1
          |   10: 
          | UNCLASSIFIED_PROBLEM: context canceled
        Wraps: (4) secondary error attachment
          | COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 9. Command with error:
          |   | ``````
          |   | ./cockroach workload run kv --concurrency=32 --duration=1h
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

    monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 9: dead (exit status 10)
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (4) monitor command failure
        Wraps: (5) unexpected node event: 9: dead (exit status 10)
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #70306 roachtest: clearrange/zfs/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.2]


andreimatei commented 2 years ago

We must just not currently check for context cancellation in many places below Raft.

Is there a good reason why we check for cancellation in any places below Raft?

tbg commented 2 years ago

Is there a good reason why we check for cancellation in any places below Raft?

There may not be, but it seems brittle to pass a cancelable context into a subsystem that must not check cancellation. It's both more robust and also, in my view, more appropriate to execute state machine transitions under a context that does not inherit the wholly unrelated client cancellation.

I think we should massage

https://github.com/cockroachdb/cockroach/blob/852a80c5a2efce038182343c55aae328629ba5e0/pkg/kv/kvserver/replica_application_decoder.go#L139-L171

such that it avoids deriving from cmd.proposal.ctx (i.e. it can create a derived span, but not become a child of the proposer ctx).
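
As a minimal sketch of that idea (this is not the actual kvserver change; it just uses Go 1.21's stdlib context.WithoutCancel to illustrate "keep the values, drop the cancellation parent"):

package main

import (
    "context"
    "fmt"
)

// applicationCtx derives the context used for command application from the
// proposer's context: values (e.g. a tracing span stored as a context value)
// are kept, but the client's cancellation no longer propagates below Raft.
func applicationCtx(proposalCtx context.Context) context.Context {
    return context.WithoutCancel(proposalCtx) // requires Go 1.21+
}

func main() {
    proposalCtx, cancel := context.WithCancel(context.Background())
    applyCtx := applicationCtx(proposalCtx)
    cancel()                       // the client gives up...
    fmt.Println(proposalCtx.Err()) // context canceled
    fmt.Println(applyCtx.Err())    // <nil> -- application is unaffected
}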

tbg commented 2 years ago

Filed https://github.com/cockroachdb/cockroach/issues/75656

andreimatei commented 2 years ago

There may not be, but it seems brittle to pass a cancelable context into a subsystem that must not check cancellation

I kinda see it the other way around. The subsystem should be robust against any context passed into it. Depending on where you draw the boundary of the subsystem in question, you can say that raft application can be made robust by switching to a non-cancelable context itself. But still, if there's code that only ever runs below Raft, I think we should take out all the cancellation checks (at the very least, for clarity).

nicktrav commented 2 years ago

How are you bisecting, btw? ... It sounds as though the strategy for each bisection attempt would be to cherry-pick 71f0b34 on top

Yeah, taking this approach. There are only 8-ish steps in a full bisect, but it's a little bit of extra work to cherry-pick, etc., so it's a little slower going.

are we even sure this is "good" for any of the commits in that range?

I don't think we are, based on what came in while I was offline. That said, if I treat replica balance as the "good" signal for this bisect, I seem to be zeroing in.

nvanbenschoten commented 2 years ago

Depending on where you draw the boundary of the subsystem on question, you can say that raft application can be made robust by switching to a non-cancelable context itself.

Right, I think this is what we're saying, and what is proposed in https://github.com/cockroachdb/cockroach/issues/75656.

But still, if there's code that only ever runs below Raft, I think we should take out all the cancellation checks (at the very least, for clarity).

Trying to make this guarantee is the part that seems brittle. Even if we carefully audit and ensure that we don't perform context cancellation checks directly in Raft code, it's hard to guarantee that no lower-level logic or library that Raft code calls into will perform such checks. For instance, I broke this guarantee in https://github.com/cockroachdb/cockroach/pull/73279 while touching distant code, which Erik fixed in https://github.com/cockroachdb/cockroach/pull/73484. There are also proposals like https://github.com/golang/go/issues/20280 to add context cancellation awareness to the filesystem operations provided by the standard library. If we don't want a subsystem to respect context cancellation, it's best not to give it a cancellable context.

nvanbenschoten commented 2 years ago

I think we should massage .. such that it avoids deriving from cmd.proposal.ctx (i.e. it can create a derived span, but not become a child of the proposer ctx).

Maybe we could even remove cmd.proposal.ctx entirely. We already extract the tracing span (cmd.proposal.sp) from the context.

tbg commented 2 years ago

Let's continue discussing this on #75656; this comment thread is already pretty unwieldy.

nicktrav commented 2 years ago

I had some luck with the bisect on the replica imbalance issue.

I've narrowed it down to e12c9e65c457e035562e71621d164f04abb33728. On master with this commit included I see the following behavior on import:

Screen Shot 2022-01-28 at 9 37 24 AM

I then ran a branch with just this commit excluded and the replicas are far more balanced and the import is able to succeed:

Screen Shot 2022-01-28 at 10 47 39 AM

I don't have enough context to say whether, outside the context of just the clearrange/* tests, that commit would cause issues elsewhere. It could just be a matter of giving these test workers more headroom to allow them to complete the import, and then potentially rebalance to a more even state? cc: @dt - happy to spin up a new issue for this to avoid piling onto this ticket.

Once the import succeeds, we're into the (well documented) realm of #75656 - ztbls isn't terrible, but we have a lot of goroutines (~100 on each node) running the consistency checks, which is slowing down the test overall and preventing the disk space from being reclaimed (not an issue for the remainder of the test, as we're just deleting).

In terms of debugging this specific test failure, I think we've found the two separate issues we theorized.

tbg commented 2 years ago

Great work, and I much appreciate the extra miles you're consistently going.

I think we should close this issue, and file a separate issue about what you have found, and then link it to the new clear range failure issue once the next nightly creates it.

dt commented 2 years ago

Well, that's pretty interesting / mildly surprising, since this is an ordered-ingestion import, where we didn't expect this bulk-sent split size to do much other than aggravate the merge queue an hour later.

nicktrav commented 2 years ago

I think we should close this issue, and file a separate issue about what you have found, and then link it to the new clear range failure issue once the next nightly creates it.

Ack. Will do. @dt - I'll move discussion over there 👍

If I close this, I think it will just re-open when it fails again (at least until the replica imbalance issue is addressed). I'll just make it known to the rest of the Storage folks that we can probably safely leave this one alone.

Thanks all.