Closed: cockroach-teamcity, closed 5 years ago
Ugh, hoped we were done with this sort of thing for a while. A number of ranges simply seem to "freeze up". The SHA notably includes #31689.
181023 10:11:08.714477 63680 storage/replica.go:3885 [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] proposing REMOVE_REPLICA((n9,s4):4): updated=[(n23,s23):1 (n16,s16):2 (n25,s25):3 (n5,s15):5] next=6
181023 10:11:09.242351 100766 storage/store.go:2580 [n9,replicaGC,s4,r2159/4:/Table/53/1/369{11440…-38215…}] removing replica r2159/4
181023 10:11:09.319611 100766 storage/replica.go:863 [n9,replicaGC,s4,r2159/4:/Table/53/1/369{11440…-38215…}] removed 267187 (267179+8) keys in 73ms [clear=1ms commit=72ms]
181023 10:12:09.079428 63831 storage/replica.go:3346 [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s for proposing command GC [/Table/53/1/369114403/4,/Table/53/1/369114403/4/NULL)
181023 10:12:30.130245 66387 storage/replica_command.go:816 [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] change replicas (REMOVE_REPLICA (n25,s25):3): read existing descriptor r2159:/Table/53/1/369{114403/4-382151/3} [(n23,s23):1, (n16,s16):2, (n25,s25):3, (n5,s15):5, next=6, gen=2]
181023 10:13:30.130222 66387 storage/queue.go:791 [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] change replicas of r2159 failed: aborted while in command queue: context deadline exceeded
181023 10:14:32.058314 67379 storage/queue.go:791 [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}]: unable to transfer lease to s25: context deadline exceeded
181023 10:15:13.602238 73950 storage/store.go:2580 [n2,replicaGC,s2,r2159/?:/Table/53/1/3{691144…-714161…}] removing replica r2159/0
181023 10:15:13.638081 73950 storage/replica.go:863 [n2,replicaGC,s2,r2159/?:/Table/53/1/3{691144…-714161…}] removed 8 (0+8) keys in 35ms [clear=0ms commit=35ms]
181023 10:16:18.923507 68830 storage/replica.go:1546 [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s attempting to acquire lease
However, this seems to be happening across the board, for multiple tests at once:
$ rg -a --no-line-number --no-filename --no-heading 'have been waiting' *.logs | cut -c 2- | sort
181023 10:12:09.079428 63831 storage/replica.go:3346 [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s for proposing command GC [/Table/53/1/369114403/4,/Table/53/1/369114403/4/NULL)
181023 10:12:35.476713 59095 storage/replica.go:3346 [n6,s12,r900/3:/Table/53/{1/57795…-2/15679…}] have been waiting 1m0s for proposing command [txn: fae558a9], ConditionalPut [/Local/Range/Table/53/1/578185447/5/RangeDescriptor,/Min)
181023 10:12:54.974354 62435 storage/replica.go:3346 [n4,s7,r2605/1:/Table/53/1/43{19763…-22439…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/431976384/6,/Min)
181023 10:12:59.372869 61940 storage/replica.go:3346 [n27,s27,r450/3:/Table/53/1/502{41091…-67850…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/502410912/5,/Min)
181023 10:13:01.338531 64853 storage/replica.go:3346 [n32,s32,r2528/1:/Table/53/1/511{47459…-74131…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/511474593/1,/Min)
181023 10:13:19.530750 60820 storage/replica.go:3346 [n6,s12,r2861/3:/Table/53/1/441{44128…-70851…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/441441284/2,/Min)
181023 10:13:37.079546 154081 storage/replica.go:3346 [n7,s10,r2630/4:/Table/53/1/5{699299…-701968…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/569929990/8,/Min)
[...]
I guess the only way to really make progress on this is a live repro. Going to kick off a run (with fewer nodes).
At least this repro'ed right away:
goroutine 43709 [select, 13 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3322 +0x961
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).executeWriteBatch(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3139 +0xab
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x6e5, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2017 +0x711
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc4256b6900, 0x3170500, 0xc42a762ae0, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1965 +0x90
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc42044d180, 0x3170500, 0xc42a762ae0, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3022 +0x60c
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc4208154a0, 0x3170500, 0xc42a762a80, 0x0, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:185 +0xdb
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x3170500, 0xc42a762a80, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:988 +0x1c1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc4207cef30, 0x3170500, 0xc42a762a80, 0x2c15619, 0x10, 0xc427e96288, 0x0, 0x0)
(among others)
Got the RangeID from the stack trace (0x6e5 = 1765)
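(A trivial sketch of that conversion, for anyone following along; the RangeID shows up as the hex argument in the sendWithRangeID frame above.)

package main

import (
	"fmt"
	"strconv"
)

func main() {
	// The sendWithRangeID frame carries the RangeID as hex (0x6e5).
	id, err := strconv.ParseInt("6e5", 16, 64)
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // prints 1765
}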
http://35.231.122.120:26258/debug/logspy?count=10000&duration=10s&grep=r1765/ https://gist.github.com/tschottdorf/51775b0745e78e7e4bb9dd6ae35208a5
Interestingly this had resolved itself when I just looked again.
From the logspy above, I can see that it talked to replicaIDs 1, 3, 5 (there aren't any responses in the 10s window)
I181023 14:56:41.442261 125 storage/raft.go:134 [n4,s2,r1765/2:/Table/53/1/406{58851…-85561…}] raft ready (must-sync=true)
New Entry[0]: 7/403 EntryNormal [59ca2f1fd9f654d1] [104]
Outgoing Message[0]: 2->3 MsgHeartbeat Term:7 Log:0/0 Commit:26
Outgoing Message[1]: 2->5 MsgHeartbeat Term:7 Log:0/0
Outgoing Message[2]: 2->1 MsgHeartbeat Term:7 Log:0/0
Outgoing Message[3]: 2->3 MsgApp Term:7 Log:7/402 Commit:26 Entries:[7/403 EntryNormal [59ca2f1fd9f654d1] [104]]
This is interesting because that makes four members of the range in total.
The range status page shows that replicaIDs 1, 2, and 5 are part of the range now, so 3 is gone. But I'm having trouble piecing this together. We only send the MsgApp to 3; why not to the other two? We're also only sending a Commit index to 3 in the heartbeat. Does Raft perhaps think that 1 and 5 are down? But why isn't there an MsgAppResp from 3 at least?
cc @bdarnell in case you see something I don't.
SHA: https://github.com/cockroachdb/cockroach/commits/de005ed7e08f4edd3b95b9388596e149058d13b8
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=982526&tab=buildLog
The test failed on release-2.1:
test.go:1037: test timed out (3h0m0s)
test.go:639,cluster.go:1453,import.go:122: context canceled
^- stuck raft proposals until the bitter end.
I kicked off another manual run (with merges disabled) and that wedged too. Note that we somehow have four members in our replica set and two of them seem completely uninitialized, blocking forward progress.
The Raft snapshot queue is absolutely backed up, and so are some of the other queues. My best guess as to why this is happening so suddenly is that the test is done with the "import" part and begins the restore part (where it actually splits and sends SSTs around). In the situation above, where two replicas need a snapshot, it would make sense for the range to just fall over. And it also makes sense that once you have thousands of pending Raft snapshots, there is big trouble.
Logspying on one of the nodes for ,raftsnapshot, shows that everyone wants a snapshot, but it takes ~40s to get even a single one out:
I181024 13:41:35.836885 161068 server/debug/logspy.go:182 intercepting logs with options {Count:10000 Duration:1m0s Grep:,raftsnapshot,}
I181024 13:41:41.839521 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r1151/2:/Table/53/1/305{66937…-93632…}] raft snapshot needed, enqueuing
I181024 13:41:45.345314 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r1931/2:/Table/53/1/565{42704…-69402…}] raft snapshot needed, enqueuing
I181024 13:41:46.222135 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r2204/2:/Table/53/6/8{342/11…-622/21…}] raft snapshot needed, enqueuing
I181024 13:41:50.603532 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r1949/2:/Table/53/1/53{48500…-51035…}] raft snapshot needed, enqueuing
I181024 13:41:58.765126 167166 storage/store_snapshot.go:636 [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] sending Raft snapshot 56cdd770 at applied index 25
I181024 13:42:01.120067 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r2226/2:/Table/53/6/937{0/349…-7/171…}] raft snapshot needed, enqueuing
I181024 13:42:03.176108 167166 storage/store_snapshot.go:679 [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] streamed snapshot to (n2,s2):1: kv pairs: 267151, log entries: 2, rate-limit: 8.0 MiB/sec, 39747ms
I181024 13:42:03.499741 167166 storage/queue.go:751 [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] done
I181024 13:42:03.499774 167166 storage/queue.go:667 [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] done 40.071031219s
I181024 13:42:03.499835 167499 storage/queue.go:707 [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] processing replica
I181024 13:42:03.499871 167499 storage/queue.go:745 [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] processing
I181024 13:42:03.499879 167499 storage/raft_snapshot_queue.go:94 [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] sending raft snapshot
I181024 13:42:09.005845 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r2216/2:/Table/53/6/87{09/33…-17/29…}] raft snapshot needed, enqueuing
I181024 13:42:09.882265 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r2213/2:/Table/53/6/86{85/47…-93/42…}] raft snapshot needed, enqueuing
I181024 13:42:11.638441 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r1157/2:/Table/53/1/32{08396…-11064…}] raft snapshot needed, enqueuing
I181024 13:42:14.272795 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r2127/2:/Table/53/1/56{59614…-61666…}] raft snapshot needed, enqueuing
I181024 13:42:16.901743 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r1428/2:/Table/53/1/443{68460…-95187…}] raft snapshot needed, enqueuing
I181024 13:42:19.531425 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r501/2:/Table/53/1/8{686486…-713126…}] raft snapshot needed, enqueuing
I181024 13:42:20.409178 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r1814/2:/Table/53/1/377{30771…-57437…}] raft snapshot needed, enqueuing
I181024 13:42:21.285828 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r1232/2:/Table/53/1/307{45625…-72234…}] raft snapshot needed, enqueuing
I181024 13:42:22.162619 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r873/2:/Table/53/1/19{08200…-10884…}] raft snapshot needed, enqueuing
I181024 13:42:26.543767 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r2231/2:/Table/53/6/99{58/13…-66/94…}] raft snapshot needed, enqueuing
I181024 13:42:29.172478 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r498/2:/Table/53/1/88{46665…-73388…}] raft snapshot needed, enqueuing
I181024 13:42:35.304600 185 storage/raft_snapshot_queue.go:76 [n4,raftsnapshot,s7,r1457/2:/Table/53/1/22{59905…-62578…}] raft snapshot needed, enqueuing
I181024 13:42:35.459626 167499 storage/store_snapshot.go:636 [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] sending Raft snapshot d9244af9 at applied index 20
The delay between "sending raft snapshot" and "sending Raft snapshot d9244af9 at applied index 20" of ~32s is suspicious. What's taking so long? This is a) creating the snapshot (but that's just opening an iterator) and b) exchanging metadata with the target store. It's not actually transferring any data yet. We see that the actual transfer is more like 5s. The 39747ms log line indicates that we do spend most of that time in sendSnapshot, so my assumption is that we spend lots of time in:
if err := stream.Send(&SnapshotRequest{Header: &header}); err != nil {
	return err
}
Ah, of course, this will block on the receiver's snapshot apply semaphore.
Presumably all nodes are sending snapshots to all other nodes, so everything's completely clogged.
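To make the clogging mechanism concrete, here's a minimal, self-contained sketch (not the actual CockroachDB code) of a receiver-side apply semaphore: only a bounded number of snapshots are admitted at once, so every additional sender stalls at the header exchange until a slot frees up.

package main

import (
	"context"
	"fmt"
	"time"
)

// snapshotReceiver is a toy stand-in for a store receiving Raft snapshots.
type snapshotReceiver struct {
	applySem chan struct{} // capacity = max concurrent snapshot applications
}

func (r *snapshotReceiver) handleSnapshot(ctx context.Context, rangeID int) error {
	// Reserve an apply slot before acknowledging the sender's header. While
	// the semaphore is full, the sender is stuck in its header exchange.
	select {
	case r.applySem <- struct{}{}:
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-r.applySem }()

	time.Sleep(2 * time.Second) // stand-in for streaming and applying the data
	fmt.Printf("applied snapshot for r%d\n", rangeID)
	return nil
}

func main() {
	recv := &snapshotReceiver{applySem: make(chan struct{}, 1)}
	for i := 1; i <= 3; i++ {
		go recv.handleSnapshot(context.Background(), i) // senders 2 and 3 queue up
	}
	time.Sleep(7 * time.Second)
}

With the semaphore capacity at one, the three snapshots complete serially, roughly 2s apart; scaled up to thousands of ranges with every node both sending and receiving, that serialization is enough to produce the backlog above.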
Ok, it seems that one very fundamental piece of badness here is the question of why all these snapshots are deemed necessary in the first place.
I'm going to make it a priority to look into the related truncation logic discussion in https://github.com/cockroachdb/cockroach/issues/31409#issuecomment-430193363 to see if we're doing stupid things when AddSSTable is involved (or generally).
cc @andreimatei
I'd start by looking at the replica GC queue. We send preemptive snapshots before adding replicas; this leaves a brief window between the preemptive snapshot and the finalization of the replica change during which the replica can be GC'd and require an immediate raft snapshot. Previously, we relied on the fact that the replica GC queue was not responsive enough to make this scenario likely. Maybe now the replica GC is happening faster and is more likely to catch replicas during this vulnerable window.
SHA: https://github.com/cockroachdb/cockroach/commits/0dba537ae88e495ddf29b4c347b4c30ee99bd046
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=983962&tab=buildLog
The test failed on release-2.1:
asm_amd64.s:574,panic.go:502,panic.go:63,signal_unix.go:388,log.go:172,log.go:216,cluster.go:221,cluster.go:695: runtime error: invalid memory address or nil pointer dereference
I'm starting to suspect this change: https://github.com/cockroachdb/cockroach/commit/e436e56818e7687fde7846759eeebcb0bdc78833
I had two successful runs after "fixing" (or rather, disarming) the Raft truncation bugs, but good point about the GC queue. I'll keep an eye out.
Our goal really ought to be to get these import tests passing with zero (or barely any) Raft snapshots.
SHA: https://github.com/cockroachdb/cockroach/commits/a4950e75f6592b8cbea217cf1392d8ef8961130f
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=985074&tab=buildLog
The test failed on master:
test.go:639,test.go:651: /home/agent/work/.go/bin/roachprod create teamcity-985074-import-tpch-nodes-32 -n 32 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
stderr:
stdout:
Creating cluster teamcity-985074-import-tpch-nodes-32 with 32 nodes
Unable to create cluster:
in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181004 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script786808859 --project cockroach-ephemeral]
Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Quota 'CPUS' exceeded. Limit: 1200.0 in region us-central1.
: exit status 1
Cleaning up...
: exit status 1
The AddSSTable requests are ~32 MB each (https://github.com/cockroachdb/cockroach/blob/7ccc197b2751098c9945d6b46da6238ae5a195a4/pkg/ccl/storageccl/import.go#L34-L38), as we expected. I think with the log size tracking fixed, the behavior here is borderline ok.
What I don't understand however is why things get so bad with the bug. When the next AddSSTable comes in, the Ready that would append to the log would usually "atomically" send out these new entries to followers. So for a truncation to actually require a snapshot, we need a follower to not be caught up on the previous entry (because then Raft can only send out the message in a later Ready). But when the quota pool lets the second AddSSTable proceed into the Raft log, didn't we make sure that all followers are caught up?
I might be missing something or the behavior is more subtle (for example, it might be worse when ranges are newly created, or when there are additional small proposals like lease requests involved).
I'm going to see if there's something here that I can reproduce with a bit of hacking on a local 3x cluster and kv with a large payload.
I'm running a local experiment via roachprod with three nodes, where the third node has a 100ms sleep in applyRaftCommand, and the workload is:
./cockroach sql --insecure -e 'alter range default configure zone using range_max_bytes = 1048471142400;' && ./bin/workload run kv --cycle-length 1 --concurrency 4 --drop --init --max-ops 100 --min-block-bytes $((1024)) --max-block-bytes $((1 << 20)) --sequential
I additionally have a diff applied that makes the tracked size of the Raft log always 8 MB, i.e. well above the 4 MB threshold that triggers an aggressive truncation.
I'm watching for Raft snapshots by tailing the logs and by watching crdb_internal.gossip_alerts.
I seem to be able to wedge the cluster in a stuck state:
The rightmost replica isn't catching up over minutes. It's clearly in need of a Raft snapshot, but this snapshot doesn't seem to receive the appropriate priority. Instead, snapshots intersecting that replica are tried in an endless loop and, of course, fail. The slow replica is sent for GC repeatedly, but of course it can't be GC'ed -- it's still in the range, but needs to catch up on the split trigger.
I181025 11:15:55.969305 6632 storage/replica_gc_queue.go:222 [n3,replicaGC,s3,r31/3:/{Table/73-Max}] not gc'able, replica is still in range descriptor: (n3,s3):3
This situation resolved itself after a few minutes, but imagine this happening with thousands of ranges at the same time.
Looking at the code that adds replicas into the raft snapshot queue, I see that the same priority (zero) is used for all.
This suggests that we could use additional smarts. For example, instead of sending a replica that's blocking a snapshot to the GC queue, we should (also?) offer it to the Raft snapshot queue, ideally at a priority that's higher if we suspect that the snapshot will shrink the replica.
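As a purely hypothetical illustration of what such a priority function might look like (none of this exists in the code today; the inputs and weights are made up):

package main

import "fmt"

// raftSnapshotPriority is a hypothetical priority function for the Raft
// snapshot queue; today every replica is enqueued at priority zero.
func raftSnapshotPriority(blocksOtherSnapshots bool, entriesBehind uint64, willShrinkReplica bool) float64 {
	priority := float64(entriesBehind) // prefer followers that are further behind
	if blocksOtherSnapshots {
		priority += 1e6 // a replica holding up other snapshots jumps the line
	}
	if willShrinkReplica {
		priority += 1e3 // e.g. catching up on a split trigger frees up space
	}
	return priority
}

func main() {
	// A replica that blocks other snapshots wins over one that merely lags.
	fmt.Printf("%.0f\n", raftSnapshotPriority(true, 40, false))
	fmt.Printf("%.0f\n", raftSnapshotPriority(false, 500, true))
}

The exact weights don't matter; the point is that a replica whose snapshot unblocks other work should sort ahead of the rest of the queue.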
I can also repro a stats inconsistency (running with COCKROACH_FATAL_ON_STATS_MISMATCH=true):
F181025 11:39:16.531669 2071 storage/replica_consistency.go:119 [n3,consistencyChecker,s3,r32/2:/{Table/55-Max}] found a delta of {false 0 0 315173140 -280215 0 36 0 16307809 3 0 0 0 0}
goroutine 2071 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000056301, 0xc000056300, 0x835e100, 0x1e)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:997 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x8a99f60, 0xc000000004, 0x835e142, 0x1e, 0x77, 0xc00087c800, 0x79)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:866 +0x8e8
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6e5f6a0, 0xc005f7b6e0, 0x4, 0x2, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6e5f6a0, 0xc005f7b6e0, 0x1, 0x4, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x6e5f6a0, 0xc005f7b6e0, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:172 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc0008d5c00, 0x6e5f6a0, 0xc005f7b6e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:119 +0x6e6
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc000364260, 0x6e5f6a0, 0xc005f7b6e0, 0xc0008d5c00, 0x0, 0x0, 0x300000003)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:117 +0x1d0
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processReplica(0xc0001fcb00, 0x6e5f6e0, 0xc0067923f0, 0xc0008d5c00, 0x0, 0x0)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:747 +0x339
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processLoop.func1.2(0x6e5f6e0, 0xc0067923f0)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:626 +0xb8
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0001b1e60, 0x6e5f6e0, 0xc0067923f0, 0xc005622500, 0x36, 0x0, 0x0, 0xc004f3b3e0)
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:324 +0xe6
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:319 +0x134
Filed https://github.com/cockroachdb/cockroach/issues/31870 about the stats inconsistency (repros on plain master).
Ok, much simpler:
./bin/workload run kv --splits 1000 --init --drop --max-ops 1
With the truncation hack that simulates overly aggressive truncations (and the delay on n3), that gives me approximately 700 successful Raft snapshots and ~1100 failures. Ouch. With aggressive truncation and e436e56818e7687fde7846759eeebcb0bdc78833 reverted, it still gets ~700 successful snapshots and ~700 failures.
If I remove my hacks (i.e. run master without aggressive truncations or delay), I get typically <300 Raft snapshot successes and very few failures.
I'm just going to go ahead and fix the truncation problems to see what sticks after.
Experimenting some more, I see that (on master with merges turned off) Raft snapshots still occur even if I disable both the replicaGCQueue and the Raft log queue. I think this is due to the race in which the RHS of the split gets created via an incoming message from another replica that processed the split faster.
I tried to "hack" this problem away by disabling the Raft snapshot queue, but this leads to badness. I assume this is because once Raft asks for a snapshot, you have to give it one. You can't just forget about that request. Not 100% sure this is what's happening, but it would make sense.
I was able to hack this problem away by simply reporting all MsgSnap back to the raft group as completed. I opened my repro for this with a comment as a PR so that I don't lose track of it: https://github.com/cockroachdb/cockroach/pull/31875
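For context, the hack essentially amounts to acknowledging every outgoing MsgSnap to the Raft group instead of actually sending a snapshot. Roughly, as a sketch against the etcd/raft RawNode API (not the actual patch, and the import path is whatever vendored copy was current at the time):

package snaphack

import (
	"go.etcd.io/etcd/raft"        // import path assumed; adjust to the vendored copy
	"go.etcd.io/etcd/raft/raftpb"
)

// dropSnapshotButUnblockRaft sketches the hack: rather than fulfilling the
// snapshot request, report it back to the group as finished so the leader
// resumes probing the follower instead of waiting forever on a snapshot
// that never arrives.
func dropSnapshotButUnblockRaft(rn *raft.RawNode, msg raftpb.Message) {
	if msg.Type == raftpb.MsgSnap {
		rn.ReportSnapshot(msg.To, raft.SnapshotFinish)
	}
}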
Attempting to productionize @petermattis' patch https://github.com/cockroachdb/cockroach/compare/master...petermattis:pmattis/tpcc-9x64?expand=1, I went down a little rabbit hole wondering why my test saw unexpected Raft log sizes. Turns out we're mixing something up in log truncation. A TruncateRequest has a field Index, which is supposedly the first index of the resulting log (i.e. we delete only strictly smaller indexes). But it turns out that our code actually includes Index itself in the deletion. That sounds pretty bad, but I also noticed that the Index we compute is usually (always?) one below what we could truncate (i.e. if all followers have index 100 in our logs, we're going to send Index: 100 even though we could delete 100 itself by sending Index: 101).
Digging with @nvanbenschoten.
Looks mostly like a stats issue. We're subtracting a little more (one entry's worth) from the log size than we should. We generate the correct TruncatedState (in which the Index is the last deleted one, not the first undeleted one) by populating it with args.Index - 1. The actual deletion happens downstream of Raft and uses the correct range. 😓
edit: PR here https://github.com/cockroachdb/cockroach/pull/31885/files
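Spelled out as a worked sketch (hypothetical types, not the real data structures): with every follower at index 100, the queue sends Index: 100, the TruncatedState records 99 as the last truncated entry, and entry 100 survives; the stats bug is that the size accounting nonetheless charged for one more entry than was actually deleted.

package main

import "fmt"

// truncatedState mimics the semantics described above: Index is the last
// truncated (deleted) log index, not the first surviving one.
type truncatedState struct {
	Index uint64
}

// applyTruncate takes TruncateRequest.Index, i.e. the first index of the
// resulting log: entries strictly below it are deleted, Index itself is kept.
func applyTruncate(firstUndeleted uint64) truncatedState {
	// Correct: record firstUndeleted-1 as the last deleted entry; the actual
	// deletion downstream of Raft removes [.., firstUndeleted).
	return truncatedState{Index: firstUndeleted - 1}
}

func main() {
	// All followers are at index 100, so the queue sends Index: 100.
	ts := applyTruncate(100)
	fmt.Println(ts.Index) // 99: entry 100 is kept, though Index: 101 would also be safe
	// The stats bug: the size accounting subtracted the size of entries
	// [.., 100] (one entry too many) instead of [.., 100).
}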
Oh wow, are you ready for another small dumpster fire? TruncateTo for the sideloaded storage was never updated to deal with the fact that we're doing all of our writes through the RocksDB env. It seems that putting to and getting from the sideloaded storage does use the env, so that works, but we're likely not cleaning these things up properly. Or maybe we are, but definitely not if you're using an in-mem engine as we do in tests.
SHA: https://github.com/cockroachdb/cockroach/commits/fa7fb35ebd4ed4a2a4d0c1692cb1560654b73ea6
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=987340&tab=buildLog
The test failed on master:
test.go:1037: test timed out (3h0m0s)
test.go:639,cluster.go:1453,import.go:122: context canceled
SHA: https://github.com/cockroachdb/cockroach/commits/5ef4d2c8621fc5465f73a96221b0bd0bc5cd27aa
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990073&tab=buildLog
The test failed on master:
test.go:1037: test timed out (3h0m0s)
test.go:639,cluster.go:1453,import.go:122: context canceled
SHA: https://github.com/cockroachdb/cockroach/commits/441d7134902e6bcc6a5abdff6b168003f19ce592
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990998&tab=buildLog
The test failed on master:
test.go:1037: test timed out (3h0m0s)
test.go:639,cluster.go:1453,import.go:122: context canceled
SHA: https://github.com/cockroachdb/cockroach/commits/109cf8705b773c0d3a1e7ab02ce63f764e101106
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=991714&tab=buildLog
The test failed on master:
test.go:1037: test timed out (3h0m0s)
test.go:639,cluster.go:1453,import.go:122: context canceled
SHA: https://github.com/cockroachdb/cockroach/commits/bbc646fc6de90b59c0253fd682667715959fb657
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog
The test failed on master:
test.go:1037: test timed out (3h0m0s)
test.go:639,cluster.go:1453,import.go:122: context canceled
^- this last run has #31988.
Stuck Raft requests all the way through.
SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog
The test failed on master:
test.go:1037: test timed out (3h0m0s)
test.go:639,cluster.go:1461,import.go:122: context canceled
SHA: https://github.com/cockroachdb/cockroach/commits/c196a52e8ec61851e32a062c13135bbb4111eb96
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=997430&tab=buildLog
The test failed on master:
test.go:1037: test timed out (3h0m0s)
test.go:639,cluster.go:1461,import.go:122: context canceled
Based on the logs around Nov 1st, I think this was fixed by #32053. (There's likely still a large buildup of snapshots after that, but it gets processed fast enough for the test to pass.) The last run doesn't have a buildup of snapshots at all, likely due to #32594.
SHA: https://github.com/cockroachdb/cockroach/commits/d3f39a5cb60eb9cea64635f78e25f67a435057a5
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=980026&tab=buildLog