cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: import/tpch/nodes=32 failed #31732

Closed: cockroach-teamcity closed this issue 5 years ago

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/d3f39a5cb60eb9cea64635f78e25f67a435057a5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=980026&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1453,import.go:122: context canceled
tbg commented 6 years ago

Ugh, hoped we were done with this sort of thing for a while. A number of ranges simply seem to "freeze up". The SHA notably has #31689.

181023 10:11:08.714477 63680 storage/replica.go:3885  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] proposing REMOVE_REPLICA((n9,s4):4): updated=[(n23,s23):1 (n16,s16):2 (n25,s25):3 (n5,s15):5] next=6
181023 10:11:09.242351 100766 storage/store.go:2580  [n9,replicaGC,s4,r2159/4:/Table/53/1/369{11440…-38215…}] removing replica r2159/4
181023 10:11:09.319611 100766 storage/replica.go:863  [n9,replicaGC,s4,r2159/4:/Table/53/1/369{11440…-38215…}] removed 267187 (267179+8) keys in 73ms [clear=1ms commit=72ms]
181023 10:12:09.079428 63831 storage/replica.go:3346  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s for proposing command GC [/Table/53/1/369114403/4,/Table/53/1/369114403/4/NULL)
181023 10:12:30.130245 66387 storage/replica_command.go:816  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] change replicas (REMOVE_REPLICA (n25,s25):3): read existing descriptor r2159:/Table/53/1/369{114403/4-382151/3} [(n23,s23):1, (n16,s16):2, (n25,s25):3, (n5,s15):5, next=6, gen=2]
181023 10:13:30.130222 66387 storage/queue.go:791  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] change replicas of r2159 failed: aborted while in command queue: context deadline exceeded
181023 10:14:32.058314 67379 storage/queue.go:791  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}]: unable to transfer lease to s25: context deadline exceeded
181023 10:15:13.602238 73950 storage/store.go:2580  [n2,replicaGC,s2,r2159/?:/Table/53/1/3{691144…-714161…}] removing replica r2159/0
181023 10:15:13.638081 73950 storage/replica.go:863  [n2,replicaGC,s2,r2159/?:/Table/53/1/3{691144…-714161…}] removed 8 (0+8) keys in 35ms [clear=0ms commit=35ms]
181023 10:16:18.923507 68830 storage/replica.go:1546  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s attempting to acquire lease
tbg commented 6 years ago

However, this seems to be happening across the board, for multiple ranges at once:

$ rg -a --no-line-number --no-filename --no-heading 'have been waiting' *.logs | cut -c 2- | sort
181023 10:12:09.079428 63831 storage/replica.go:3346  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s for proposing command GC [/Table/53/1/369114403/4,/Table/53/1/369114403/4/NULL)
181023 10:12:35.476713 59095 storage/replica.go:3346  [n6,s12,r900/3:/Table/53/{1/57795…-2/15679…}] have been waiting 1m0s for proposing command [txn: fae558a9], ConditionalPut [/Local/Range/Table/53/1/578185447/5/RangeDescriptor,/Min)
181023 10:12:54.974354 62435 storage/replica.go:3346  [n4,s7,r2605/1:/Table/53/1/43{19763…-22439…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/431976384/6,/Min)
181023 10:12:59.372869 61940 storage/replica.go:3346  [n27,s27,r450/3:/Table/53/1/502{41091…-67850…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/502410912/5,/Min)
181023 10:13:01.338531 64853 storage/replica.go:3346  [n32,s32,r2528/1:/Table/53/1/511{47459…-74131…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/511474593/1,/Min)
181023 10:13:19.530750 60820 storage/replica.go:3346  [n6,s12,r2861/3:/Table/53/1/441{44128…-70851…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/441441284/2,/Min)
181023 10:13:37.079546 154081 storage/replica.go:3346  [n7,s10,r2630/4:/Table/53/1/5{699299…-701968…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/569929990/8,/Min)
[...]

I guess the only way to really make progress on this is a live repro. Going to kick off a run (with fewer nodes).

tbg commented 6 years ago

At least this repro'ed right away:

goroutine 43709 [select, 13 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3322 +0x961
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).executeWriteBatch(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3139 +0xab
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x6e5, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2017 +0x711
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc4256b6900, 0x3170500, 0xc42a762ae0, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1965 +0x90
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc42044d180, 0x3170500, 0xc42a762ae0, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3022 +0x60c
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc4208154a0, 0x3170500, 0xc42a762a80, 0x0, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:185 +0xdb
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x3170500, 0xc42a762a80, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:988 +0x1c1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc4207cef30, 0x3170500, 0xc42a762a80, 0x2c15619, 0x10, 0xc427e96288, 0x0, 0x0)

(among others)

tbg commented 6 years ago

Got the RangeID from the stack trace (0x6e5 = 1765)

http://35.231.122.120:26258/debug/logspy?count=10000&duration=10s&grep=r1765/ https://gist.github.com/tschottdorf/51775b0745e78e7e4bb9dd6ae35208a5

tbg commented 6 years ago

Interestingly this had resolved itself when I just looked again.

From the logspy above, I can see that it talked to replicaIDs 1, 3, 5 (there aren't any responses in the 10s window)

I181023 14:56:41.442261 125 storage/raft.go:134  [n4,s2,r1765/2:/Table/53/1/406{58851…-85561…}] raft ready (must-sync=true)
  New Entry[0]: 7/403 EntryNormal [59ca2f1fd9f654d1] [104]
  Outgoing Message[0]: 2->3 MsgHeartbeat Term:7 Log:0/0 Commit:26
  Outgoing Message[1]: 2->5 MsgHeartbeat Term:7 Log:0/0
  Outgoing Message[2]: 2->1 MsgHeartbeat Term:7 Log:0/0
  Outgoing Message[3]: 2->3 MsgApp Term:7 Log:7/402 Commit:26 Entries:[7/403 EntryNormal [59ca2f1fd9f654d1] [104]]

This is interesting because that makes four members of the range in total.

The range status page shows that replicaIDs 1, 2, 5 are part of the range now, so 3 is gone. But I'm having trouble piecing this together. We only send the MsgApp to 3; why not to the other two? We're also only sending a Commit index to 3 in the heartbeat. Does Raft perhaps think that 1 and 5 are down? But why isn't there an MsgAppResp from 3 at least?

[screenshot: range status page]

tbg commented 6 years ago

cc @bdarnell in case you see something I don't.

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/de005ed7e08f4edd3b95b9388596e149058d13b8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=982526&tab=buildLog

The test failed on release-2.1:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1453,import.go:122: context canceled
tbg commented 6 years ago

^- stuck raft proposals until the bitter end.

tbg commented 6 years ago

I kicked off another manual run (with merges disabled) and that wedged too. Note that we somehow have four members in our replica set and two of them seem completely uninitialized, blocking forward progress.

[screenshot: range status showing four members, two of them uninitialized]

tbg commented 6 years ago

The Raft snapshot queue is absolutely backed up:

[screenshot: Raft snapshot queue metrics]

and so are some of these other queues. My best guess as to why this is happening so suddenly is that the test is done with the "import" part and begins a restore part (where it actually splits and sends SSTs around). In the situation above, where two replicas need a snapshot, it would make sense that the range would just fall over. And it also makes sense that once you have thousands of pending Raft snapshots, there is big trouble.

tbg commented 6 years ago

logspying on one of the nodes for ,raftsnapshot, shows that everyone wants a snapshot, but it takes ~40s to get even a single one out.

I181024 13:41:35.836885 161068 server/debug/logspy.go:182  intercepting logs with options {Count:10000 Duration:1m0s Grep:,raftsnapshot,}
I181024 13:41:41.839521 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1151/2:/Table/53/1/305{66937…-93632…}] raft snapshot needed, enqueuing
I181024 13:41:45.345314 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1931/2:/Table/53/1/565{42704…-69402…}] raft snapshot needed, enqueuing
I181024 13:41:46.222135 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2204/2:/Table/53/6/8{342/11…-622/21…}] raft snapshot needed, enqueuing
I181024 13:41:50.603532 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1949/2:/Table/53/1/53{48500…-51035…}] raft snapshot needed, enqueuing
I181024 13:41:58.765126 167166 storage/store_snapshot.go:636  [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] sending Raft snapshot 56cdd770 at applied index 25
I181024 13:42:01.120067 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2226/2:/Table/53/6/937{0/349…-7/171…}] raft snapshot needed, enqueuing
I181024 13:42:03.176108 167166 storage/store_snapshot.go:679  [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] streamed snapshot to (n2,s2):1: kv pairs: 267151, log entries: 2, rate-limit: 8.0 MiB/sec, 39747ms
I181024 13:42:03.499741 167166 storage/queue.go:751  [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] done
I181024 13:42:03.499774 167166 storage/queue.go:667  [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] done 40.071031219s
I181024 13:42:03.499835 167499 storage/queue.go:707  [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] processing replica
I181024 13:42:03.499871 167499 storage/queue.go:745  [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] processing
I181024 13:42:03.499879 167499 storage/raft_snapshot_queue.go:94  [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] sending raft snapshot
I181024 13:42:09.005845 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2216/2:/Table/53/6/87{09/33…-17/29…}] raft snapshot needed, enqueuing
I181024 13:42:09.882265 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2213/2:/Table/53/6/86{85/47…-93/42…}] raft snapshot needed, enqueuing
I181024 13:42:11.638441 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1157/2:/Table/53/1/32{08396…-11064…}] raft snapshot needed, enqueuing
I181024 13:42:14.272795 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2127/2:/Table/53/1/56{59614…-61666…}] raft snapshot needed, enqueuing
I181024 13:42:16.901743 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1428/2:/Table/53/1/443{68460…-95187…}] raft snapshot needed, enqueuing
I181024 13:42:19.531425 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r501/2:/Table/53/1/8{686486…-713126…}] raft snapshot needed, enqueuing
I181024 13:42:20.409178 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1814/2:/Table/53/1/377{30771…-57437…}] raft snapshot needed, enqueuing
I181024 13:42:21.285828 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1232/2:/Table/53/1/307{45625…-72234…}] raft snapshot needed, enqueuing
I181024 13:42:22.162619 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r873/2:/Table/53/1/19{08200…-10884…}] raft snapshot needed, enqueuing
I181024 13:42:26.543767 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2231/2:/Table/53/6/99{58/13…-66/94…}] raft snapshot needed, enqueuing
I181024 13:42:29.172478 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r498/2:/Table/53/1/88{46665…-73388…}] raft snapshot needed, enqueuing
I181024 13:42:35.304600 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1457/2:/Table/53/1/22{59905…-62578…}] raft snapshot needed, enqueuing
I181024 13:42:35.459626 167499 storage/store_snapshot.go:636  [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] sending Raft snapshot d9244af9 at applied index 20
tbg commented 6 years ago

The delay of ~32s between "sending raft snapshot" and "sending Raft snapshot d9244af9 at applied index 20" is suspicious. What's taking so long? This is a) creating the snapshot (but that's just opening an iterator) and b) exchanging metadata with the target store. It's not actually transferring any data yet. We see that the actual transfer is more like 5s. The 39747ms log line indicates that we do spend most of that time in sendSnapshot, so my assumption is that we spend lots of time in

    if err := stream.Send(&SnapshotRequest{Header: &header}); err != nil {
        return err
    }

Ah, of course, this will block on the receiver's snapshot apply semaphore:

https://github.com/cockroachdb/cockroach/blob/5a373445c0674f060a4bfe369ad290a0cacccb6c/pkg/storage/store_snapshot.go#L366

Presumably all nodes are sending snapshots to all other nodes, so everything's completely clogged.
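
To make that concrete, here's a toy Go sketch (not the actual store code; applySem and sendSnapshot are made-up names) of how a capacity-1 apply semaphore on the receiver serializes all senders, each of which sits blocked in its header exchange until its turn comes:

    package main

    import (
    	"fmt"
    	"sync"
    	"time"
    )

    // applySem models the receiver's snapshot apply semaphore: only one
    // snapshot may be reserved and applied at a time (capacity 1).
    var applySem = make(chan struct{}, 1)

    // sendSnapshot models one sender. Its header exchange doesn't return
    // until the receiver has reserved the semaphore, so with many pending
    // snapshots the sender spends most of its time blocked right here.
    func sendSnapshot(rangeID int) {
    	applySem <- struct{}{}        // receiver reserves a slot (the ~32s wait)
    	defer func() { <-applySem }() // receiver releases after applying

    	time.Sleep(2 * time.Second) // stream + apply the snapshot data (the ~5s part)
    	fmt.Printf("r%d: snapshot streamed and applied\n", rangeID)
    }

    func main() {
    	start := time.Now()
    	var wg sync.WaitGroup
    	for rangeID := 1; rangeID <= 5; rangeID++ { // many ranges want snapshots at once
    		wg.Add(1)
    		go func(id int) {
    			defer wg.Done()
    			sendSnapshot(id)
    		}(rangeID)
    	}
    	wg.Wait()
    	// With capacity 1 the queue drains serially: ~5 * 2s total, even
    	// though each individual transfer takes only 2s.
    	fmt.Printf("all snapshots done after %s\n", time.Since(start).Round(time.Second))
    }

At ~40s per snapshot and thousands of pending snapshots, the backlog above follows directly.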

OK, it seems that the fundamental badness here really comes down to why all these snapshots are deemed necessary in the first place.

I'm going to make it a priority to look into the related truncation logic discussion in https://github.com/cockroachdb/cockroach/issues/31409#issuecomment-430193363 to see if we're doing stupid things when AddSSTable is involved (or generally).

cc @andreimatei

bdarnell commented 6 years ago

I'd start by looking at the replica GC queue. We send preemptive snapshots before adding replicas; this leaves a brief window between the preemptive snapshot and the finalization of the replica change during which the replica can be GC'd and require an immediate Raft snapshot. Previously, we relied on the fact that the replica GC queue was not responsive enough to make this scenario likely. Maybe the replica GC is now happening faster and is more likely to catch replicas during this vulnerable window.

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/0dba537ae88e495ddf29b4c347b4c30ee99bd046

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=983962&tab=buildLog

The test failed on release-2.1:
    asm_amd64.s:574,panic.go:502,panic.go:63,signal_unix.go:388,log.go:172,log.go:216,cluster.go:221,cluster.go:695: runtime error: invalid memory address or nil pointer dereference
benesch commented 6 years ago

I'm starting to suspect this change: https://github.com/cockroachdb/cockroach/commit/e436e56818e7687fde7846759eeebcb0bdc78833

tbg commented 6 years ago

I had two successful runs after "fixing" (or rather, disarming) the Raft truncation bugs, but good point about the GC queue. I'll keep an eye out.

Our goal really ought to be to get these import tests passing with zero (or barely any) Raft snapshots.

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/a4950e75f6592b8cbea217cf1392d8ef8961130f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=985074&tab=buildLog

The test failed on master:
    test.go:639,test.go:651: /home/agent/work/.go/bin/roachprod create teamcity-985074-import-tpch-nodes-32 -n 32 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
        stderr:

        stdout:
        Creating cluster teamcity-985074-import-tpch-nodes-32 with 32 nodes
        Unable to create cluster:
        in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181004 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script786808859 --project cockroach-ephemeral]
        Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
         - Quota 'CPUS' exceeded. Limit: 1200.0 in region us-central1.

        : exit status 1
        Cleaning up...
        : exit status 1
tbg commented 6 years ago

The AddSSTable requests are ~32mb each (https://github.com/cockroachdb/cockroach/blob/7ccc197b2751098c9945d6b46da6238ae5a195a4/pkg/ccl/storageccl/import.go#L34-L38) as we expected. I think with the log size tracking fixed, the behavior here is borderline ok:

  1. range 123 is created (empty)
  2. AddSSTable 1 (of probably 2) arrives and clears the quota pool, gets appended
  3. raftLogSize is now ~32mb
  4. AddSSTable 2/2 arrives, blocks on quota pool
  5. raft truncation queue immediately queues a truncation (using the quorum commit index instead of the lowest available index; see the sketch after this list). The truncation will go into the log behind the first AddSSTable, but the leader has a good chance of applying it well before one of the followers, and it might apply the truncation too. I think this is probably fine because by the time that happens we've at least sent out the AddSSTable to all followers, so we're not actually going to force any of them into a snapshot.
  6. AddSSTable 1/2 and the truncation apply, quota pool releases quota only as all followers catch up (since everything is healthy)
  7. AddSSTable 2/2 clears the quota pool, the game repeats.
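
To make step 5 concrete, here's a toy sketch (hypothetical helpers, not the raftLogQueue code) of the difference between truncating at the quorum commit index and truncating at the slowest follower's index:

    package main

    import (
    	"fmt"
    	"sort"
    )

    // truncateAtQuorum returns a truncation index based on the quorum commit
    // index: the highest log index known to be on a majority of replicas.
    // Any follower behind this index is cut off and needs a Raft snapshot.
    func truncateAtQuorum(matched []uint64) uint64 {
    	sorted := append([]uint64(nil), matched...)
    	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    	return sorted[len(sorted)/2] // median == quorum index for three replicas
    }

    // truncateAtSlowest returns a truncation index based on the lowest index
    // any (healthy) follower still needs, so nobody is forced into a snapshot.
    func truncateAtSlowest(matched []uint64) uint64 {
    	lowest := matched[0]
    	for _, m := range matched[1:] {
    		if m < lowest {
    			lowest = m
    		}
    	}
    	return lowest
    }

    func main() {
    	// Leader and one follower have appended the ~32 MB AddSSTable at
    	// index 12; the third replica is still working on index 10.
    	matched := []uint64{12, 12, 10}

    	fmt.Println("quorum-based index: ", truncateAtQuorum(matched))  // 12: slow follower needs a snapshot
    	fmt.Println("slowest-based index:", truncateAtSlowest(matched)) // 10: everyone can catch up from the log
    }

Picking the quorum index cuts off any follower that's behind it, and that follower then needs a Raft snapshot of the whole (AddSSTable-sized) range.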

What I don't understand, however, is why things get so bad with the bug. When the next AddSSTable comes in, the Ready that would append to the log would usually "atomically" send out these new entries to followers. So for a truncation to actually require a snapshot, we need a follower to not be caught up on the previous entry (because then Raft can only send out the message in a later Ready). But when the quota pool lets the second AddSSTable proceed into the Raft log, didn't we make sure that all followers are caught up?

I might be missing something or the behavior is more subtle (for example, it might be worse when ranges are newly created, or when there are additional small proposals like lease requests involved).

I'm going to see if there's something here that I can reproduce with a bit of hacking on a local 3x cluster and kv with a large payload.

tbg commented 6 years ago

I'm running a local experiment via roachprod with three nodes, where the third node has a 100ms sleep in applyRaftCommand, and the workload is:

./cockroach sql --insecure -e 'alter range default configure zone using range_max_bytes = 1048471142400;' && ./bin/workload run kv --cycle-length 1 --concurrency 4 --drop --init --max-ops 100 --min-block-bytes $((1024)) --max-block-bytes $((1 << 20)) --sequential

I additionally have a diff applied that makes the tracked size of the raft log always 8 MB (i.e. well above the 4 MB threshold that triggers an aggressive truncation).

I'm watching for Raft snapshots by tailing the logs and by watching crdb_internal.gossip_alerts.

I seem to be able to wedge the cluster in a stuck state:

[screenshot: one replica stuck behind, waiting on a Raft snapshot]

The rightmost replica isn't catching up over minutes. It's clearly in need of a Raft snapshot, but this snapshot doesn't seem to receive the appropriate priority. Instead, snapshots intersecting that replica are tried in an endless loop and, of course, fail. The slow replica is sent for GC repeatedly, but of course it can't be GC'ed -- it's still in the range, but needs to catch up on the split trigger.

I181025 11:15:55.969305 6632 storage/replica_gc_queue.go:222 [n3,replicaGC,s3,r31/3:/{Table/73-Max}] not gc'able, replica is still in range descriptor: (n3,s3):3

This situation resolved itself after a few minutes, but imagine this happening with thousands of ranges at the same time.

Looking at the code that adds replicas into the raft snapshot queue, I see that the same priority (zero) is used for all.

This suggests that we could use additional smarts. For example, instead of sending a replica blocking a snapshot to GC, we should (also?) offer it to the Raft snapshot queue, ideally at a priority that's higher if we suspect that the snapshot will shrink the replica.
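
Purely as an illustration (snapshotPriority and its inputs are invented for this sketch, not existing code), something like this is what I have in mind:

    package main

    import "fmt"

    // snapshotPriority sketches one way to rank replicas in the Raft snapshot
    // queue instead of enqueueing everything at priority zero.
    //
    // blocksOtherSnapshot: a pending incoming snapshot overlaps this replica
    // (like the stuck post-split replica above), so serving it unblocks work.
    // estShrinkBytes: how much smaller we expect the replica to get once the
    // snapshot applies.
    func snapshotPriority(blocksOtherSnapshot bool, estShrinkBytes int64) float64 {
    	prio := 0.0
    	if blocksOtherSnapshot {
    		prio += 100 // unclog other snapshots first
    	}
    	if estShrinkBytes > 0 {
    		prio += float64(estShrinkBytes) / (1 << 20) // roughly one point per MB reclaimed
    	}
    	return prio
    }

    func main() {
    	fmt.Println(snapshotPriority(true, 0))       // stuck split RHS: 100
    	fmt.Println(snapshotPriority(false, 64<<20)) // shrinks the replica by 64 MB: 64
    	fmt.Println(snapshotPriority(false, 0))      // everything else: 0 (today's value for all replicas)
    }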

tbg commented 6 years ago

I can also repro a stats inconsistency (running with COCKROACH_FATAL_ON_STATS_MISMATCH=true):

F181025 11:39:16.531669 2071 storage/replica_consistency.go:119  [n3,consistencyChecker,s3,r32/2:/{Table/55-Max}] found a delta of {false 0 0 315173140 -280215 0 36 0 16307809 3 0 0 0 0}
goroutine 2071 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000056301, 0xc000056300, 0x835e100, 0x1e)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:997 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x8a99f60, 0xc000000004, 0x835e142, 0x1e, 0x77, 0xc00087c800, 0x79)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:866 +0x8e8
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6e5f6a0, 0xc005f7b6e0, 0x4, 0x2, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6e5f6a0, 0xc005f7b6e0, 0x1, 0x4, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x6e5f6a0, 0xc005f7b6e0, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:172 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc0008d5c00, 0x6e5f6a0, 0xc005f7b6e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:119 +0x6e6
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc000364260, 0x6e5f6a0, 0xc005f7b6e0, 0xc0008d5c00, 0x0, 0x0, 0x300000003)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:117 +0x1d0
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processReplica(0xc0001fcb00, 0x6e5f6e0, 0xc0067923f0, 0xc0008d5c00, 0x0, 0x0)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:747 +0x339
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processLoop.func1.2(0x6e5f6e0, 0xc0067923f0)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:626 +0xb8
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0001b1e60, 0x6e5f6e0, 0xc0067923f0, 0xc005622500, 0x36, 0x0, 0x0, 0xc004f3b3e0)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:324 +0xe6
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:319 +0x134
tbg commented 6 years ago

Filed https://github.com/cockroachdb/cockroach/issues/31870 about the stats inconsistency (repros on plain master).

tbg commented 6 years ago

Ok, much simpler:

./bin/workload run kv --splits 1000 --init --drop --max-ops 1

With the truncation hack that simulates overly aggressive truncations (and the delay on n3), that gives me ~700 successful Raft snapshots and ~1100 failures. Ouch. With aggressive truncation and e436e56818e7687fde7846759eeebcb0bdc78833 reverted, it still gets ~700 successful snapshots and ~700 failures.

If I remove my hacks (i.e. run master without aggressive truncations or delay), I get typically <300 Raft snapshot successes and very few failures.

I'm just going to go ahead and fix the truncation problems to see what sticks after.

tbg commented 6 years ago

Experimenting some more, I can see that (on master with merges turned off) I still see Raft snapshots if I disable both the replicaGCQueue and the Raft log queue. I think this is due to the race in which the RHS of the split gets created via an incoming message from another replica that processed the split faster.

tbg commented 6 years ago

I tried to "hack" this problem away by disabling the Raft snapshot queue, but this leads to badness. I assume this is because once Raft asks for a snapshot, you have to give it one. You can't just forget about that request. Not 100% sure this is what's happening, but it would make sense.

tbg commented 6 years ago

I was able to hack this problem away by simply reporting all MsgSnap back to the raft group as completed. I opened my repro for this with a comment as a PR so that I don't lose track of it: https://github.com/cockroachdb/cockroach/pull/31875
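
The hack is roughly this shape (self-contained stand-ins below; the real code presumably goes through raftpb.MsgSnap and the raft group's ReportSnapshot(..., raft.SnapshotFinish)):

    package main

    import "fmt"

    // Stand-ins for the etcd/raft types involved in this sketch.
    type messageType int

    const (
    	msgApp messageType = iota
    	msgSnap
    )

    type message struct {
    	Type messageType
    	To   uint64
    }

    type snapshotStatus int

    const snapshotFinish snapshotStatus = 1

    type raftGroup interface {
    	ReportSnapshot(to uint64, status snapshotStatus)
    }

    // sendMessages is the shape of the hack: instead of delivering MsgSnap
    // (and thereby feeding the Raft snapshot queue), immediately report the
    // snapshot back to the group as completed and drop the message.
    func sendMessages(rg raftGroup, msgs []message, send func(message)) {
    	for _, m := range msgs {
    		if m.Type == msgSnap {
    			rg.ReportSnapshot(m.To, snapshotFinish)
    			continue
    		}
    		send(m)
    	}
    }

    type loggingGroup struct{}

    func (loggingGroup) ReportSnapshot(to uint64, _ snapshotStatus) {
    	fmt.Printf("reported snapshot to replica %d as finished\n", to)
    }

    func main() {
    	msgs := []message{{Type: msgApp, To: 2}, {Type: msgSnap, To: 3}}
    	sendMessages(loggingGroup{}, msgs, func(m message) {
    		fmt.Printf("delivering message type %d to replica %d\n", m.Type, m.To)
    	})
    }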

tbg commented 6 years ago

Attempting to productionize @petermattis' patch https://github.com/cockroachdb/cockroach/compare/master...petermattis:pmattis/tpcc-9x64?expand=1, I went down a little rabbit hole wondering why my test saw unexpected raft log sizes. Turns out we're mixing something up in log truncation. A TruncateRequest has a field Index which is supposedly the first index of the resulting log (i.e. we delete only strictly smaller indexes). But it turns out that our code actually includes Index itself in the deletion. That sounds pretty bad, but I also noticed that the Index we compute is usually (always?) one below what we could truncate (i.e. if all followers have index 100 in their logs, we're going to send Index: 100 even though we could delete 100 itself by sending Index: 101).

Digging with @nvanbenschoten.

tbg commented 6 years ago

Looks mostly like a stats issue. We're subtracting a little more (one entry's worth) from the log size than we should. We generate the correct TruncatedState (in which the Index is the last deleted one, not the first undeleted one) by populating it with args.Index - 1. The actual deletion happens downstream of Raft and uses the correct range. 😓

edit: PR here https://github.com/cockroachdb/cockroach/pull/31885/files
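
To spell out the off-by-one with a toy reconstruction of the diagnosis above (entrySize and truncate are made-up stand-ins): the deletion downstream of Raft removes [firstIndex, args.Index), the TruncatedState records args.Index - 1 as the last deleted index, but the stats were charged as if entry args.Index had been deleted too.

    package main

    import "fmt"

    // Toy log: entry sizes in bytes, keyed by index.
    var entrySize = map[uint64]int64{8: 100, 9: 100, 10: 100, 11: 100, 12: 100}

    // truncate models what happens downstream of Raft: entries in
    // [first, index) are deleted (index itself is kept), and the new
    // TruncatedState.Index is the last deleted entry, i.e. index - 1.
    func truncate(first, index uint64) (truncatedStateIndex uint64, removed int64) {
    	for i := first; i < index; i++ {
    		removed += entrySize[i]
    	}
    	return index - 1, removed
    }

    func main() {
    	first := uint64(8)      // current first index of the log
    	argsIndex := uint64(10) // TruncateRequest.Index sent by the queue

    	tsIndex, removed := truncate(first, argsIndex)
    	fmt.Println("TruncatedState.Index:", tsIndex)   // 9 (= args.Index - 1)
    	fmt.Println("bytes actually removed:", removed) // 200 (entries 8 and 9)

    	// The stats bug: subtracting the size of [first, args.Index]
    	// inclusive, i.e. one extra entry (index 10) that was never deleted.
    	var subtracted int64
    	for i := first; i <= argsIndex; i++ {
    		subtracted += entrySize[i]
    	}
    	fmt.Println("bytes subtracted from raftLogSize (buggy):", subtracted) // 300
    }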

tbg commented 6 years ago

Oh wow, are you ready for another small dumpster fire? TruncateTo for the sideloaded storage was never updated to deal with the fact that we're doing all of our writes through the RocksDB env. It seems that putting and getting from the sideloaded storage does use the env, so that works, but we're likely not cleaning these things up properly. Or maybe we are, but definitely not if you're using an in-mem engine as we are in tests.

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/fa7fb35ebd4ed4a2a4d0c1692cb1560654b73ea6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=987340&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1453,import.go:122: context canceled
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/5ef4d2c8621fc5465f73a96221b0bd0bc5cd27aa

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990073&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1453,import.go:122: context canceled
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/441d7134902e6bcc6a5abdff6b168003f19ce592

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990998&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1453,import.go:122: context canceled
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/109cf8705b773c0d3a1e7ab02ce63f764e101106

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=991714&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1453,import.go:122: context canceled
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/bbc646fc6de90b59c0253fd682667715959fb657

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1453,import.go:122: context canceled
tbg commented 6 years ago

^- this last run has #31988.

tbg commented 6 years ago

Stuck Raft requests all the way through.

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1461,import.go:122: context canceled
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/c196a52e8ec61851e32a062c13135bbb4111eb96

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=997430&tab=buildLog

The test failed on master:
    test.go:1037: test timed out (3h0m0s)
    test.go:639,cluster.go:1461,import.go:122: context canceled
tbg commented 5 years ago

Based on the log around Nov 1st, I think this was fixed by #32053. (There's likely a large buildup in snapshots after that, but it gets processed fast enough for the test to pass). The last run doesn't have a buildup of snapshots at all, likely due to #32594.