cockroach-teamcity opened 2 months ago
Draining is stuck:
```
./cockroach node drain --certs-dir=certs --port={pgport:2} --self
```

```
<truncated> ... emaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
[... the same line repeated another ~23 times ...]
node is draining...
ERROR: drain timeout, consider adjusting --drain-wait, especially under custom server.shutdown cluster settings
```
The drain command had been running for ~11 minutes when the test gave up:
2024/09/21 07:51:42 decommissionbench.go:881: test status: targeting node2 (n2) for decommission
2024/09/21 07:51:46 decommissionbench.go:936: test status: draining node2
2024/09/21 07:51:46 cluster.go:2438: running cmd `./cockroach node drain --ce...` on nodes [:2]
2024/09/21 07:51:46 cluster.go:2440: details in run_075146.727572582_n2_cockroach-node-drain.log
2024/09/21 08:02:47 test_impl.go:420: test failure #1: full stack retained in failure_1.log: (decommissionbench.go:727).runDecommissionBench: monitor failure: full command output in run_075146.727572582_n2_cockroach-node-drain.log: COMMAND_PROBLEM: exit status 1
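For context on what that output means: `cockroach node drain` keeps polling the server for the amount of remaining work and errors out once the `--drain-wait` budget is exhausted. Below is a minimal stand-alone sketch of that polling-with-deadline shape, not the actual CLI code; `drainOnce` and `drainWithDeadline` are made-up names, and the stand-in `drainOnce` is hard-coded to stay at 1 remaining item to mimic the stuck lease.

```go
// Toy sketch of a drain loop with a --drain-wait style deadline.
// Not the cockroach CLI implementation.
package main

import (
	"errors"
	"fmt"
	"time"
)

// drainOnce is a placeholder for one drain round trip; it returns the amount
// of work still left on the node. Hard-coded to get stuck at 1, mimicking the
// single lease that never transfers.
func drainOnce() (remaining int, err error) {
	return 1, nil
}

func drainWithDeadline(wait time.Duration) error {
	deadline := time.Now().Add(wait)
	for {
		remaining, err := drainOnce()
		if err != nil {
			return err
		}
		fmt.Printf("node is draining... remaining: %d\n", remaining)
		if remaining == 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("drain timeout, consider adjusting --drain-wait")
		}
		time.Sleep(200 * time.Millisecond)
	}
}

func main() {
	if err := drainWithDeadline(2 * time.Second); err != nil {
		fmt.Println("ERROR:", err)
	}
}
```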
Draining had been making steady progress w.r.t. the number of replicas per store, but it looks like the node retained one leaseholder right up until the test was killed.
The log has plenty of:
I240921 07:54:41.698982 236576 kv/kvserver/store.go:2025 ⋮ [drain] 8325 waiting for 1 replicas to transfer their lease away
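The shape behind that log is a retry loop: the store keeps trying to shed every lease it still holds and reports how many are left. A toy sketch of that shape (not kvserver's actual `store.go`; `drainLeases` and `transferLease` are invented names) shows why one range whose lease can never be transferred keeps the count pinned at 1 until the caller's context is canceled:

```go
// Toy sketch of a store-side drain loop waiting for lease transfers.
// Not the actual kvserver code.
package main

import (
	"context"
	"fmt"
	"time"
)

type rangeID int

// transferLease pretends to transfer the lease for one range; r87 always fails,
// standing in for "no suitable transfer target found".
func transferLease(r rangeID) bool {
	return r != 87
}

func drainLeases(ctx context.Context, held []rangeID) error {
	for {
		var stillHeld []rangeID
		for _, r := range held {
			if !transferLease(r) {
				stillHeld = append(stillHeld, r)
			}
		}
		if len(stillHeld) == 0 {
			return nil
		}
		fmt.Printf("waiting for %d replicas to transfer their lease away\n", len(stillHeld))
		held = stillHeld
		select {
		case <-ctx.Done():
			return ctx.Err() // drain canceled or timed out
		case <-time.After(100 * time.Millisecond):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(drainLeases(ctx, []rangeID{12, 42, 87}))
}
```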
In the log, we see a bunch of:
I240921 08:01:46.687772 769028 1@server/drain.go:319 ⋮ [T1,Vsystem,n2] 30712 drain remaining: 12
Up until:
I240921 08:01:47.889365 776741 1@server/drain.go:319 ⋮ [T1,Vsystem,n2] 30777 drain remaining: 0
E240921 08:01:47.889374 776741 1@server/drain.go:177 ⋮ [T1,Vsystem,n2] 30778 drain failed: txn exec: context canceled
That is probably the drain being canceled, so the 0 might not be a real 0.
Shortly before that:
I240921 08:01:46.033384 776037 kv/kvserver/store.go:1938 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30672 attempting to transfer lease repl=(n2,s2):4 seq=3 start=1726905235.783417588,0 epo=1 min-exp=1726905241.744482958,0 pro=1726905235.860334192,0 for range r87:‹/Table/10{7/1-8/1}› [(n2,s2):4, (n3,s3):2, (n6,s6):3, next=5, gen=11, sticky=1726907900.092995281,0]
I240921 08:01:46.033648 776037 kv/kvserver/store.go:1966 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30673 failed to transfer lease repl=(n2,s2):4 seq=3 start=1726905235.783417588,0 epo=1 min-exp=1726905241.744482958,0 pro=1726905235.860334192,0 for range r87:‹/Table/10{7/1-8/1}› [(n2,s2):4, (n3,s3):2, (n6,s6):3, next=5, gen=11, sticky=1726907900.092995281,0] when draining: no suitable transfer target found
That must be the last remaining lease that got things stuck.
There are 798 messages like this in the log, between 07:54:53.092895 and 08:01:46.033384, all for the same range, r87. For r87, there are also logs like these indicating a deadlock:
I240921 08:01:44.966447 774780 kv/kvserver/store.go:1967 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30623 blocked for 173 microseconds on transfer attempt
I240921 08:01:46.033384 776037 kv/kvserver/store.go:1938 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30672 attempting to transfer lease repl=(n2,s2):4 seq=3 start=1726905235.783417588,0 epo=1 min-exp=1726905241.744482958,0 pro=1726905235.860334192,0 for range r87:‹/Table/10{7/1-8/1}› [(n2,s2):4, (n3,s3):2, (n6,s6):3, next=5, gen=11, sticky=1726907900.092995281,0]
I240921 08:01:46.033648 776037 kv/kvserver/store.go:1966 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30673 failed to transfer lease repl=(n2,s2):4 seq=3 start=1726905235.783417588,0 epo=1 min-exp=1726905241.744482958,0 pro=1726905235.860334192,0 for range r87:‹/Table/10{7/1-8/1}› [(n2,s2):4, (n3,s3):2, (n6,s6):3, next=5, gen=11, sticky=1726907900.092995281,0] when draining: no suitable transfer target found
I240921 08:01:46.033708 776037 kv/kvserver/store.go:1967 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30674 blocked for 147 microseconds on transfer attempt
I240921 08:01:46.253640 775819 kv/kvserver/txnwait/queue.go:767 ⋮ [T1,Vsystem,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30689 ‹734c3c94› breaking deadlock by force push of ‹9221cdb5›; dependencies=[‹9221cdb5›]
I240921 08:02:20.431101 804189 kv/kvserver/txnwait/queue.go:767 ⋮ [T1,Vsystem,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 32160 ‹7d4ca417› breaking deadlock by force push of ‹09347d5e›; dependencies=[‹09347d5e›]
I240921 08:02:30.990998 812748 kv/kvserver/txnwait/queue.go:767 ⋮ [T1,Vsystem,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 32585 ‹4e55de8a› breaking deadlock by force push of ‹5236ce16›; dependencies=[‹5236ce16›]
The lease transfer can't find a target because the local replica is not the raft leader:
I240921 08:01:46.033572 776037 13@kv/kvserver/allocator/allocatorimpl/allocator.go:3039 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 7558 not considering (n3,s3):2 as a potential candidate for a lease transfer because the replica may be waiting for a snapshot: ‹local replica not raft leader›
I240921 08:01:46.033625 776037 13@kv/kvserver/allocator/allocatorimpl/allocator.go:3039 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 7559 not considering (n6,s6):3 as a potential candidate for a lease transfer because the replica may be waiting for a snapshot: ‹local replica not raft leader›
I240921 08:01:46.033639 776037 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2320 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 7560 no lease transfer target found for r87
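In other words, when the local replica is not the raft leader it cannot tell whether a follower is caught up, so every candidate is conservatively rejected and the transfer finds no target. A hypothetical sketch of that filtering decision (not the actual allocatorimpl code; `leaseTransferTarget`, `isLeader`, and `needsSnapshot` are made-up names) follows:

```go
// Toy sketch of lease-transfer target selection that skips candidates which
// may be waiting for a Raft snapshot. Not the actual allocator code.
package main

import "fmt"

type replica struct {
	nodeID, storeID int
}

// leaseTransferTarget returns a candidate to receive the lease, or false if
// none is considered safe. isLeader and needsSnapshot stand in for Raft state.
func leaseTransferTarget(cands []replica, isLeader bool, needsSnapshot func(replica) bool) (replica, bool) {
	for _, c := range cands {
		if !isLeader {
			// Without leadership we can't prove the follower is caught up,
			// so skip it rather than transfer to a possibly lagging replica.
			fmt.Printf("not considering (n%d,s%d) for a lease transfer: local replica not raft leader\n",
				c.nodeID, c.storeID)
			continue
		}
		if needsSnapshot(c) {
			continue
		}
		return c, true
	}
	return replica{}, false
}

func main() {
	cands := []replica{{3, 3}, {6, 6}}
	if _, ok := leaseTransferTarget(cands, false, func(replica) bool { return false }); !ok {
		fmt.Println("no lease transfer target found")
	}
}
```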
To move this out of triage, I'm going to assign c-bug based on the comments above; it seems like it could be legit.
We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.
roachtest.decommissionBench/nodes=6/warehouses=1000/drain-first/while-upreplicating/target=2/multi-region failed with artifacts on release-24.2.3-rc @ 47646613d54c7a68e8c4a2468092e147461b09c5:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
/cc @cockroachdb/kv-triage
Jira issue: CRDB-42401