cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: decommissionBench/nodes=6/warehouses=1000/drain-first/while-upreplicating/target=2/multi-region failed #131142

Open cockroach-teamcity opened 2 months ago

cockroach-teamcity commented 2 months ago

roachtest.decommissionBench/nodes=6/warehouses=1000/drain-first/while-upreplicating/target=2/multi-region failed with artifacts on release-24.2.3-rc @ 47646613d54c7a68e8c4a2468092e147461b09c5:

(decommissionbench.go:727).runDecommissionBench: monitor failure: full command output in run_075146.727572582_n2_cockroach-node-drain.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/decommissionBench/nodes=6/warehouses=1000/drain-first/while-upreplicating/target=2/multi-region/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-42401

pav-kv commented 1 month ago

Draining is stuck:

```
./cockroach node drain --certs-dir=certs --port={pgport:2} --self
<truncated> ... emaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... remaining: 12
node is draining... 
ERROR: drain timeout, consider adjusting --drain-wait, especially under custom server.shutdown cluster settings
```

after running for ~11 minutes:

2024/09/21 07:51:42 decommissionbench.go:881: test status: targeting node2 (n2) for decommission
2024/09/21 07:51:46 decommissionbench.go:936: test status: draining node2
2024/09/21 07:51:46 cluster.go:2438: running cmd `./cockroach node drain --ce...` on nodes [:2]
2024/09/21 07:51:46 cluster.go:2440: details in run_075146.727572582_n2_cockroach-node-drain.log
2024/09/21 08:02:47 test_impl.go:420: test failure #1: full stack retained in failure_1.log: (decommissionbench.go:727).runDecommissionBench: monitor failure: full command output in run_075146.727572582_n2_cockroach-node-drain.log: COMMAND_PROBLEM: exit status 1
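
The error message itself points at `--drain-wait`. Purely as a sketch (the 15m value is an arbitrary example, not anything the test actually sets), a re-run with a larger client-side drain budget would look like:

```
# Hypothetical: same drain invocation as the test, with a longer client-side
# wait. The 15m value is illustrative only.
./cockroach node drain --certs-dir=certs --port={pgport:2} --self --drain-wait=15m
```

That said, given the analysis below, the drain here was stuck on a single lease with no transfer target, so a longer wait alone probably would not have helped.
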
pav-kv commented 1 month ago

Draining has been making steady progress w.r.t. the number of replicas per store:

[Screenshot 2024-09-27 18:25:03: replica count per store during the drain]

But it looks like the node retained one leaseholder right up until the test was killed:

[Screenshot 2024-09-27 18:25:28: leaseholder count on the draining node]
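
As a cross-check of the graph, leaseholder placement at a point in time can also be pulled via SQL; a rough sketch, assuming the internal (and unstable) `crdb_internal.ranges` view and its `lease_holder` column:

```
# Sketch only: crdb_internal is an internal, unstable interface; the exact
# semantics of lease_holder (node vs. store ID) are an assumption here.
./cockroach sql --certs-dir=certs -e "
  SELECT lease_holder, count(*) AS leases
  FROM crdb_internal.ranges
  GROUP BY lease_holder
  ORDER BY lease_holder;"
```
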
pav-kv commented 1 month ago

The log has plenty of:

I240921 07:54:41.698982 236576 kv/kvserver/store.go:2025 ⋮ [drain] 8325  waiting for 1 replicas to transfer their lease away
pav-kv commented 1 month ago

In the log, we see a bunch of:

I240921 08:01:46.687772 769028 1@server/drain.go:319 ⋮ [T1,Vsystem,n2] 30712  drain remaining: 12

Up until:

I240921 08:01:47.889365 776741 1@server/drain.go:319 ⋮ [T1,Vsystem,n2] 30777  drain remaining: 0
E240921 08:01:47.889374 776741 1@server/drain.go:177 ⋮ [T1,Vsystem,n2] 30778  drain failed: txn exec: context canceled

That is probably the drain being canceled; the 0 might not be a real remaining count.

pav-kv commented 1 month ago

Shortly before that:

I240921 08:01:46.033384 776037 kv/kvserver/store.go:1938 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30672  attempting to transfer lease repl=(n2,s2):4 seq=3 start=1726905235.783417588,0 epo=1 min-exp=1726905241.744482958,0 pro=1726905235.860334192,0 for range r87:‹/Table/10{7/1-8/1}› [(n2,s2):4, (n3,s3):2, (n6,s6):3, next=5, gen=11, sticky=1726907900.092995281,0]
I240921 08:01:46.033648 776037 kv/kvserver/store.go:1966 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30673  failed to transfer lease repl=(n2,s2):4 seq=3 start=1726905235.783417588,0 epo=1 min-exp=1726905241.744482958,0 pro=1726905235.860334192,0 for range r87:‹/Table/10{7/1-8/1}› [(n2,s2):4, (n3,s3):2, (n6,s6):3, next=5, gen=11, sticky=1726907900.092995281,0] when draining: no suitable transfer target found

That must be the last remaining lease that got things stuck.

There are 798 messages like this in the log, between 07:54:53.092895 and 08:01:46.033384, all for the same range, r87.
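
(The count can be reproduced from the node log with something like the grep sketch below; the log file name is a placeholder for the actual artifact.)

```
# Sketch: count failed lease-transfer attempts for r87 in the n2 log.
# "cockroach.log" is a placeholder path.
grep -c 'failed to transfer lease.*for range r87' cockroach.log
```
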

pav-kv commented 1 month ago

For r87, there are logs like this indicating a deadlock:

I240921 08:01:44.966447 774780 kv/kvserver/store.go:1967 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30623  blocked for 173 microseconds on transfer attempt
I240921 08:01:46.033384 776037 kv/kvserver/store.go:1938 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30672  attempting to transfer lease repl=(n2,s2):4 seq=3 start=1726905235.783417588,0 epo=1 min-exp=1726905241.744482958,0 pro=1726905235.860334192,0 for range r87:‹/Table/10{7/1-8/1}› [(n2,s2):4, (n3,s3):2, (n6,s6):3, next=5, gen=11, sticky=1726907900.092995281,0]
I240921 08:01:46.033648 776037 kv/kvserver/store.go:1966 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30673  failed to transfer lease repl=(n2,s2):4 seq=3 start=1726905235.783417588,0 epo=1 min-exp=1726905241.744482958,0 pro=1726905235.860334192,0 for range r87:‹/Table/10{7/1-8/1}› [(n2,s2):4, (n3,s3):2, (n6,s6):3, next=5, gen=11, sticky=1726907900.092995281,0] when draining: no suitable transfer target found
I240921 08:01:46.033708 776037 kv/kvserver/store.go:1967 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30674  blocked for 147 microseconds on transfer attempt
I240921 08:01:46.253640 775819 kv/kvserver/txnwait/queue.go:767 ⋮ [T1,Vsystem,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 30689  ‹734c3c94› breaking deadlock by force push of ‹9221cdb5›; dependencies=[‹9221cdb5›]
I240921 08:02:20.431101 804189 kv/kvserver/txnwait/queue.go:767 ⋮ [T1,Vsystem,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 32160  ‹7d4ca417› breaking deadlock by force push of ‹09347d5e›; dependencies=[‹09347d5e›]
I240921 08:02:30.990998 812748 kv/kvserver/txnwait/queue.go:767 ⋮ [T1,Vsystem,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 32585  ‹4e55de8a› breaking deadlock by force push of ‹5236ce16›; dependencies=[‹5236ce16›]
pav-kv commented 1 month ago

The lease transfer is stuck because it can't find a target: the local replica is not the raft leader, so the allocator skips both other replicas as possibly waiting for a snapshot:

I240921 08:01:46.033572 776037 13@kv/kvserver/allocator/allocatorimpl/allocator.go:3039 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 7558  not considering (n3,s3):2 as a potential candidate for a lease transfer because the replica may be waiting for a snapshot: ‹local replica not raft leader›
I240921 08:01:46.033625 776037 13@kv/kvserver/allocator/allocatorimpl/allocator.go:3039 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 7559  not considering (n6,s6):3 as a potential candidate for a lease transfer because the replica may be waiting for a snapshot: ‹local replica not raft leader›
I240921 08:01:46.033639 776037 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2320 ⋮ [T1,Vsystem,drain,n2,s2,r87/4:‹/Table/10{7/1-8/1}›] 7560  no lease transfer target found for r87
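
To confirm which replica actually holds raft leadership for r87, the per-range status endpoint (or the equivalent range report under Advanced Debug in the DB Console) shows per-replica raft state. A sketch only; the endpoint path, port, and auth are assumptions and vary by version and cluster configuration:

```
# Sketch: query node 2's HTTP status API for r87's per-replica raft state.
# The path, port, and <node2-host> are assumptions; a secure cluster also
# needs an admin session or equivalent authentication.
curl --silent --cacert certs/ca.crt \
  "https://<node2-host>:8080/_status/range/87"
```
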
kvoli commented 1 month ago

To move this out of triage, I'm going to assign C-bug based on the comments above; it seems like it could be a legitimate bug.

github-actions[bot] commented 1 week ago

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.