cockroachdb / cockroach


roachtest: decommission/nodes=4/duration=1h0m0s failed [replica stuck in queue processing] #112761

Closed: cockroach-teamcity closed this issue 10 months ago

cockroach-teamcity commented 11 months ago

roachtest.decommission/nodes=4/duration=1h0m0s failed with artifacts on release-23.2 @ 935399600693f84fb746715f430ecc245934f3cd:

(decommission.go:381).runDecommission: operation "decommission" timed out after 20m0.001s (given timeout 20m0s): context deadline exceeded
test artifacts and logs in: /artifacts/decommission/nodes=4/duration=1h0m0s/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12271771/decommission-nodes-4-duration-1h0m0s/1697814740954/1697817353446)

Same failure on other branches

- #110761 roachtest: decommission/nodes=4/duration=1h0m0s failed [C-bug C-test-failure O-roachtest O-robot T-kv branch-master release-blocker]

/cc @cockroachdb/kv-triage


Jira issue: CRDB-32581

kvoli commented 11 months ago

We bumped the logging in https://github.com/cockroachdb/cockroach/pull/111315 for these roachtests. If this is the same 1 replica "stuck" issue as #110761, we may have an easier time diagnosing.

possible decommission stall detected
n23 still has replica id 61 for range r4

| id | is_live | replicas | is_decommissioning | membership | is_draining |
|----|---------|----------|--------------------|------------|-------------|
| 23 | false | 1 | true | decommissioning | true |

It appears that it could well be the same issue.

kvoli commented 11 months ago

This appears different from #110761. Edit: it is possibly the same underlying issue.

n1 is the leaseholder and notices that the replica on "node4" (n23) needs to be replaced, enqueuing:

I231020 15:33:32.419045 421852 13@kv/kvserver/allocator/allocatorimpl/allocator.go:1008 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 14204  ‹replace decommissioning voter› - replacement for 1 decommissioning voters priority=5000.00
I231020 15:33:32.418873 421852 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 14203  computing range action desc=r4:‹/System{/tsd-tse}› [(n2,s2):56, (n1,s1):54, (n23,s23):61, next=62, gen=236] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1697815800000000000 > ignore_if_excluded_from_backup:true > > num_replicas:3 constraints:<constraints:<type:PROHIBITED value:"node4" > > ›
I231020 15:33:32.419063 421852 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 14205  repair needed (‹replace decommissioning voter›), enqueuing

Approx. 1 minute later, without any range changes, we see that n1 considers the range for rebalancing instead.

I231020 15:34:49.142831 427460 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 14387  computing range action desc=r4:‹/System{/tsd-tse}› [(n2,s2):56, (n1,s1):54, (n23,s23):61, next=62, gen=236] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1697815800000000000 > ignore_if_excluded_from_backup:true > > num_replicas:3 constraints:<constraints:<type:PROHIBITED value:"node4" > > ›
I231020 15:34:49.143268 427460 13@kv/kvserver/allocator/plan/replicate.go:200 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 14388  no rebalance target found, not enqueuing

After 5 minutes pass, n23 is considered dead and a replace dead voter action is returned.

I231020 15:39:26.227729 440688 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 15053  computing range action desc=r4:‹/System{/tsd-tse}› [(n2,s2):56, (n1,s1):54, (n23,s23):61, next=62, gen=236] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1697815800000000000 > ignore_if_excluded_from_backup:true > > num_replicas:3 constraints:<constraints:<type:PROHIBITED value:"node4" > > ›
I231020 15:39:26.227938 440688 13@kv/kvserver/allocator/allocatorimpl/allocator.go:1000 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 15054  ‹replace dead voter› - replacement for 1 dead voters priority=12000.00
I231020 15:39:26.227971 440688 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 15055  repair needed (‹replace dead voter›), enqueuing
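
For reference, a minimal sketch (illustrative only, not the actual allocator or liveness code) of why the repair action flips from replace decommissioning voter (priority 5000) to replace dead voter (priority 12000): once a node's liveness record has been expired for longer than the dead threshold (5m by default via server.time_until_store_dead), the dead classification takes precedence over decommissioning. It does not explain the intermediate state at 15:34 where the node was treated as neither, which is the unexpected part noted below.

```go
package main

import (
	"fmt"
	"time"
)

type nodeStatus string

const (
	statusLive            nodeStatus = "live"            // no repair needed
	statusDecommissioning nodeStatus = "decommissioning" // replace decommissioning voter, priority 5000
	statusDead            nodeStatus = "dead"            // replace dead voter, priority 12000
)

// classify is a rough illustration of the precedence suggested by the logs,
// not the real kvserver/liveness logic: dead wins over decommissioning once
// the liveness record has been expired for longer than the dead threshold.
func classify(expiredFor time.Duration, decommissioning bool, timeUntilStoreDead time.Duration) nodeStatus {
	if expiredFor > timeUntilStoreDead {
		return statusDead
	}
	if decommissioning {
		return statusDecommissioning
	}
	return statusLive
}

func main() {
	const deadThreshold = 5 * time.Minute // default server.time_until_store_dead
	fmt.Println(classify(1*time.Minute, true, deadThreshold)) // decommissioning (15:33)
	fmt.Println(classify(6*time.Minute, true, deadThreshold)) // dead (15:39)
}
```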

What doesn't add up is that n1 does not log processing r4, even though it does log enqueueing it into the replicate queue. It should be fairly straightforward to remove (n23,s23):61 and replace it with (n3,s3):62.

We can see that n3 is considered live and a valid candidate from lease transfer logging throughout.

The last time the replicate queue processed a change was at 15:33:24, which we can tell by grepping for "next replica action", logged here:

I231020 15:33:24.156824 419900 13@kv/kvserver/allocator/plan/replicate.go:266 ⋮ [T1,Vsystem,n1,replicate,s1,r17/15:‹/Table/1{4-5}›] 14160  next replica action: ‹replace decommissioning voter›

Two unexpected things to follow up on:

  1. Why did n23's liveness status switch from decommissioning to non-decommissioning (so the range was considered for rebalancing) between 15:33 and 15:39? That is unexpected.
  2. Why did the replicate queue on n1 not process any changes after 15:33:24?
kvoli commented 11 months ago

The TSDump shows no pending replicas and no replicas in purgatory on n1, which means the stuck replica didn't make it into the repl queue:

(TSDump screenshot: replicate queue pending and purgatory replica counts on n1, both zero)

kvoli commented 11 months ago

> Why did n23's liveness status switch from decommissioning to non-decommissioning (so the range was considered for rebalancing) between 15:33 and 15:39? That is unexpected.

This is the issue described in https://github.com/cockroachdb/cockroach/issues/99064.

kvoli commented 10 months ago

1/20 stress runs last night failed with the same issue.

Interestingly, this only ever fails with the timeseries range (r4) having a replica stuck on the dead+decommissioning node.

The past test failures are the same, back to 23.1.

I suspect something is going wrong here with purgatory/requeuing (at the moment it looks unrelated to https://github.com/cockroachdb/cockroach/issues/113045). Branch with added logging. Similar sentiment to the last few times this test failed:

> There were also no replicas in purgatory. There must be a bug somewhere because the replica doesn't seem to even be making it into the process loop of the queue, where it should pass the checks: the lease is valid, liveness info looks good on n1 (decommissioning n13), and it is otherwise identical to the rhs of the split, which did process.

kvoli commented 10 months ago

Edited above comment.

My theory is that the bug is around the baseQueue addInternal function, which won't add a replica to the queue for processing under a few conditions, even if the queue implementation returns shouldQueue=true.

The allocator and replicate queue log lines show that the stuck range most certainly wants to be queued on the leaseholder:

I231020 15:39:26.227971 440688 13@kv/kvserver/allocator/plan/replicate.go:161 ⋮ [T1,Vsystem,n1,replicate,s1,r4/54:‹/System{/tsd-tse}›] 15055  repair needed (‹replace dead voter›), enqueuing

And the code path back to addInternal is uninterrupted. Within addInternal, I think we are hitting one of the conditions which causes the replica to not be added.

More specifically, the already-processing check:

https://github.com/cockroachdb/cockroach/blob/b13be530bde2ac9efea8e9b4fc9b67dc0feb8366/pkg/kv/kvserver/queue.go#L766-L771

The other checks here can be shown not to be hit: the descriptor must be initialized, because it passed the initial shouldQueue; the replica is not in purgatory, because the purgatory length is 0; and the queue is not disabled.
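
To make the elimination concrete, here is a small sketch of the guard ordering being described (hypothetical field and function names, not the real baseQueue code). Under the reasoning above, only the already-processing branch is left as a candidate:

```go
// addInternalSketch is an illustrative reduction of the guards an
// addInternal-style function applies before accepting a replica into the
// queue. Names are made up for this comment.
type queueItemSketch struct {
	inPurgatory bool
	processing  bool
	requeue     bool
	queued      bool
}

func addInternalSketch(item *queueItemSketch, queueDisabled, descInitialized bool) (added bool) {
	if queueDisabled {
		return false // ruled out: the queue is not disabled
	}
	if !descInitialized {
		return false // ruled out: shouldQueue already passed, so the descriptor is initialized
	}
	if item.inPurgatory {
		return false // ruled out: purgatory length is 0
	}
	if item.processing {
		// The suspected branch: the replica is already marked as processing,
		// so it is only flagged for a later requeue instead of being added.
		item.requeue = true
		return false
	}
	item.queued = true
	return true
}
```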

kvoli commented 10 months ago

This reproduced again overnight with the added logging, and the above theory holds some weight, but how the replica/queue ended up in this state isn't clear yet. The same scenario plays out as above.

I231026 20:03:00.930068 824524 kv/kvserver/queue.go:773 ⋮ [T1,Vsystem,n2,replicate,s2,r4/147:‹/System{/tsd-tse}›] 267390  replica is already processing marking for requeue
cockroach-teamcity commented 10 months ago

roachtest.decommission/nodes=4/duration=1h0m0s failed with artifacts on release-23.2 @ 56fd045302ed2edc14831fb001ef77486ae8f62a:

(decommission.go:381).runDecommission: operation "decommission" timed out after 20m0.001s (given timeout 20m0s): context deadline exceeded
test artifacts and logs in: /artifacts/decommission/nodes=4/duration=1h0m0s/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_metamorphicBuild=true, ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12550589/decommission-nodes-4-duration-1h0m0s/1699372677606/1699377083528)

Same failure on other branches

- #110761 roachtest: decommission/nodes=4/duration=1h0m0s failed [C-test-failure O-roachtest O-robot T-kv X-duplicate branch-master]


kvoli commented 10 months ago

The latest failure is again r4 getting stuck in the same failure mode.

kvoli commented 10 months ago

The timeline that leads to a replica becoming stuck in a processing state:

  1. process replica success
  2. process replica fail -> purgatory (x2)
  3. needs lease not adding (should fail purgatory processing as well)
  4. replica ends up in a processing state

The stuck replica (r4) doesn't have an associated "processing replica" log line before it ends up in a processing state; the last time it logged one was during purgatory processing.

I231026 19:40:27.054211 591823 kv/kvserver/queue.go:1001 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 199429  processing replica
I231026 19:40:27.054248 591823 kv/kvserver/queue.go:1035 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 199430  processing...
I231026 19:40:28.485891 591823 kv/kvserver/queue.go:1045 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 199542  processing... done
I231026 19:40:28.485926 591823 kv/kvserver/queue.go:954 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 199543  done 1.431734557s

I231026 19:40:34.106460 595208 kv/kvserver/queue.go:1001 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200025  processing replica
I231026 19:40:34.106500 595208 kv/kvserver/queue.go:1035 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200026  processing...
I231026 19:40:34.106973 595208 kv/kvserver/queue.go:954 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200027  done 535.327µs
I231026 19:40:34.106993 595208 kv/kvserver/queue.go:1189 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200028  purgatory: ‹0 of 3 live stores are able to take a new replica for the range (2 already have a voter, 0 already have a non-voter); replicas must match constraints [
{+node4}]; voting replicas must match voter_constraints []›

I231026 19:40:39.135880 594561 kv/kvserver/queue.go:1001 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200485  processing replica
I231026 19:40:39.135896 594561 kv/kvserver/queue.go:1035 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200486  processing...
I231026 19:40:39.136373 594561 kv/kvserver/queue.go:1189 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200487  purgatory: ‹0 of 3 live stores are able to take a new replica for the range (2 already have a voter, 0 already have a non-voter); replicas must match constraints [
{+node4}]; voting replicas must match voter_constraints []›

I231026 19:40:41.192641 594561 kv/kvserver/queue.go:1001 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200613  processing replica
I231026 19:40:41.192658 594561 kv/kvserver/queue.go:1035 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200614  processing...
I231026 19:40:41.193537 594561 kv/kvserver/queue.go:1189 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200615  purgatory: ‹0 of 3 live stores are able to take a new replica for the range (2 already have a voter, 0 already have a non-voter); replicas must match constraints [
{+node4}]; voting replicas must match voter_constraints []›

I231026 19:40:41.353000 597716 kv/kvserver/queue.go:763 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200624  not adding because in purgatory
I231026 19:40:42.129923 597901 kv/kvserver/queue.go:763 ⋮ [T1,Vsystem,n2,replicate,s2,r4/144:‹/System{/tsd-tse}›] 200643  not adding because in purgatory

I231026 19:40:51.977600 601275 kv/kvserver/queue.go:706 ⋮ [T1,Vsystem,n2,replicate,s2,r4/147:‹/System{/tsd-tse}›] 201134  needs lease; not adding: repl=(n3,s3):146VOTER_INCOMING seq=36 start=1698349250.493537172,0 exp=1698349256.493485811,0 pro=1698349250.493485811,0
...
I231026 19:42:53.370264 648178 kv/kvserver/queue.go:706 ⋮ [T1,Vsystem,n2,replicate,s2,r4/147:‹/System{/tsd-tse}›] 209511  needs lease; not adding: repl=(n1,s1):149 seq=38 start=1698349307.610369578,0 exp=1698349378.634242877,0 pro=1698349372.634242877,0

19:42:59 cluster.go:2258: running cmd `./cockroach node decommissi...` on nodes [:1]; details in run_194259.088476448_n1_cockroach-node-decom.log
# note there isn't a "processing replica" line between the above and here.
I231026 19:43:08.368121 653515 kv/kvserver/queue.go:773 ⋮ [T1,Vsystem,n2,replicate,s2,r4/147:‹/System{/tsd-tse}›] 210580  replica is already processing marking for requeue
# another replica gets processed from purgatory here.
I231026 19:55:16.575796 801261 kv/kvserver/queue.go:773 ⋮ [T1,Vsystem,n2,replicate,s2,r4/147:‹/System{/tsd-tse}›] 246264  replica is already processing marking for requeue
kvoli commented 10 months ago

This looks promising: when a replica is marked as processing in purgatory, it isn't guaranteed to eventually be marked as not processing.

When processReplicasInPurgatory is called, every replica in purgatory is marked as processing and copied into a slice under the base queue mutex:

https://github.com/cockroachdb/cockroach/blob/c7c16fec0f376c60c8a218d5ba7bdd192b52083c/pkg/kv/kvserver/queue.go#L1259-L1270

Next (outside of the base queue mutex), the queue grabs the replica ref from the store and processes it. If the replica has been destroyed, or its replica ID has changed, it is skipped and the process loop continues:

https://github.com/cockroachdb/cockroach/blob/c7c16fec0f376c60c8a218d5ba7bdd192b52083c/pkg/kv/kvserver/queue.go#L1273-L1276

But in order for the replica to be removed from bq.mu.replicas (or have processing=false set), finishProcessingReplica must be called, and it is not called on the skip path above when the replica ID has changed or the replica has been destroyed.

So we end up in a state where the replica remains in bq.mu.replicas with processing=true. If a new replica for the same range ID is added to the store, it will never be able to be processed through this queue, leading to the decommission stall that causes this test to fail.
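
To make the shape of the bug concrete, here is a condensed sketch under the assumptions above (illustrative types and names, not the actual queue.go code), with a comment marking where a fix would need to clear the processing state:

```go
// replicaSketch and purgatoryQueueSketch are made-up stand-ins for the real
// store/queue state; only the control flow is meant to mirror the description
// above.
type replicaSketch struct {
	replicaID int
	destroyed bool
}

type purgatoryQueueSketch struct {
	processing map[int]bool           // rangeID -> marked as processing
	replicas   map[int]*replicaSketch // rangeID -> current replica on the store
}

// processPurgatory mimics the two-step shape: mark every purgatory item as
// processing under the mutex, then process the copies outside the mutex.
func (q *purgatoryQueueSketch) processPurgatory(items map[int]int /* rangeID -> replicaID at enqueue time */) {
	for rangeID := range items {
		q.processing[rangeID] = true
	}
	for rangeID, replicaID := range items {
		repl, ok := q.replicas[rangeID]
		if !ok || repl.destroyed || repl.replicaID != replicaID {
			// Suspected bug: skipping here without a finishProcessing-style
			// cleanup leaves processing[rangeID] == true forever, so any new
			// replica for the same range ID only ever gets "marked for
			// requeue" by addInternal and is never actually processed.
			// Fix direction: clear the flag on this path too, e.g.
			//   q.processing[rangeID] = false
			continue
		}
		// ... actually process the replica ...
		q.processing[rangeID] = false // finishProcessingReplica equivalent
	}
}
```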

Draft fix PR.