I'll be cautious and mark this as a GA-blocker until we understand whether the LOQ tool is supposed to work with splits. If it isn't, I think we just need to disable the split queue for this test to prevent flakes.
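For reference, a minimal sketch of what disabling the split queue could look like, assuming the test builds its cluster through testcluster and that the store testing knobs apply here (the actual test setup may differ):

```go
// Sketch: start the cluster with the split and merge queues disabled so
// background activity cannot rewrite range descriptors while the LOQ
// tooling runs. Knob names are from kvserver.StoreTestingKnobs; how this
// test actually wires its cluster may differ.
tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
	ServerArgs: base.TestServerArgs{
		Knobs: base.TestingKnobs{
			Store: &kvserver.StoreTestingKnobs{
				DisableSplitQueue: true,
				DisableMergeQueue: true,
			},
		},
	},
})
defer tc.Stopper().Stop(context.Background())
```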
I think this test is invalid. We specifically don't guarantee the LOQ tooling will work if we lose system ranges (particularly meta2). I'm surprised this hasn't flaked more often than it has.
There are two options:
1) Prevent anything that modifies range descriptors. This may help, but there is still a risk that something else will cause this to flake.
2) Don't cause a LOQ for any system ranges. This could be done by creating 5 nodes, pinning the scratch range to 3 of them, and then taking down 2 of those nodes (see the sketch below). That leaves the system ranges available and tests the case where we expect recovery to always work.
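A rough sketch of option 2, assuming the test uses testcluster; the helpers and node indexing here are illustrative, not the test's actual code:

```go
// Five nodes so the system ranges (which default to five replicas) keep
// quorum when two nodes go down; only the scratch range loses quorum.
tc := testcluster.StartTestCluster(t, 5, base.TestClusterArgs{})
defer tc.Stopper().Stop(context.Background())
if err := tc.WaitForFullReplication(); err != nil {
	t.Fatal(err)
}

// Create the scratch range and see which nodes ended up with its replicas.
scratchKey := tc.ScratchRange(t)
desc := tc.LookupRangeOrFatal(t, scratchKey)

// Take down two of the nodes holding scratch-range replicas. The system
// ranges stay available on the remaining nodes, which is the case where
// we expect the LOQ tooling to always work.
replicas := desc.Replicas().Descriptors()
tc.StopServer(int(replicas[0].NodeID) - 1) // server index = NodeID - 1 in testcluster
tc.StopServer(int(replicas[1].NodeID) - 1)
```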
Thoughts?
The generated plan has the following at the end:
...
range r66:/Table/64 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n2,s2):3,(n3,s3):2].
range r67:/Table/65 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n3,s3):3,(n2,s2):2].
Discovered dead nodes, will be marked as decommissioned:
n2, n3
Found replica inconsistencies:
range has unapplied split operation
r67, /{Table/65-Max} rhs r68, /{Table/Max-Max}
Only proceed as a last resort!
ERROR: can not create plan because of errors and no --force flag is given
In contrast, a successful test run generates this:
...
range r67:/Table/65 updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n2,s2):3,(n3,s3):2].
range r68:/Table/Max updating replica (n1,s1):1 to (n1,s1):14. Discarding available replicas: [], discarding dead replicas: [(n2,s2):3,(n3,s3):2].
Discovered dead nodes, will be marked as decommissioned:
n2, n3
Plan created.
To stage recovery application in half-online mode invoke:
cockroach debug recover apply-plan --host <node-hostname>[:<port>] [--certs-dir <certificates-dir>|--insecure] recovery-plan.json
Alternatively distribute plan to below nodes and invoke 'debug recover apply-plan --store=<store-dir> recovery-plan.json' on:
- node n1, store(s) s1
The LOQ tooling documentation says that we can't recover in the case of "Losing replica state due to a range merge or range split that happened at almost the exact moment that a node failed".
Looking at the list of ranges: in a successful run we have 68 ranges, with r68 being /Table/Max (the scratch range start key; we create this scratch range at the beginning of the test). In the failed run we only have 67 ranges, so the last split presumably just failed to complete in time.
We can probably wait for the split to complete on all 3 replicas before shutting down the cluster.
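A sketch of that wait, assuming the test uses testcluster and can reach the kvserver stores directly (testcluster may also have a ready-made helper for this); names are illustrative:

```go
// Wait until every store has applied the split at the scratch key, i.e.
// holds a replica whose descriptor starts exactly at splitKey, before we
// shut nodes down.
splitKey := tc.ScratchRange(t)
testutils.SucceedsSoon(t, func() error {
	for i := 0; i < tc.NumServers(); i++ {
		store, err := tc.Server(i).GetStores().(*kvserver.Stores).GetStore(
			tc.Server(i).GetFirstStoreID())
		if err != nil {
			return err
		}
		repl := store.LookupReplica(roachpb.RKey(splitKey))
		if repl == nil || !repl.Desc().StartKey.Equal(roachpb.RKey(splitKey)) {
			return errors.Errorf("split at %s not yet applied on store %d",
				splitKey, store.StoreID())
		}
	}
	return nil
})
```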
Will close this with the above PR, which addresses the failure reported here.
In the future, if this test fails again we may need to disable queues and maybe more:
I think you need ReplicationManual to disable the split queue. We can consider switching to it if this test fails again. Separately, we could maybe also use WaitForZoneConfigPropagation and WaitForFullReplication before shutting off the nodes.
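A sketch of what that could look like, assuming the test uses testcluster (ReplicationMode and the wait helper shown in the comment are real testcluster/base APIs, but whether they drop into this test cleanly is untested):

```go
// Manual replication disables the split queue (and the other replication
// queues), so no background activity changes range descriptors.
tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
	ReplicationMode: base.ReplicationManual,
})
defer tc.Stopper().Stop(context.Background())

// Alternatively, if the test stays on automatic replication, settle
// replication before shutting off the nodes (the zone-config propagation
// wait mentioned above would go here as well):
//   if err := tc.WaitForFullReplication(); err != nil { t.Fatal(err) }
```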
cli.TestLossOfQuorumRecovery failed on master @ 7cea6a90eed456c76dfa07e618c3cd2257b302e5:
Parameters:
attempt=1
race=true
run=2
shard=12
See also: How To Investigate a Go Test Failure (internal)
/cc @cockroachdb/kv @cockroachdb/server
Jira issue: CRDB-37327