Closed cockroach-teamcity closed 1 year ago
@erikgrinaker assigning to you for now, since you probably have the most context on this one
The test setup is failing to move a replica off of n5 until the test times out:
16:18:18 failover.go:1674: moving 1 ranges off of n5 (database_name != 'kv')
16:18:19 failover.go:1674: moving 1 ranges off of n5 (database_name != 'kv')
16:18:20 failover.go:1674: moving 1 ranges off of n5 (database_name != 'kv')
...
It's trying to move all system ranges from nodes 4,5,6 to nodes 1,2,3.
It does so by trying to find a node in 1,2,3 that doesn't already have a replica.
However, ranges.json on n5 shows that the only system range (r18) has RF=5 and already has replicas on all of 1,2,3, so there's no free target to move to.
This is despite the zone configs specifying RF=3 for all ranges, including all system ranges.
The range info is the same across all nodes. The range belongs to table 15, i.e. the jobs table. All zone configs are confirmed to run with num_replicas = 3. n1 is the leaseholder, but it doesn't appear to be attempting any operations on the range.
Throwing this back to KV to find out why the zone config isn't being applied to r18, i.e. the jobs table.
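To illustrate why the retry loop can never make progress here, a minimal sketch of the target selection described above. This is not the code in failover.go: `pickTarget` is a hypothetical helper, and since the node holding the fifth replica isn't stated above, n4 is an arbitrary placeholder.

```go
package main

import "fmt"

// pickTarget returns the first candidate node that does not already hold a
// replica of the range. The boolean is false when every candidate already
// has one. Hypothetical helper for illustration only.
func pickTarget(candidates, replicas []int) (int, bool) {
	has := make(map[int]bool, len(replicas))
	for _, n := range replicas {
		has[n] = true
	}
	for _, c := range candidates {
		if !has[c] {
			return c, true
		}
	}
	return 0, false
}

func main() {
	candidates := []int{1, 2, 3}

	// Expected case: RF=3 with replicas on n4, n5, n6. Any of n1-n3 is free.
	fmt.Println(pickTarget(candidates, []int{4, 5, 6}))

	// Observed case: RF=5 with replicas on n1, n2, n3 and n5. The fifth
	// replica's node (n4 here) is an assumption. No candidate is free.
	fmt.Println(pickTarget(candidates, []int{1, 2, 3, 4, 5}))
}
```

With an RF=5 range that already covers all of 1,2,3, the selection comes back empty on every attempt, which matches the "moving 1 ranges off of n5" message repeating until the timeout.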
roachtest.failover/partial/lease-leader failed with artifacts on master @ 85cbfffeaa60c3d40e51eb6be7e49eca4dcc8a18:
(test_runner.go:1099).runTest: test timed out (30m0s)
(assertions.go:333).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1668
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:562
main/pkg/cmd/roachtest/test_runner.go:1084
GOROOT/src/runtime/asm_amd64.s:1594
Error: Received unexpected error:
dial tcp 35.243.139.125:26257: connect: connection refused
Test: failover/partial/lease-leader
(require.go:1360).NoError: FailNow called
(cluster.go:2139).Run: context canceled
test artifacts and logs in: /artifacts/failover/partial/lease-leader/run_1
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/p/roachfana/teamcity-11532636-1693288026-82-n7cpu2/1693324872308/1693326731183)
Following up from what Erik said, span configurations indicate the correct replication factor as well. Nothing is returned when running:
cat ~/Downloads/debug/system.span_configurations.txt | awk '{print $NF}' | tail -n +2 | cut -c '3-' | ./cockroach debug decode-proto --schema 'cockroach.roachpb.SpanConfig' | jq '.numReplicas' | grep 5
So span configs were reconciled correctly. I'm not sure why n1 is not trying to downreplicate r18. @kvoli would you mind having a look?
Maybe this test disables the replicate queue? I forget, and not at a computer rn.
You're right, it does disable the replicate queue. Interestingly, it does so after the call to WaitFor3XReplication. But looking at the code, that function name is a bit misleading.
So it seems like this function assumes we're calling it in the context of upreplication, which in this case we aren't. I'll send out a patch.
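A rough sketch of the distinction, assuming the wait condition is effectively a lower bound on the replica count. The real WaitFor3XReplication implementation is not reproduced here; `atLeast3X` and `exactly3X` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// atLeast3X reports whether every range has 3 or more replicas. This is the
// assumed "upreplication" style check: a range with 5 replicas passes.
func atLeast3X(replicaCounts []int) bool {
	for _, n := range replicaCounts {
		if n < 3 {
			return false
		}
	}
	return true
}

// exactly3X reports whether every range has exactly 3 replicas, so it also
// waits for ranges that still need to downreplicate.
func exactly3X(replicaCounts []int) bool {
	for _, n := range replicaCounts {
		if n != 3 {
			return false
		}
	}
	return true
}

func main() {
	// One range still at RF=5 (like r18), the rest at RF=3.
	counts := []int{3, 3, 5}
	fmt.Println("atLeast3X:", atLeast3X(counts))
	fmt.Println("exactly3X:", exactly3X(counts))
}
```

Under that assumption, a lower-bound check returns while r18 still has five replicas, and disabling the replicate queue afterwards leaves it stuck at RF=5.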
roachtest.failover/partial/lease-leader failed with artifacts on master @ 9ba2d93854900a9f6f9a3c09519d8d88fe6d7675:
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Help
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/p/roachfana/teamcity-11499554-1693028847-78-n7cpu2/1693066513283/1693068384998)
/cc @cockroachdb/kv-triage
Jira issue: CRDB-30985