
roachtest: failover/partial/lease-leader failed #109555

Closed. cockroach-teamcity closed this issue 1 year ago.

cockroach-teamcity commented 1 year ago

roachtest.failover/partial/lease-leader failed with artifacts on master @ 9ba2d93854900a9f6f9a3c09519d8d88fe6d7675:

(test_runner.go:1099).runTest: test timed out (30m0s)
(assertions.go:333).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1668
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:562
                                main/pkg/cmd/roachtest/test_runner.go:1084
                                GOROOT/src/runtime/asm_amd64.s:1594
    Error:          Received unexpected error:
                    dial tcp 34.75.213.55:26257: connect: connection refused
    Test:           failover/partial/lease-leader
(require.go:1360).NoError: FailNow called
(cluster.go:2139).Run: context canceled
test artifacts and logs in: /artifacts/failover/partial/lease-leader/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
See: [Grafana](https://go.crdb.dev/p/roachfana/teamcity-11499554-1693028847-78-n7cpu2/1693066513283/1693068384998)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-30985

shralex commented 1 year ago

@erikgrinaker assigning to you for now, since you probably have the most context on this one

erikgrinaker commented 1 year ago

The test setup fails to move a replica off of n5, retrying until the test times out:

16:18:18 failover.go:1674: moving 1 ranges off of n5 (database_name != 'kv')
16:18:19 failover.go:1674: moving 1 ranges off of n5 (database_name != 'kv')
16:18:20 failover.go:1674: moving 1 ranges off of n5 (database_name != 'kv')
...

It's trying to move all system ranges from nodes 4,5,6 to nodes 1,2,3:

https://github.com/cockroachdb/cockroach/blob/704e6e958103f7575dc64dc656408c60170fe197/pkg/cmd/roachtest/tests/failover.go#L562

It does so by trying to find a node in 1,2,3 that doesn't already have a replica:

https://github.com/cockroachdb/cockroach/blob/704e6e958103f7575dc64dc656408c60170fe197/pkg/cmd/roachtest/tests/failover.go#L1665-L1685
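For context, the gist of that loop as a rough Go sketch, not the exact test code (`relocateOff` is a hypothetical name; the real helper is at the link above). It assumes `ALTER RANGE RELOCATE` and `crdb_internal.ranges`, and that node IDs equal store IDs, as in this one-store-per-node cluster:

```go
package sketch

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// relocateOff keeps trying to move every range matching predicate off
// node `source` onto some node in `targets` that does not already hold
// a replica, retrying until none remain. A rough sketch of the
// failover.go helper linked above, not the exact code.
func relocateOff(ctx context.Context, db *sql.DB, predicate string, source int, targets []int) error {
	for {
		for _, target := range targets {
			// Relocate ranges that are on source but not yet on target. If
			// every target already holds a replica (r18 here is RF=5 with
			// replicas on all of n1-n3), these statements match no ranges.
			stmt := fmt.Sprintf(
				`ALTER RANGE RELOCATE FROM %d TO %d FOR
				 SELECT range_id FROM crdb_internal.ranges
				 WHERE %s AND %d = ANY(replicas) AND NOT (%d = ANY(replicas))`,
				source, target, predicate, source, target)
			if _, err := db.ExecContext(ctx, stmt); err != nil {
				return err
			}
		}
		var remaining int
		if err := db.QueryRowContext(ctx, fmt.Sprintf(
			`SELECT count(*) FROM crdb_internal.ranges WHERE %s AND %d = ANY(replicas)`,
			predicate, source)).Scan(&remaining); err != nil {
			return err
		}
		if remaining == 0 {
			return nil
		}
		// This is the line repeating in the log excerpt above: nothing can
		// move, so the loop spins until the 30m test timeout.
		fmt.Printf("moving %d ranges off of n%d (%s)\n", remaining, source, predicate)
		time.Sleep(time.Second)
	}
}
```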

However, looking at ranges.json on n5, we find that the only system range still on it (r18) has RF=5 and already has replicas on all of n1-n3, so there's no free target to move to:

ranges.json for r18 on n5:

```json
{
  "span": { "start_key": "/Table/15", "end_key": "/Table/16" },
  "raft_state": {
    "replica_id": 5,
    "hard_state": { "term": 8, "vote": 7, "commit": 742 },
    "lead": 7,
    "state": "StateFollower",
    "applied": 742,
    "progress": null
  },
  "state": {
    "state": {
      "raft_applied_index": 742,
      "lease_applied_index": 716,
      "desc": {
        "range_id": 18,
        "start_key": "lw==",
        "end_key": "mA==",
        "internal_replicas": [
          { "node_id": 1, "store_id": 1, "replica_id": 7, "type": 0 },
          { "node_id": 2, "store_id": 2, "replica_id": 2, "type": 0 },
          { "node_id": 3, "store_id": 3, "replica_id": 3, "type": 0 },
          { "node_id": 6, "store_id": 6, "replica_id": 4, "type": 0 },
          { "node_id": 5, "store_id": 5, "replica_id": 5, "type": 0 }
        ],
        "next_replica_id": 8,
        "generation": 16,
        "sticky_bit": {}
      },
      "lease": {
        "start": { "wall_time": 1693066696959330376 },
        "replica": { "node_id": 1, "store_id": 1, "replica_id": 7, "type": 2 },
        "proposed_ts": { "wall_time": 1693066696961704865 },
        "epoch": 1, "sequence": 5, "acquisition_type": 2
      },
      "truncated_state": { "index": 682, "term": 7 },
      "gc_threshold": {},
      "stats": {
        "contains_estimates": 0, "last_update_nanos": 1693067156597889611, "intent_age": 0,
        "gc_bytes_age": 17791055, "live_bytes": 20404, "live_count": 360, "key_bytes": 32420,
        "key_count": 582, "val_bytes": 20333, "val_count": 1160, "intent_bytes": 0,
        "intent_count": 0, "separated_intent_count": 0, "range_key_count": 0,
        "range_key_bytes": 0, "range_val_count": 0, "range_val_bytes": 0,
        "sys_bytes": 1564, "sys_count": 7, "abort_span_bytes": 0
      },
      "version": { "major": 23, "minor": 1, "patch": 0, "internal": 22 },
      "raft_closed_timestamp": { "wall_time": 1693067153598154741 },
      "raft_applied_index_term": 8,
      "gc_hint": { "latest_range_delete_timestamp": {} }
    },
    "last_index": 742,
    "raft_log_size": 50615,
    "range_max_bytes": 536870912,
    "active_closed_timestamp": { "wall_time": 1693068340946465964 },
    "tenant_id": 1,
    "closed_timestamp_sidetransport_info": {
      "replica_closed": { "wall_time": 1693068340946465964 }, "replica_lai": 716,
      "central_closed": { "wall_time": 1693068340946465964 }, "central_lai": 716
    }
  },
  "source_node_id": 5,
  "source_store_id": 5,
  "lease_history": [
    { "start": {}, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 },
      "proposed_ts": { "wall_time": 1693066542779275936 }, "epoch": 1, "sequence": 1, "acquisition_type": 2 },
    { "start": { "wall_time": 1693066558720744540 }, "expiration": { "wall_time": 1693066564720692905 },
      "replica": { "node_id": 4, "store_id": 4, "replica_id": 6, "type": 2 },
      "proposed_ts": { "wall_time": 1693066558720692905 }, "sequence": 2, "acquisition_type": 1 },
    { "start": { "wall_time": 1693066558720744540 },
      "replica": { "node_id": 4, "store_id": 4, "replica_id": 6, "type": 2 },
      "proposed_ts": { "wall_time": 1693066558723190455 }, "epoch": 1, "sequence": 3, "acquisition_type": 2 },
    { "start": { "wall_time": 1693066696959330376 }, "expiration": { "wall_time": 1693066702959277128 },
      "replica": { "node_id": 1, "store_id": 1, "replica_id": 7, "type": 2 },
      "proposed_ts": { "wall_time": 1693066696959277128 }, "sequence": 4, "acquisition_type": 1 },
    { "start": { "wall_time": 1693066696959330376 },
      "replica": { "node_id": 1, "store_id": 1, "replica_id": 7, "type": 2 },
      "proposed_ts": { "wall_time": 1693066696961704865 }, "epoch": 1, "sequence": 5, "acquisition_type": 2 }
  ],
  "problems": {},
  "stats": {
    "queries_per_second": 0.0011246672899063947,
    "writes_per_second": 1.8781943741267804,
    "requests_per_second": 0.0022493345797191884,
    "cpu_time_per_second": 71224.1115891991
  },
  "lease_status": {
    "lease": {
      "start": { "wall_time": 1693066696959330376 },
      "replica": { "node_id": 1, "store_id": 1, "replica_id": 7, "type": 2 },
      "proposed_ts": { "wall_time": 1693066696961704865 },
      "epoch": 1, "sequence": 5, "acquisition_type": 2
    },
    "now": { "wall_time": 1693068343983646753 },
    "request_time": { "wall_time": 1693068343983646753 },
    "state": 1,
    "liveness": { "node_id": 1, "epoch": 1, "expiration": { "wall_time": 1693068348620073112, "logical": 0 } },
    "min_valid_observed_timestamp": {}
  },
  "quiescent": true,
  "top_k_locks_by_wait_queue_waiters": null,
  "locality": {
    "tiers": [
      { "key": "cloud", "value": "gce" },
      { "key": "region", "value": "us-east1" },
      { "key": "zone", "value": "us-east1-b" }
    ]
  },
  "lease_valid": true
}
```

However, the zone configs specify RF=3 for all ranges, including all system ranges:

https://github.com/cockroachdb/cockroach/blob/704e6e958103f7575dc64dc656408c60170fe197/pkg/cmd/roachtest/tests/failover.go#L537-L539
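The linked setup amounts to statements of roughly this shape (a sketch only; `configureRF3` is a hypothetical name and the exact zone list lives in failover.go above):

```go
package sketch

import (
	"context"
	"database/sql"
	"fmt"
)

// configureRF3 sketches the zone-config setup: pin the built-in zones,
// including the system ranges and the system database (which covers the
// jobs table, i.e. r18 here), to three replicas.
func configureRF3(ctx context.Context, db *sql.DB) error {
	for _, zone := range []string{
		"RANGE default", "RANGE meta", "RANGE liveness", "RANGE system",
		"DATABASE system",
	} {
		stmt := fmt.Sprintf(`ALTER %s CONFIGURE ZONE USING num_replicas = 3`, zone)
		if _, err := db.ExecContext(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}
```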

The range info is the same across all nodes. The range belongs to table 15, i.e. the jobs table. All zone configs are confirmed to specify num_replicas = 3. n1 is the leaseholder, but it doesn't appear to be attempting any operations on the range.
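One way to confirm those replica counts from SQL (a sketch; `jobsReplicaCounts` is a hypothetical helper, not test code):

```go
package sketch

import (
	"context"
	"database/sql"
)

// jobsReplicaCounts lists the replica count for each range of the jobs
// table. Per the ranges.json above, r18 would report five replicas here
// despite the num_replicas = 3 zone config.
func jobsReplicaCounts(ctx context.Context, db *sql.DB) (*sql.Rows, error) {
	return db.QueryContext(ctx, `
		SELECT range_id, array_length(replicas, 1) AS replica_count
		FROM crdb_internal.ranges
		WHERE database_name = 'system' AND table_name = 'jobs'`)
}
```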

Throwing this back to KV to find out why the zone config isn't being applied to r18, i.e. the jobs table.

cockroach-teamcity commented 1 year ago

roachtest.failover/partial/lease-leader failed with artifacts on master @ 85cbfffeaa60c3d40e51eb6be7e49eca4dcc8a18:

(test_runner.go:1099).runTest: test timed out (30m0s)
(assertions.go:333).Fail: 
    Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1668
                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:562
                                main/pkg/cmd/roachtest/test_runner.go:1084
                                GOROOT/src/runtime/asm_amd64.s:1594
    Error:          Received unexpected error:
                    dial tcp 35.243.139.125:26257: connect: connection refused
    Test:           failover/partial/lease-leader
(require.go:1360).NoError: FailNow called
(cluster.go:2139).Run: context canceled
test artifacts and logs in: /artifacts/failover/partial/lease-leader/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
See: [Grafana](https://go.crdb.dev/p/roachfana/teamcity-11532636-1693288026-82-n7cpu2/1693324872308/1693326731183)


arulajmani commented 1 year ago

Following up on what Erik said, the span configurations indicate the correct replication factor as well. Nothing is returned when running:

```sh
cat ~/Downloads/debug/system.span_configurations.txt | awk '{print $NF}' | tail -n +2 | cut -c '3-' | \
  ./cockroach debug decode-proto --schema 'cockroach.roachpb.SpanConfig' | jq '.numReplicas' | grep 5
```

So span configs were reconciled correctly. I'm not sure why n1 isn't trying to down-replicate r18. @kvoli would you mind having a look?
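For anyone reproducing this against a live cluster instead of a debug zip, roughly the same check can be expressed in SQL via the crdb_internal.pb_to_json builtin (a sketch; `findRF5SpanConfigs` is a hypothetical name):

```go
package sketch

import (
	"context"
	"database/sql"
)

// findRF5SpanConfigs lists any span configs that still specify five
// replicas. Matching the pipeline above, this should return no rows.
func findRF5SpanConfigs(ctx context.Context, db *sql.DB) (*sql.Rows, error) {
	return db.QueryContext(ctx, `
		SELECT start_key,
		       crdb_internal.pb_to_json('cockroach.roachpb.SpanConfig', config)->>'numReplicas' AS num_replicas
		FROM system.span_configurations
		WHERE (crdb_internal.pb_to_json('cockroach.roachpb.SpanConfig', config)->>'numReplicas')::INT = 5`)
}
```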

erikgrinaker commented 1 year ago

Maybe this test disables the replicate queue? I forget, and I'm not at a computer right now.

arulajmani commented 1 year ago

You're right, it does disable the replicate queue. Interestingly, it does so after the call to `WaitFor3XReplication`. But looking at the code, that function name is a bit misleading; see:

https://github.com/cockroachdb/cockroach/blob/c86e1a1af107b4a870f24f46d4046d104bef1d61/pkg/cmd/roachtest/tests/util.go#L80-L102
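Paraphrasing the linked check (a sketch, assuming it compares against "at least" three voters; `underReplicated` is a hypothetical name, not the actual code):

```go
package sketch

import (
	"context"
	"database/sql"
)

// underReplicated counts ranges with FEWER than three voters, which is
// roughly what WaitFor3XReplication polls on until it reaches zero. A
// range sitting at five replicas, like r18 above, never trips this
// condition, so the wait returns before any down-replication to
// num_replicas = 3 has happened.
func underReplicated(ctx context.Context, db *sql.DB) (int, error) {
	var n int
	err := db.QueryRowContext(ctx, `
		SELECT count(*) FROM crdb_internal.ranges
		WHERE array_length(replicas, 1) < 3`).Scan(&n)
	return n, err
}
```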

So it seems like this function assumes we're calling it in the context of up-replication, which in this case we aren't. I'll send out a patch.