cockroachdb / cockroach

roachtest: change-replicas/mixed-version failed #118006

Closed cockroach-teamcity closed 8 months ago

cockroach-teamcity commented 8 months ago

roachtest.change-replicas/mixed-version failed with artifacts on release-23.2 @ 7564d506441d9f9f3930e5bb9896bdd525ea42f5:

(mixedversion.go:561).Run: mixed-version test failure while running step 52 (run "move replicas"): n2 still has 1 replicas
test artifacts and logs in: /artifacts/change-replicas/mixed-version/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/replication

Jira issue: CRDB-35452

kvoli commented 8 months ago

Range r455 is still on n2, despite the [-node2] constraint:

```
{ "span": { "start_key": "/Table/170/1/86", "end_key": "/Table/170/1/87" }, "raft_state": { "replica_id": 24, "hard_state": { "term": 26, "vote": 24, "commit": 332 }, "lead": 24, "state": "StateLeader", "applied": 332, "progress": { "24": { "match": 332, "next": 333, "state": "StateReplicate" }, "25": { "match": 332, "next": 333, "state": "StateProbe" }, "26": { "match": 332, "next": 333, "state": "StateProbe" } } }, "state": { "state": { "raft_applied_index": 332, "lease_applied_index": 244, "desc": { "range_id": 455, "start_key": "9qqJ3g==", "end_key": "9qqJ3w==", "internal_replicas": [ { "node_id": 1, "store_id": 1, "replica_id": 26, "type": 0 }, { "node_id": 2, "store_id": 2, "replica_id": 24, "type": 0 }, { "node_id": 3, "store_id": 3, "replica_id": 25, "type": 0 } ], "next_replica_id": 27, "generation": 193, "sticky_bit": { "wall_time": 9223372036854775807, "logical": 2147483647 } }, "lease": { "start": { "wall_time": 1705734210288583265 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 24, "type": 0 }, "proposed_ts": { "wall_time": 1705734210291628521 }, "epoch": 32, "sequence": 53, "acquisition_type": 2 }, "truncated_state": { "index": 297, "term": 24 }, "gc_threshold": {}, "stats": { "contains_estimates": 0, "last_update_nanos": 1705734183437724436, "lock_age": 0, "gc_bytes_age": 0, "live_bytes": 0, "live_count": 0, "key_bytes": 0, "key_count": 0, "val_bytes": 0, "val_count": 0, "intent_bytes": 0, "intent_count": 0, "lock_bytes": 0, "lock_count": 0, "range_key_count": 0, "range_key_bytes": 0, "range_val_count": 0, "range_val_bytes": 0, "sys_bytes": 8936, "sys_count": 7, "abort_span_bytes": 0 }, "version": { "major": 21, "minor": 2, "patch": 0, "internal": 56 }, "raft_closed_timestamp": { "wall_time": 1705734207288646928 }, "raft_applied_index_term": 26, "gc_hint": { "latest_range_delete_timestamp": {}, "gc_timestamp": {}, "gc_timestamp_next": {} } }, "last_index": 332, "num_dropped": 1, "raft_log_size": 14190, "raft_log_size_trusted": true,
"approximate_proposal_quota": 8388608, "proposal_quota_base_index": 332, "range_max_bytes": 67108864, "active_closed_timestamp": { "wall_time": 1705734537847945352 }, "tenant_id": 1, "closed_timestamp_sidetransport_info": { "replica_closed": { "wall_time": 1705734537847945352 }, "replica_lai": 244, "central_closed": {} } }, "source_node_id": 2, "source_store_id": 2, "lease_history": [ { "start": { "wall_time": 1705734114565182202 }, "replica": { "node_id": 4, "store_id": 4, "replica_id": 23, "type": 0 }, "proposed_ts": { "wall_time": 1705734114573537714 }, "epoch": 28, "sequence": 49, "acquisition_type": 2 }, { "start": { "wall_time": 1705734182414869115 }, "expiration": { "wall_time": 1705734188414770582 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 26, "type": 2 }, "proposed_ts": { "wall_time": 1705734182414770582 }, "sequence": 50, "acquisition_type": 1 }, { "start": { "wall_time": 1705734182414869115 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 26, "type": 2 }, "proposed_ts": { "wall_time": 1705734182415964584 }, "epoch": 32, "sequence": 51, "acquisition_type": 2 }, { "start": { "wall_time": 1705734210288583265 }, "expiration": { "wall_time": 1705734216288536613 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 24, "type": 0 }, "proposed_ts": { "wall_time": 1705734210288536613 }, "sequence": 52, "acquisition_type": 1 }, { "start": { "wall_time": 1705734210288583265 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 24, "type": 0 }, "proposed_ts": { "wall_time": 1705734210291628521 }, "epoch": 32, "sequence": 53, "acquisition_type": 2 } ], "problems": {}, "stats": { "queries_per_second": 0.012096952579674318, "requests_per_second": 0.012096952581723025, "write_bytes_per_second": 0.4415387693357099, "cpu_time_per_second": 56942.66016047397 }, "lease_status": { "lease": { "start": { "wall_time": 1705734210288583265 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 24, "type": 0 }, "proposed_ts": { "wall_time":
1705734210291628521 }, "epoch": 32, "sequence": 53, "acquisition_type": 2 }, "now": { "wall_time": 1705734541034104550 }, "request_time": { "wall_time": 1705734541034104550 }, "state": 1, "liveness": { "node_id": 2, "epoch": 32, "expiration": { "wall_time": 1705734545377662800, "logical": 0 } }, "min_valid_observed_timestamp": { "wall_time": 1705734210288583265 } }, "ticking": true, "top_k_locks_by_wait_queue_waiters": null, "locality": { "tiers": [ { "key": "cloud", "value": "gce" }, { "key": "region", "value": "us-east1" }, { "key": "zone", "value": "us-east1-b" } ] }, "is_leaseholder": true, "lease_valid": true
```

The other two replicas, on n1 and n3, are in StateProbe, which could explain the failure. However, they are up to date with the leader (match 332, same as the leader) and the range is ticking, so it's odd that they are in StateProbe here.

After the constraint was set, there are no log entries on n2 (the leaseholder) that reference the range.

This reminds me of https://github.com/cockroachdb/cockroach/issues/114549#issuecomment-1850741677. n2 was running v23.1.5, so it wouldn't include the fix from https://github.com/cockroachdb/raft/pull/2.

Closing as already fixed / duplicate.

erikgrinaker commented 8 months ago

Should we set AlwaysUseLatestPredecessors for this roachtest, to avoid flakes?

kvoli commented 8 months ago

> Should we set AlwaysUseLatestPredecessors for this roachtest, to avoid flakes?

Good idea, I'll re-open this and put up a patch.