Closed cockroach-teamcity closed 1 year ago
Test timed out with 1 violating lease:
10:33:47 lease_preferences.go:390: 1038.63s: violating(true) duration=1038.63s: [n3: 0, n4: 0, n5: 1] less-preferred(true) duration=1038.63s: [n3: 473, n4: 527, n5: 0]
10:33:47 lease_preferences.go:405: not yet meeting requirements will check again in 10s
10:33:47 test_impl.go:349: test failure #1: full stack retained in failure_1.log: (test_runner.go:1075).runTest: test timed out (30m0s)
I'll take a look at this today.
Surprisingly, the lease doesn't get picked up as violating once it arrives on n5
. For every other violating lease, which doesn't transfer away immediately:
I230730 10:16:25.168234 78081 13@kv/kvserver/replicate_queue.go:758 ⋮ [T1,n5,replicate,s5,r414/1:‹/Table/106/1/-{80071…-79886…}›] 5447 error processing replica: can't transfer r414 lease violating preferences, no suitable target
The first log line for the r702
with a violating lease occurs once the down region is considered dead:
I230730 10:22:41.257593 104986 13@kv/kvserver/replicate_queue.go:758 ⋮ [T1,n5,replicate,s5,r702/4:‹/Table/106/1/-9{20494…-18651…}›] 11678 error processing replica: ‹0 of 3 live stores are able to take a new replica for the range (3 already have a voter, 0 already have a non-voter); likely not enough nodes in cluster›
I'm going to check the lease history below to see if it lines up with stopping n1/n2.
{
"span": {
"start_key": "/Table/106/1/-9204943721096824208",
"end_key": "/Table/106/1/-9186515405338872608"
},
"raft_state": {
"replica_id": 4,
"hard_state": {
"term": 9,
"vote": 4,
"commit": 61
},
"lead": 4,
"state": "StateLeader",
"applied": 61,
"progress": {
"2": {
"match": 61,
"next": 62,
"state": "StateReplicate"
},
"4": {
"match": 61,
"next": 62,
"state": "StateReplicate"
},
"6": {
"next": 55,
"state": "StateProbe",
"paused": true
},
"7": {
"next": 55,
"state": "StateProbe",
"paused": true
},
"8": {
"match": 61,
"next": 62,
"state": "StateReplicate"
}
}
},
"state": {
"state": {
"raft_applied_index": 61,
"lease_applied_index": 37,
"desc": {
"range_id": 702,
"start_key": "8omAgEF4dJ6PunA=",
"end_key": "8omAgILw6T0fdOA=",
"internal_replicas": [
{
"node_id": 2,
"store_id": 2,
"replica_id": 6,
"type": 0
},
{
"node_id": 3,
"store_id": 3,
"replica_id": 2,
"type": 0
},
{
"node_id": 5,
"store_id": 5,
"replica_id": 4,
"type": 0
},
{
"node_id": 1,
"store_id": 1,
"replica_id": 7,
"type": 0
},
{
"node_id": 4,
"store_id": 4,
"replica_id": 8,
"type": 0
}
],
"next_replica_id": 9,
"generation": 32,
"sticky_bit": {
"wall_time": 9223372036854775807,
"logical": 2147483647
}
},
"lease": {
"start": {
"wall_time_human": "10:16:24.701",
"wall_time": 1690712184701227255
},
"replica": {
"node_id": 5,
"store_id": 5,
"replica_id": 4,
"type": 0
},
"proposed_ts": {
"wall_time": 1690712184707952378
},
"epoch": 1,
"sequence": 6,
"acquisition_type": 2
},
"truncated_state": {
"index": 10,
"term": 5
},
"gc_threshold": {},
"stats": {
"contains_estimates": 0,
"last_update_nanos": 1690711634878422403,
"intent_age": 0,
"gc_bytes_age": 0,
"live_bytes": 0,
"live_count": 0,
"key_bytes": 0,
"key_count": 0,
"val_bytes": 0,
"val_count": 0,
"intent_bytes": 0,
"intent_count": 0,
"separated_intent_count": 0,
"range_key_count": 0,
"range_key_bytes": 0,
"range_val_count": 0,
"range_val_bytes": 0,
"sys_bytes": 2000,
"sys_count": 8,
"abort_span_bytes": 0
},
"version": {
"major": 23,
"minor": 1,
"patch": 0,
"internal": 16
},
"raft_closed_timestamp": {
"wall_time": 1690713254188626489
},
"raft_applied_index_term": 9,
"gc_hint": {
"latest_range_delete_timestamp": {}
}
},
"last_index": 61,
"num_dropped": 7,
"raft_log_size": 18267,
"raft_log_size_trusted": true,
"approximate_proposal_quota": 8388608,
"proposal_quota_base_index": 61,
"range_max_bytes": 536870912,
"active_closed_timestamp": {
"wall_time": 1690713264087321153
},
"tenant_id": 1,
"closed_timestamp_sidetransport_info": {
"replica_closed": {
"wall_time": 1690713264087321153
},
"replica_lai": 37,
"central_closed": {}
}
},
"source_node_id": 5,
"source_store_id": 5,
"lease_history": [
{
"start": {
"wall_time_human": "10:04:37.713",
"wall_time": 1690711477713795632
},
"replica": {
"node_id": 2,
"store_id": 2,
"replica_id": 6,
"type": 2
},
"proposed_ts": {
"wall_time": 1690711477790866994
},
"epoch": 1,
"sequence": 3,
"acquisition_type": 2
},
{
"start": {
"wall_time_human": "10:15:45.744",
"wall_time": 1690712145744202269
},
"expiration": {
"wall_time": 1690712151744064898
},
"replica": {
"node_id": 1,
"store_id": 1,
"replica_id": 7,
"type": 0
},
"proposed_ts": {
"wall_time": 1690712145744064898
},
"sequence": 4,
"acquisition_type": 1
},
{
"start": {
"wall_time_human": "10:15:45.744",
"wall_time": 1690712145744202269
},
"replica": {
"node_id": 1,
"store_id": 1,
"replica_id": 7,
"type": 0
},
"proposed_ts": {
"wall_time": 1690712145745634543
},
"epoch": 1,
"sequence": 5,
"acquisition_type": 2
},
{
"start": {
"wall_time_human": "10:16:24.701",
"wall_time": 1690712184701227255
},
"replica": {
"node_id": 5,
"store_id": 5,
"replica_id": 4,
"type": 0
},
"proposed_ts": {
"wall_time": 1690712184701227255
},
"epoch": 1,
"sequence": 6,
"acquisition_type": 2
},
{
"start": {
"wall_time_human": "10:16:24.701",
"wall_time": 1690712184701227255
},
"replica": {
"node_id": 5,
"store_id": 5,
"replica_id": 4,
"type": 0
},
"proposed_ts": {
"wall_time": 1690712184707952378
},
"epoch": 1,
"sequence": 6,
"acquisition_type": 2
}
],
"problems": {
"underreplicated": true
},
"stats": {
"queries_per_second": 0.005583400253163264,
"writes_per_second": 0.005583400251001841,
"requests_per_second": 0.007444533668307271,
"write_bytes_per_second": 0.6662857632756795,
"cpu_time_per_second": 3182.9959820788063
},
"lease_status": {
"lease": {
"start": {
"wall_time_human": "10:16:24.701",
"wall_time": 1690712184701227255
},
"replica": {
"node_id": 5,
"store_id": 5,
"replica_id": 4,
"type": 0
},
"proposed_ts": {
"wall_time": 1690712184707952378
},
"epoch": 1,
"sequence": 6,
"acquisition_type": 2
},
"now": {
"wall_time": 1690713267096521435
},
"request_time": {
"wall_time": 1690713267096521435
},
"state": 1,
"liveness": {
"node_id": 5,
"epoch": 1,
"expiration": {
"wall_time": 1690713271495982948,
"logical": 0
}
},
"min_valid_observed_timestamp": {
"wall_time": 1690712184701227255
}
},
"quiescent": true,
"top_k_locks_by_wait_queue_waiters": null,
"locality": {
"tiers": [
{
"key": "region",
"value": "fake-region"
},
{
"key": "zone",
"value": "fake-zone"
},
{
"key": "dc",
"value": "3"
}
]
},
"is_leaseholder": true,
"lease_valid": true
}
The replica certainly did not go to purgatory, as it should have:
It also doesn't seem like it even got eagerly enqueued into the replicate queue, which it should have here:
Because we don't see any log lines referencing the replica until much later.
I got a repro. Some lease acquisitions after stopping 2 nodes didn't hit:
Which is surprising, but it means that they weren't considered violating on acquisition.
Slow span config propagation?
Slow span config propagation?
That's what I suspected, however I didn't bump logging enough to be certain.
I bumped logging further and am stressing now for another hit.
Seems more likely that the lease was invalid when hitting the check. Enqueuing the replica on unknown
(invalid lease or not leaseholder) would resolve this case.
Once a span config update arrives, the replica should be queued into the replicate queue. It's possible the async maybe add limit prevented this from occurring but thats far fetched.
Right, that's possible -- we should log a warning in that case:
Yup looks like it was due to invalid lease causing the preference status to be unknown. This is from a repro, on a range which got its lease stuck:
I230801 13:31:23.484127 480 kv/kvserver/replica_proposal.go:526 ⋮ [T1,n5,s5,r150/3:‹/Table/106/1/-4{65734…-47470…}›,raft] 1193 acquired_lease=repl=(n5,s5):3 seq=10 start=1690896681.752762586,0 epo=1 pro=1690896681.752762586,0 preferences=[{[‹+dc=1›]} {[‹+dc=2›]}] lease_preference_status=‹unknown›
roachtest.lease-preferences/full-first-preference-down failed with artifacts on master @ f295bd861a3a427652b19c2254d2401ebb4a3c8e:
Parameters:
ROACHTEST_arch=amd64
,ROACHTEST_cloud=gce
,ROACHTEST_cpu=4
,ROACHTEST_encrypted=false
,ROACHTEST_ssd=0
Help
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/p/roachfana//1690711427218/1690713270780)
/cc @cockroachdb/kv-triage
This test on roachdash | Improve this report!
Jira issue: CRDB-30228