Something's pretty messed up here. Looks like we're losing r1 meta availability somehow. n1 rebalances away its r1 replica at 11:53:20:
I230704 11:53:20.598003 49673 13@kv/kvserver/allocator/plan/replicate.go:900 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4044 rebalancing ‹voter› n1,s1 to n6,s6: ‹[1*:14752, 6:14750, 3:14750, 4:14750, 5:14742]›
I230704 11:53:20.598819 49673 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4045 change replicas (add [(n6,s6):7LEARNER] remove []): existing descriptor r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, next=7, gen=12]
I230704 11:53:20.607565 49673 13@kv/kvserver/replica_raft.go:396 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 4046 proposing SIMPLE(l7) [(n6,s6):7LEARNER]: after=[(n1,s1):1 (n7,s7):6 (n4,s4):3 (n9,s9):4 (n8,s8):5 (n6,s6):7LEARNER] next=8
I230704 11:53:20.688360 49700 13@kv/kvserver/store_snapshot.go:1596 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 4047 streamed INITIAL snapshot e9d45910 at applied index 14759 to (n6,s6):7LEARNER with 1.1 MiB in 0.04s @ 31 MiB/s: kvs=9203 rangeKVs=0, rate-limit: 32 MiB/s, queued: 0.04s
I230704 11:53:20.709957 49673 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4048 change replicas (add [(n6,s6):7VOTER_INCOMING] remove [(n1,s1):1VOTER_DEMOTING_LEARNER]): existing descriptor r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, (n6,s6):7LEARNER, next=8, gen=13]
I230704 11:53:20.714175 49673 13@kv/kvserver/replica_raft.go:396 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›] 4049 proposing ENTER_JOINT(r1 l1 v7) [(n6,s6):7VOTER_INCOMING], [(n1,s1):1VOTER_DEMOTING_LEARNER]: after=[(n1,s1):1VOTER_DEMOTING_LEARNER (n7,s7):6 (n4,s4):3 (n9,s9):4 (n8,s8):5 (n6,s6):7VOTER_INCOMING] next=8
I230704 11:53:20.735057 49673 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4050 change replicas (add [] remove []): existing descriptor r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1VOTER_DEMOTING_LEARNER, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, (n6,s6):7VOTER_INCOMING, next=8, gen=14]
I230704 11:53:20.755340 49673 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 4051 change replicas (add [] remove [(n1,s1):1LEARNER]): existing descriptor r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1LEARNER, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, (n6,s6):7, next=8, gen=15]
I230704 11:53:20.776110 185 kv/kvserver/store_remove_replica.go:151 ⋮ [T1,n1,s1,r1/1:‹/{Min-System/NodeL…}›,raft] 579 removing replica r1/1
Then about a minute later we start seeing the first signs of r1 unavailability:
W230704 11:54:10.395697 39032 kv/kvclient/kvcoord/dist_sender.go:1808 ⋮ [T1,n1,replicaGC,s1,r393/1:‹/Table/106/1/5{88784…-92470…}›] 767 slow range RPC: have been waiting 60.00s (1 attempts) for RPC Scan [/Meta2/Table/106/1/‹5887846884665536192›/‹NULL›,/System‹/›‹"›‹"›) to r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1, (n7,s7):6, (n4,s4):3, (n9,s9):4, (n8,s8):5, next=7, gen=12]; resp: ‹(err: context deadline exceeded)›
Notice that this is not 60 seconds after the replica was removed at 11:53:20; the request began around 11:53:10. The only suspect log entry we see at that time is this one:
W230704 11:53:09.815397 38115 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 506 failed to push during intent resolution: {<nil> 0 {‹0xc007ae1748›} ‹index:0 › 1688471589.815357569,0}
We keep seeing signs of r1 unavailability throughout the cluster until the test times out an hour later. However, r1 is alive and well with leader/leaseholder on n6:
"raft_state": {
"replica_id": 7,
"hard_state": {
"term": 7,
"vote": 7,
"commit": 17837
},
"lead": 7,
"state": "StateLeader",
"applied": 17837,
"progress": {
"3": {
"match": 17837,
"next": 17838,
"state": "StateReplicate"
},
"4": {
"match": 17837,
"next": 17838,
"state": "StateReplicate"
},
"5": {
"match": 17837,
"next": 17838,
"state": "StateReplicate"
},
"6": {
"match": 17837,
"next": 17838,
"state": "StateReplicate"
},
"7": {
"match": 17837,
"next": 17838,
"state": "StateReplicate"
}
}
},
"state": {
"raft_applied_index": 17837,
"lease_applied_index": 16610,
"desc": {
"range_id": 1,
"end_key": "BABsaXZlbmVzcy0=",
"internal_replicas": [
{
"node_id": 6,
"store_id": 6,
"replica_id": 7,
"type": 0
},
{
"node_id": 7,
"store_id": 7,
"replica_id": 6,
"type": 0
},
{
"node_id": 4,
"store_id": 4,
"replica_id": 3,
"type": 0
},
{
"node_id": 9,
"store_id": 9,
"replica_id": 4,
"type": 0
},
{
"node_id": 8,
"store_id": 8,
"replica_id": 5,
"type": 0
}
],
"next_replica_id": 8,
"generation": 16,
"sticky_bit": {}
},
"lease": {
"start": {
"wall_time": 1688471600725422293
},
"expiration": {
"wall_time": 1688475527720524070
},
"replica": {
"node_id": 6,
"store_id": 6,
"replica_id": 7,
"type": 0
},
"deprecated_start_stasis": {
"wall_time": 1688475527720524070
},
"proposed_ts": {
"wall_time": 1688475521720524070
},
"sequence": 2,
"acquisition_type": 2
},
```
We must somehow have stopped gossiping r1 or something. However, we see fresh gossip info for r1 pointing to n6, here from n1:
"first-range": {
"value": {
"raw_bytes": "7ujJvAMIARIAGgsEAGxpdmVuZXNzLSIICAYQBhgHIAAiCAgHEAcYBiAAIggIBBAEGAMgACIICAkQCRgEIAAiCAgIEAgYBSAAKAgwEDoA",
"timestamp": {
"wall_time": 1688475521220175881
}
},
"orig_stamp": 1688475521220177273,
"ttl_stamp": 9223372036854775807,
"hops": 2,
"node_id": 6,
"peer_id": 3
},
```
The intent resolution error appears to be a bare `kvpb.Error` for which decoding failed, so `GoError()` converted it to an `internalError` that simply called `kvpb.Error.String()`, dispatching to the default Protobuf formatter.
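For what it's worth, the mechanism is roughly the following (a minimal sketch with made-up names, not the actual `kvpb` code):

```go
// Rough sketch of the fallback: if the typed error detail fails to decode,
// the error surfaces as an opaque wrapper whose message is just the default
// struct formatting -- analogous to (not identical to) the
// "{<nil> 0 {0xc007ae1748} index:0 ...}" output in the log line above.
package main

import "fmt"

type kvError struct {
	detail  error // decoded detail; nil when decoding failed
	index   int32
	nowWall int64
}

func (e *kvError) goError() error {
	if e.detail != nil {
		return e.detail // normal path: the structured error
	}
	// Fallback: an opaque internal error via the default %v formatter.
	return fmt.Errorf("%v", *e)
}

func main() {
	fmt.Println((&kvError{nowWall: 1688471589}).goError()) // {<nil> 0 1688471589}
}
```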
Not clear if we can recover the original error, or if it's significant here.
On n6 (the new r1 leaseholder), we're seeing 9 goroutines stuck on intent resolution. None of the other nodes have this. One of them has been stuck for 65 minutes, since the outage began.
We also see these failed pushes:
n1 W230704 11:53:09.815397 38115 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 506 failed to push during intent resolution: {<nil> 0 {‹0xc007ae1748›} ‹index:0 › 1688471589.815357569,0}
n4 W230704 11:53:10.189270 15507 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 482 failed to push during intent resolution: {<nil> 0 {‹0xc0028a10f8›} ‹index:0 › 1688471590.189218613,0}
n6 W230704 11:53:21.992761 27477 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 619 failed to push during intent resolution: {<nil> 0 {‹0xc00a456820›} ‹index:0 › 1688471601.992707689,0}
n7 W230704 11:52:58.578073 4545 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 145 failed to push during intent resolution: {<nil> 0 {‹0xc003fc4508›} ‹index:3 › 1688471578.578010243,0}
n8 W230704 11:52:53.391008 2876 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 114 failed to push during intent resolution: {<nil> 0 {‹0xc000a53be8›} ‹index:0 › 1688471573.390978604,0}
n9 W230704 11:53:00.142659 5163 kv/kvserver/intentresolver/intent_resolver.go:502 ⋮ [-] 166 failed to push during intent resolution: {<nil> 0 {‹0xc0085dd198›} ‹index:0 › 1688471580.142571490,0}
I'm also seeing a bunch of slow latch acquisitions, but only on n2, across 66 ranges. This began right around the time we started having issues with r1. Unclear if it's causal or just fallout.
391:I230704 11:53:01.411830 9172 kv/kvserver/replica_command.go:439 ⋮ [T1,n2,s2,r160/6:‹/Table/106/1/5{57456…-64827…}›,*kvpb.AdminSplitRequest] 314 initiating a split of this range at key /Table/106/1/‹5611422148296262192› [r352] (‹manual›)
487:W230704 11:53:20.527384 9316 kv/kvserver/spanlatch/manager.go:568 ⋮ [T1,n2,s2,r352/1:‹/Table/106/1/56{11422…-48278…}›] 410 have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/5611422148296262192/RangeDescriptor›@0,0, held by write latch ‹/Local/Range/Table/106/1/5611422148296262192/RangeDescriptor›@0,0
n2 does have 34 write commands that have been replicating for 65 minutes, all of which appear to be txn heartbeats.
No replica circuit breakers were tripped.
All of the stuck ranges on n2 follow the same pattern. They have a couple of pending proposals as well as read/write latches. Here's r352's range info from n2:
```
{ "span": { "start_key": "/Table/106/1/5611422148296262192", "end_key": "/Table/106/1/5648278779812165392" }, "raft_state": { "replica_id": 1, "hard_state": { "term": 6, "vote": 1, "commit": 19 }, "lead": 1, "state": "StateLeader", "applied": 19, "progress": { "1": { "match": 19, "next": 20, "state": "StateReplicate" }, "2": { "match": 19, "next": 20, "state": "StateReplicate" }, "3": { "match": 19, "next": 20, "state": "StateReplicate" }, "4": { "match": 19, "next": 20, "state": "StateReplicate" }, "5": { "next": 19, "state": "StateProbe", "paused": true } } }, "state": { "state": { "raft_applied_index": 19, "lease_applied_index": 7, "desc": { "range_id": 352, "start_key": "8on9Td/Gtpn1QjA=", "end_key": "8on9TmK3n9cUtxA=", "internal_replicas": [ { "node_id": 2, "store_id": 2, "replica_id": 1, "type": 0 }, { "node_id": 6, "store_id": 6, "replica_id": 2, "type": 0 }, { "node_id": 3, "store_id": 3, "replica_id": 3, "type": 0 }, { "node_id": 4, "store_id": 4, "replica_id": 4, "type": 0 }, { "node_id": 8, "store_id": 8, "replica_id": 5, "type": 1 } ], "next_replica_id": 6, "generation": 67, "sticky_bit": { "wall_time": 9223372036854775807, "logical": 2147483647 } }, "lease": { "start": { "wall_time": 1688471581088063261 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1688471581148598109 }, "epoch": 1, "sequence": 13, "acquisition_type": 2 }, "truncated_state": { "index": 10, "term": 5 }, "gc_threshold": {}, "stats": { "contains_estimates": 0, "last_update_nanos": 1688471582558396732, "intent_age": 0, "gc_bytes_age": 0, "live_bytes": 6600, "live_count": 200, "key_bytes": 5000, "key_count": 200, "val_bytes": 1600, "val_count": 200, "intent_bytes": 0, "intent_count": 0, "separated_intent_count": 0, "range_key_count": 0, "range_key_bytes": 0, "range_val_count": 0, "range_val_bytes": 0, "sys_bytes": 623, "sys_count": 5, "abort_span_bytes": 0 }, "version": { "major": 23, "minor": 1, "patch": 0, "internal": 10 }, "raft_closed_timestamp": { "wall_time": 1688471579558611788 }, "raft_applied_index_term": 6, "gc_hint": { "latest_range_delete_timestamp": {} } }, "last_index": 19, "num_pending": 2, "raft_log_size": 3114, "approximate_proposal_quota": 8388109, "proposal_quota_base_index": 19, "range_max_bytes": 536870912, "active_closed_timestamp": { "wall_time": 1688471585467455751 }, "tenant_id": 1, "closed_timestamp_sidetransport_info": { "replica_closed": { "wall_time": 1688471585467455751 }, "replica_lai": 7, "central_closed": {} } }, "source_node_id": 2, "source_store_id": 2, "lease_history": [ { "start": { "wall_time": 1688471581088063261 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1688471581148598109 }, "epoch": 1, "sequence": 13, "acquisition_type": 2 } ], "problems": { "underreplicated": true }, "stats": { "queries_per_second": 0.0032683483833489094, "requests_per_second": 0.0032683483839620617, "cpu_time_per_second": 605.5896575170474 }, "lease_status": { "lease": { "start": { "wall_time": 1688471581088063261 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1688471581148598109 }, "epoch": 1, "sequence": 13, "acquisition_type": 2 }, "now": { "wall_time": 1688475524035369955 }, "request_time": { "wall_time": 1688475524035369955 }, "state": 1, "liveness": { "node_id": 2, "epoch": 1, "expiration": { "wall_time": 1688475529219325917, "logical": 0 } }, "min_valid_observed_timestamp": { "wall_time": 1688471581088063261 } },
"ticking": true, "read_latches": 4, "write_latches": 2, "top_k_locks_by_wait_queue_waiters": null, "locality": { "tiers": [ { "key": "cloud", "value": "gce" }, { "key": "region", "value": "us-east1" }, { "key": "zone", "value": "us-east1-b" } ] }, "is_leaseholder": true, "lease_valid": true },
```
Interestingly, I even found a few of these that were quiesced:
```
{ "span": { "start_key": "/Table/106/1/6717121093773358192", "end_key": "/Table/106/1/6753977725289261392" }, "raft_state": { "replica_id": 5, "hard_state": { "term": 7, "vote": 5, "commit": 37 }, "lead": 5, "state": "StateLeader", "applied": 37, "progress": { "3": { "match": 37, "next": 38, "state": "StateReplicate" }, "4": { "match": 37, "next": 38, "state": "StateReplicate" }, "5": { "match": 37, "next": 38, "state": "StateReplicate" } } }, "state": { "state": { "raft_applied_index": 37, "lease_applied_index": 22, "desc": { "range_id": 255, "start_key": "8on9XTgCC8Ok9HA=", "end_key": "8on9Xbry9QDEaVA=", "internal_replicas": [ { "node_id": 2, "store_id": 2, "replica_id": 5, "type": 0 }, { "node_id": 5, "store_id": 5, "replica_id": 4, "type": 0 }, { "node_id": 3, "store_id": 3, "replica_id": 3, "type": 0 } ], "next_replica_id": 6, "generation": 60, "sticky_bit": { "wall_time": 9223372036854775807, "logical": 2147483647 } }, "lease": { "start": { "wall_time": 1688471580416803028 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 5, "type": 2 }, "proposed_ts": { "wall_time": 1688471580433292029 }, "epoch": 1, "sequence": 13, "acquisition_type": 2 }, "truncated_state": { "index": 22, "term": 6 }, "gc_threshold": {}, "stats": { "contains_estimates": 0, "last_update_nanos": 1688471581911434236, "intent_age": 0, "gc_bytes_age": 0, "live_bytes": 6600, "live_count": 200, "key_bytes": 5000, "key_count": 200, "val_bytes": 1600, "val_count": 200, "intent_bytes": 0, "intent_count": 0, "separated_intent_count": 0, "range_key_count": 0, "range_key_bytes": 0, "range_val_count": 0, "range_val_bytes": 0, "sys_bytes": 1284, "sys_count": 6, "abort_span_bytes": 0 }, "version": { "major": 23, "minor": 1, "patch": 0, "internal": 10 }, "raft_closed_timestamp": { "wall_time": 1688471578911824630 }, "raft_applied_index_term": 7, "gc_hint": { "latest_range_delete_timestamp": {} } }, "last_index": 37, "num_pending": 2, "raft_log_size": 5934, "approximate_proposal_quota": 8388025, "proposal_quota_base_index": 37, "range_max_bytes": 536870912, "active_closed_timestamp": { "wall_time": 1688472975379811219 }, "tenant_id": 1, "closed_timestamp_sidetransport_info": { "replica_closed": { "wall_time": 1688472975379811219 }, "replica_lai": 22, "central_closed": {} } }, "source_node_id": 2, "source_store_id": 2, "lease_history": [ { "start": { "wall_time": 1688471577723787541 }, "replica": { "node_id": 8, "store_id": 8, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1688471577750593265 }, "epoch": 1, "sequence": 11, "acquisition_type": 2 }, { "start": { "wall_time": 1688471580416803028 }, "expiration": { "wall_time": 1688471586415413325 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 5, "type": 2 }, "proposed_ts": { "wall_time": 1688471580415413325 }, "sequence": 12, "acquisition_type": 1 }, { "start": { "wall_time": 1688471580416803028 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 5, "type": 2 }, "proposed_ts": { "wall_time": 1688471580433292029 }, "epoch": 1, "sequence": 13, "acquisition_type": 2 } ], "problems": { "underreplicated": true }, "stats": {}, "lease_status": { "lease": { "start": { "wall_time": 1688471580416803028 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 5, "type": 2 }, "proposed_ts": { "wall_time": 1688471580433292029 }, "epoch": 1, "sequence": 13, "acquisition_type": 2 }, "now": { "wall_time": 1688475524035369955 }, "request_time": { "wall_time": 1688475524035369955 }, "state": 1, "liveness": { "node_id": 2, "epoch": 1, "expiration": {
"wall_time": 1688475529219325917, "logical": 0 } }, "min_valid_observed_timestamp": { "wall_time": 1688471580416803028 } }, "quiescent": true, "read_latches": 2, "write_latches": 2, "top_k_locks_by_wait_queue_waiters": null, "locality": { "tiers": [ { "key": "cloud", "value": "gce" }, { "key": "region", "value": "us-east1" }, { "key": "zone", "value": "us-east1-b" } ] }, "is_leaseholder": true, "lease_valid": true },
```
At this point, I'm starting to suspect proposal issues. Wdyt @tbg?
cc @cockroachdb/replication
Copying over this summary from Slack, to avoid having to piece it together from the above.
What seems to be happening is that about 60 ranges on n2 end up with stuck proposals holding latches, without tripping the circuit breakers. Some of these latches are for transaction records owning intents on meta1. When other callers try to scan meta1, they hit the intents, and r1 on n6 attempts to resolve the intents on n2, hitting the latches owned by the stuck proposals and stalling forever.
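As a minimal model of that wedge (illustrative only; a plain `sync.RWMutex` stands in for the spanlatch over the txn record's key):

```go
// Deliberately deadlocks to mirror the production hang: a stuck "proposal"
// holds the write latch forever, so intent resolution (a reader of the same
// span) can never make progress. Real latches live in
// pkg/kv/kvserver/spanlatch; this is only a model.
package main

import "sync"

func main() {
	var latch sync.RWMutex

	latch.Lock() // stuck proposal: acquires the write latch and never releases

	done := make(chan struct{})
	go func() {
		latch.RLock() // intent resolution blocks here, as on n2
		defer latch.RUnlock()
		close(done)
	}()

	<-done // never reached; the meta scan stalls behind the latch forever
}
```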
Starting this investigation from scratch to get my bearings.
First step: look at the combined logs and find the first slow latch ("have been waiting" message), which is usually a good way to figure out when replication-related trouble began. This happens on r352 at 11:53:20 on n2, and the write latch that has been held for 15s at that point is the range descriptor (‹/Local/Range/Table/106/1/5611422148296262192/RangeDescriptor›@0,0).
15s before that point, we see an `AdminScatter`. Okay, we tried to scatter the range, and that borked it. So far, so good. From that point on there are various more slow latches, etc., so it doesn't unbork and is likely wedged until the end of the test. But why?
The new thing we see at 11:57:23 is `raft receive queue for r352 is full`, again on n2, and this repeats frequently throughout. This is unexpected - we don't expect any traffic on this range.
Spot checking a few other ranges, there is a pattern.
When you look at the distribution of when they got stuck, there is a very strong correlation. These are not independent events - almost all of the slow latches are logged at 11:53:2[01]. Something got borked. In addition to this, all of the first events occur on n2. So, something happened on n2 that affected only the ranges it held the lease for.
```
$ for rid in $(cat affected.txt); do grep -E "$rid.*have been waiting.*latch" combined.log | head -n 1; done | grep -Eo 'W230704 ........' > first_affected.txt
$ cat first_affected.txt | sort | uniq -c
  12 W230704 11:53:20
  17 W230704 11:53:21
   3 W230704 11:53:23
   3 W230704 11:53:24
   1 W230704 11:54:30
   1 W230704 11:56:30
   1 W230704 11:57:30
   1 W230704 11:58:30
   1 W230704 11:59:30
   1 W230704 12:00:30
   1 W230704 12:02:30
   1 W230704 12:04:30
   1 W230704 12:07:30
   1 W230704 12:09:30
   1 W230704 12:12:30
   1 W230704 12:13:30
   1 W230704 12:14:30
   1 W230704 12:16:30
   1 W230704 12:17:30
   1 W230704 12:19:30
   1 W230704 12:20:30
   1 W230704 12:21:30
   1 W230704 12:22:30
   1 W230704 12:23:30
   1 W230704 12:24:30
   1 W230704 12:28:30
   1 W230704 12:30:30
   1 W230704 12:33:30
   1 W230704 12:47:30
   1 W230704 12:49:30
   1 W230704 12:50:30
   1 W230704 12:51:30
   1 W230704 12:52:30
   1 W230704 12:53:30
   1 W230704 12:55:30
   1 W230704 12:56:30
```
Flipping through the raft state for the affected ranges, for most of them n2 is the leader+leaseholder and all followers are in `StateReplicate`.
```
$ for rid in $(cat affected.txt | sed 's/r//g'); do cat debug/nodes/2/ranges.json | jq ".[] | select(.state.state.desc.range_id == $rid) | .raft_state"; done | less
[...]
```
We see 63 ranges report "raft receive queue is full". Notably, this is fewer than the 85 ranges we have in total, so maybe there is more than one way in which things broke down. Again, there doesn't seem to be a reason why any receive queue should fill up. After all, looking at e.g. r352, its last index is 19. There simply isn't much traffic on these ranges, and they're stuck anyway.
Couldn't the receive queues filling up simply be heartbeat responses and such?
Normally I'd suspect some kind of mutex deadlock here, but I didn't immediately see anything in the stacks. Could this be some kind of livelock scenario?
(EDIT: I originally looked at n1 and n6 stacks, where r1 was stuck, not n2)
I spoke too soon. I had been looking for mutexes on n1 and n6, since I was initially looking at a stuck r1. This is from n2:
```
goroutine 181 [semacquire, 65 minutes]:
sync.runtime_SemacquireMutex(0x9731ef?, 0x60?, 0xc0090e3b30?)
GOROOT/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc004536658)
GOROOT/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
GOROOT/src/sync/mutex.go:90
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).tryGetReplica(0xc002ce0a80, {0x77c61f0, 0xc00102d110}, 0x3e8?, 0x2, 0xc00334b8e0)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:110 +0xbe
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).tryGetOrCreateReplica(0xc002ce0a80, {0x77c61f0, 0xc00102d110}, 0x141, 0x0?, 0x0?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:173 +0x7c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).getOrCreateReplica(0xc003b11d68?, {0x77c61f0, 0xc00102d110}, 0x100c003b11d68?, 0x2, 0x576fe60?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:79 +0x256
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).withReplicaForRequest(0xc002ce13b0?, {0x77c61f0, 0xc00102d110}, 0xc00334b8c0, 0xc0044f5e28)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:346 +0x65
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processRequestQueue(0xc002ce0a80, {0x77c61f0, 0xc00102d110}, 0xc000db2120?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:603 +0x1c9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker(0xc000f6d1f0, {0x77c61f0, 0xc00102d110}, {0x77a9ac0, 0xc002ce0a80}, 0xc002ce0000)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:399 +0x1a7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2({0x77c61f0?, 0xc00102d110?})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b
```
That's the `raftMu`. Seeing several of these.
These are all of the stuck mutex stacks. The first two are on `raftMu`, the last are on `Replica.mu`.
```
goroutine 181 [semacquire, 65 minutes]:
sync.runtime_SemacquireMutex(0x9731ef?, 0x60?, 0xc0090e3b30?)
GOROOT/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc004536658)
GOROOT/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
GOROOT/src/sync/mutex.go:90
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).tryGetReplica(0xc002ce0a80, {0x77c61f0, 0xc00102d110}, 0x3e8?, 0x2, 0xc00334b8e0)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:110 +0xbe
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).tryGetOrCreateReplica(0xc002ce0a80, {0x77c61f0, 0xc00102d110}, 0x141, 0x0?, 0x0?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:173 +0x7c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).getOrCreateReplica(0xc003b11d68?, {0x77c61f0, 0xc00102d110}, 0x100c003b11d68?, 0x2, 0x576fe60?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:79 +0x256
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).withReplicaForRequest(0xc002ce13b0?, {0x77c61f0, 0xc00102d110}, 0xc00334b8c0, 0xc0044f5e28)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:346 +0x65
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processRequestQueue(0xc002ce0a80, {0x77c61f0, 0xc00102d110}, 0xc000db2120?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:603 +0x1c9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker(0xc000f6d1f0, {0x77c61f0, 0xc00102d110}, {0x77a9ac0, 0xc002ce0a80}, 0xc002ce0000)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:399 +0x1a7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2({0x77c61f0?, 0xc00102d110?})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

goroutine 186 [semacquire, 65 minutes]:
sync.runtime_SemacquireMutex(0x9731ef?, 0x60?, 0xc00677ca80?)
GOROOT/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc00667f2d8)
GOROOT/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
GOROOT/src/sync/mutex.go:90
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).tryGetReplica(0xc002ce0a80, {0x77c61f0, 0xc00102d5c0}, 0x3e8?, 0x2, 0xc008149360)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:110 +0xbe
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).tryGetOrCreateReplica(0xc002ce0a80, {0x77c61f0, 0xc00102d5c0}, 0x16c, 0x0?, 0x0?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:173 +0x7c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).getOrCreateReplica(0xc005d03d68?, {0x77c61f0, 0xc00102d5c0}, 0x100c005d03d68?, 0x2, 0x576fe60?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_create_replica.go:79 +0x256
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).withReplicaForRequest(0xc002ce13b0?, {0x77c61f0, 0xc00102d5c0}, 0xc008149340, 0xc004353e28)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:346 +0x65
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processRequestQueue(0xc002ce0a80, {0x77c61f0, 0xc00102d5c0}, 0x0?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:603 +0x1c9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker(0xc000f6d1f0, {0x77c61f0, 0xc00102d5c0}, {0x77a9ac0, 0xc002ce0a80}, 0xc002ce0000)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:399 +0x1a7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2({0x77c61f0?, 0xc00102d5c0?})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

goroutine 263 [semacquire, 65 minutes]:
sync.runtime_SemacquireMutex(0x7794ef0?, 0x10?, 0x0?)
GOROOT/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).RLock(...)
GOROOT/src/sync/rwmutex.go:71
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*storeReplicaVisitor).Visit(0xc001021410, 0xc002f93cc0)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:463 +0x145
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaScanner).scanLoop.func1({0x77c61f0?, 0xc000cb9950})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scanner.go:284 +0x23c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b

goroutine 6411 [semacquire, 65 minutes]:
sync.runtime_SemacquireMutex(0xc0001fc640?, 0x43?, 0x190b?)
GOROOT/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).Lock(0x0?)
GOROOT/src/sync/rwmutex.go:152 +0x71
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).addSnapshotLogTruncationConstraint.func2()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1916 +0x4d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendSnapshotUsingDelegate(0xc0023e9900, {0x77c61b8, 0xc00457d800}, {0x4fe9bc0?, 0xc0?, 0x0?, 0x0?}, 0x0, 0x1, 0x2, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_command.go:2895 +0xadf
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSnapshotQueue).processRaftSnapshot(0xc00385baa0?, {0x77c61b8, 0xc00457d800}, 0xc0023e9900, 0x7)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_snapshot_queue.go:148 +0x305
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSnapshotQueue).process(0x0?, {0x77c61b8, 0xc00457d800}, 0x77c61b8?, {0xc00457d800?, 0xc00385bc00?})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/raft_snapshot_queue.go:96 +0x192
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplica.func1({0x77c61b8, 0xc00457d800})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:1027 +0x275
github.com/cockroachdb/cockroach/pkg/util/timeutil.RunWithTimeout({0x77c61f0?, 0xc004fe9b60?}, {0xc003d31260, 0x25}, 0xdf8475800, 0xc00385be20)
github.com/cockroachdb/cockroach/pkg/util/timeutil/timeout.go:29 +0xdb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processReplica(0xc0027f3340, {0x77c61f0, 0xc004fe9b30}, {0x781ac18, 0xc0023e9900})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:986 +0x48e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*baseQueue).processLoop.func2.1({0x77c61f0, 0xc004fe9b30})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/queue.go:897 +0x117
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b
```
This smelled like some sort of deadlock, so I looked through the stacks. This one caught my eye right away:
```
goroutine 155 [select, 65 minutes]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaScanner).RemoveReplica(...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scanner.go:185
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).removeInitializedReplicaRaftMuLocked(0xc002ce0a80, {0x77c61f0, 0xc0051e2240}, 0xc008d69200, 0x7, {0x0?, 0x0?})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_remove_replica.go:227 +0xa45
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleChangeReplicasResult(0xc008d69200, {0x77c61f0, 0xc0051e2240}, 0xc007989c80)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go:619 +0x29e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult(0xc008d69308, {0x77c61f0, 0xc0051e2240}, 0xc0078f11d8)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:340 +0x670
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc008d69308, {0x77c61f0, 0xc0051e2240}, {0x780c0c0?, 0xc003158508})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:186 +0x46a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter({0x7f6f381ff818?, 0xc008d696d8}, 0xc0062ca560)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:214 +0x158
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc00779ea78, {0x77c61f0, 0xc0051e2240}, {0x77f9b40, 0xc008d69678})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:295 +0x1f8
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc0062caa78, {0x77c61f0, 0xc0051e2240})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:251 +0x9a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1052 +0x1545
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(_, {_, _}, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:751 +0x19b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc002ce0a80, 0x77c61f0?)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:660 +0x145
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker(0xc000f6d1f0, {0x77c61f0, 0xc00102cab0}, {0x77a9ac0, 0xc002ce0a80}, 0xc002ce0000)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418 +0x1f2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2({0x77c61f0?, 0xc00102cab0?})
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:475 +0x43b
```
We seem to have deadlocked on replica removal here, which keeps that replica's raftMu held forever. It's not obvious why removing from the scanner would get stuck like this, but all sorts of bad things result from it.
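The shape of the hang, roughly (a hedged sketch with illustrative names, not the actual scanner code): `RemoveReplica` is a blocking channel send whose only receiver is the scanner loop, so if that loop is itself wedged, every remover parks on the send while still holding the replica's raftMu:

```go
// Rough model of the hang: scanLoop is the sole receiver on `removed`, but
// it is blocked elsewhere (on a store mutex in the real stacks), so
// removeReplica's send never completes -- and the caller holds raftMu at
// that point. Illustrative names only.
package main

import "sync"

type replicaScanner struct {
	removed chan int      // unbuffered: drained only by scanLoop
	storeMu *sync.RWMutex // stands in for the mutex Visit blocks on
}

func (rs *replicaScanner) removeReplica(id int) {
	rs.removed <- id // parks forever if scanLoop never reaches its receive
}

func (rs *replicaScanner) scanLoop() {
	for {
		rs.storeMu.RLock() // wedged here if someone leaked the write lock
		// ... visit replicas ...
		rs.storeMu.RUnlock()
		<-rs.removed
	}
}

func main() {
	rs := &replicaScanner{removed: make(chan int), storeMu: new(sync.RWMutex)}
	rs.storeMu.Lock() // the leaked lock
	go rs.scanLoop()
	rs.removeReplica(7) // deadlocks, mirroring goroutine 155 above
}
```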
TLDR: there's at least one deadlocked replica mu on n2.
An immediate consequence of this is possible raft scheduler starvation. We see lots of the scheduler's goroutines stuck in the same `removeReplica` deadlock, since everyone who calls `RemoveReplica` on the scanner gets stuck on the same glue trap. This likely explains the majority of badness in this cluster (not why we have quiesced replicas with pending proposals, though; that is likely an independent bug we ought to look into). We also see scheduler threads deadlocked on `tryGetReplica`, so it's safe to say there isn't a free scheduler thread on this node.
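To make the starvation mechanism concrete, a minimal sketch (illustrative names, not the real scheduler in pkg/kv/kvserver/scheduler.go): the scheduler is a fixed pool of workers, and every worker that dequeues a wedged replica blocks on its raftMu, so a handful of stuck replicas can consume the whole pool:

```go
// Minimal sketch of a fixed-size raft scheduler worker pool: once every
// worker has picked up a replica whose raftMu never unlocks, no other
// range on the store gets processed.
package main

import "sync"

type replica struct{ raftMu sync.Mutex }

func worker(queue <-chan *replica) {
	for r := range queue {
		r.raftMu.Lock() // blocks forever if the mutex leaked
		// ... handleRaftReady / processRequestQueue ...
		r.raftMu.Unlock()
	}
}

func main() {
	const numWorkers = 8
	queue := make(chan *replica)
	for i := 0; i < numWorkers; i++ {
		go worker(queue)
	}
	stuck := &replica{}
	stuck.raftMu.Lock() // the leaked lock, as on n2
	for i := 0; i < numWorkers; i++ {
		queue <- stuck // every worker wedges on the same replica
	}
	// From here on, healthy replicas enqueued on `queue` are never serviced.
	select {}
}
```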
Ok, so can we find out where the mutex leaked? I don't see an obvious way. I went through all goroutines containing `Replica` in their stack, and they all seem to be suffering from the mutex being held, but none of them is the holder.
The deadlock reminds me of what we saw in https://github.com/cockroachlabs/support/issues/2387#issuecomment-1602346347 but could never track down.
Since we now seem to have a way in which this should at least be somewhat reproducible, we can annotate the mutex and hack together a version of https://github.com/cockroachdb/cockroach/issues/66765.
I'll get started on that today.
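For reference, the kind of annotation I have in mind is something like this (a hedged sketch in the spirit of #66765; none of these names exist in the tree):

```go
// Sketch of an instrumented mutex that remembers who locked it last, so a
// leaked lock can be attributed from a later debug dump. Hypothetical
// helper, not the actual patch. Requires Go 1.19+ for atomic.Pointer.
package main

import (
	"runtime/debug"
	"sync"
	"sync/atomic"
)

type trackedMutex struct {
	mu     sync.Mutex
	holder atomic.Pointer[[]byte] // stack of the current holder; nil if free
}

func (m *trackedMutex) Lock() {
	m.mu.Lock()
	stack := debug.Stack()
	m.holder.Store(&stack)
}

func (m *trackedMutex) Unlock() {
	m.holder.Store(nil)
	m.mu.Unlock()
}

// lastHolder returns the stack of whoever currently holds the mutex, to be
// dumped when a waiter has been stuck suspiciously long.
func (m *trackedMutex) lastHolder() []byte {
	if s := m.holder.Load(); s != nil {
		return *s
	}
	return nil
}

func main() {
	var m trackedMutex
	m.Lock() // a "leak": never unlocked
	println(string(m.lastHolder()))
}
```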
Kicking off the repro machine, on #106254. Let's hope that something shakes out soon, if it does then we'll know who leaked the mutex and that should pretty much settle it. If it doesn't repro... uhm, well, I am willing to try pretty hard to have it repro anyway. It happened once, it will happen again.
roachtest.failover/chaos/read-only failed with artifacts on master @ dbe8511fae8fca21562fdde5c240b1f7d06ef582:
(test_runner.go:1075).runTest: test timed out (1h0m0s)
(cluster.go:2282).Run: output in run_114845.393942892_n10_cockroach-workload-i: ./cockroach workload init kv --splits 1000 --insert-count 100000 {pgurl:1} returned: COMMAND_PROBLEM: exit status 137
(cluster.go:2282).Run: cluster.RunE: context canceled
(cluster.go:2282).Run: cluster.RunE: context canceled
(cluster.go:2282).Run: cluster.RunE: context canceled
(cluster.go:2282).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-only/run_1
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
I ran this on master and something else seems to have broken that's way worse and very frequent. Now the test fails most of the time, with the initial splits getting stuck in an infinite loop, where it's continuously seeing a changed descriptor and retrying with the same descriptor - that descriptor looks OK, but somehow the one it sees in the split txn is "old". Something got borked. I suspect it's https://github.com/cockroachdb/cockroach/pull/106104, which, I think, can cause replica change txns to succeed on the range but roll back on meta2. This is being reverted now, in https://github.com/cockroachdb/cockroach/pull/106267.
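The shape of that loop, as I understand it (a hedged sketch; illustrative names, not the actual split/DistSender code):

```go
// Illustrative sketch of the suspected failure mode: the split txn keeps
// failing with a "descriptor changed" condition, the client refreshes from
// meta2, gets back the descriptor it already had (meta2 rolled back), and
// retries forever. Hypothetical names throughout.
package main

import (
	"errors"
	"fmt"
)

var errDescriptorChanged = errors.New("descriptor changed")

type rangeDescriptor struct{ generation int }

func runSplit(lookup func() rangeDescriptor, split func(rangeDescriptor) error) error {
	for {
		desc := lookup() // reads meta2
		err := split(desc)
		if !errors.Is(err, errDescriptorChanged) {
			return err
		}
		// In the broken state, lookup() returns the same descriptor on the
		// next iteration, so every retry fails identically and never exits.
	}
}

func main() {
	calls := 0
	// Healthy case: meta2 reflects the new descriptor after one retry.
	err := runSplit(
		func() rangeDescriptor { calls++; return rangeDescriptor{generation: calls} },
		func(d rangeDescriptor) error {
			if d.generation < 2 {
				return errDescriptorChanged
			}
			return nil
		},
	)
	fmt.Println(err) // <nil>; in the broken state this would spin forever
}
```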
Well, the good news is that with #106267 we seem to actually be able to run this test. The bad news is, it doesn't seem to want to reproduce the original issue here. I ran all of the 🔴 🟢 in https://github.com/cockroachdb/cockroach/issues/106285 with #106254, i.e. it was doing only the splits and then exiting early, and this never reproduced the problem we are seeing here, over a total of, I'm guessing, 500 runs, including 300 on the SHA of this issue (which predates the other bad commit anyway, so didn't need the revert).
Going to try for 9999 runs overnight, on this issue's SHA.
All of the nightly 1643 runs have passed. I am going to kick off a few hundred on the vanilla original SHA (428dc9da6a320de218460de6c6c8807caa4ded98) with an early-exit to see if it's the mutex instrumentation preventing the deadlock.
Just for posterity: I didn't understand why I saw `AdminScatter` in the original logs (and a ton of it):
teamcity-10771334-1688449519-65-n10cpu2-0001> I230704 11:52:56.660750 16288 13@kv/kvserver/allocator/plan/replicate.go:900 ⋮ [T1,n1,s1,r17/1:‹/Table/1{5-6}›,kvpb.AdminScatterRequest] 2156 rebalancing ‹voter› n2,s2 to n6,s6: ‹[1:180, 2:180, 3:180, 4:180, 5:180]›,
I was confused by this because we do not see this line printed in the logs, but we do see the "creating workload database" above the workload init. The workload init does not pass `--scatter` either, and it defaults to false.
But I found out by code inspection that when the workload creates splits, it also scatters each split.
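Roughly, per split point (a paraphrase of the workload SQL setup, not the actual code; the SQL statements themselves are real CockroachDB syntax):

```go
// Hedged sketch of what the workload's split setup does per split point
// (the real helper lives in pkg/workload/workloadsql; this is a paraphrase).
// The point: every split is immediately followed by a scatter, which is
// where all the AdminScatter traffic in the logs comes from.
package main

import (
	gosql "database/sql"
	"fmt"
)

func splitAndScatter(db *gosql.DB, table string, point int) error {
	if _, err := db.Exec(
		fmt.Sprintf(`ALTER TABLE %s SPLIT AT VALUES (%d)`, table, point),
	); err != nil {
		return err
	}
	// Scatter the span around the new split point.
	_, err := db.Exec(fmt.Sprintf(
		`ALTER TABLE %s SCATTER FROM (VALUES (%d)) TO (VALUES (%d))`,
		table, point, point,
	))
	return err
}

func main() {} // illustration only; wiring up a real connection is omitted
```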
So everything checks out and the repro attempts, sadly, actually made sense. I was hoping that I had missed something that was present in the original execution.
> I am going to kick off a few hundred on the vanilla original SHA (https://github.com/cockroachdb/cockroach/commit/428dc9da6a320de218460de6c6c8807caa4ded98) with an early-exit to see if it's the mutex instrumentation preventing the deadlock.
Predictably all 100 passed. I have since run hundreds more and counting.
The full test is passing repeatedly too, just as you'd think it would. There is one flake I have hit three times (out of 181 runs of the test):
```
--- FAIL: failover/chaos/read-only#107 (2599.08s)
(assertions.go:333).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:1605
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/failover.go:327
main/pkg/cmd/roachtest/monitor.go:105
golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
GOROOT/src/runtime/asm_amd64.s:1594
Error: Received unexpected error:
pq: setting updated but timed out waiting to read new value
Test: failover/chaos/read-only
```
I've spot checked the logs for one instance and it doesn't look like the deadlock.
I also ran 3151 more iterations of the "shortened" test overnight. They all passed. I'll give this another 24h and then it might be time to cut our losses.
edit, 2 days later:
> I'll give this another 24h and then it might be time to cut our losses.

That didn't start due to a dumb error on my part. The 24h timer starts now.
😢 PASS failover/chaos/read-only (run 2071)
Okay, time to throw in the towel:
failover/chaos/read-only (run 3701) PASS
I filed https://github.com/cockroachdb/cockroach/issues/106568 instead (with release-blocker label for now, even though it's not a new bug).
https://github.com/cockroachdb/cockroach/pull/106574 hopefully fixes this.
roachtest.failover/chaos/read-only failed with artifacts on master @ 428dc9da6a320de218460de6c6c8807caa4ded98:
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
/cc @cockroachdb/kv-triage
Jira issue: CRDB-29407