Closed rafiss closed 10 months ago
cc @cockroachdb/replication
> Warnings from KV distribution logs cockroach-kv-distribution.teamcity-12702441-1700091323-06-n4cpu4-0001.ubuntu.2023-11-15T23_47_19Z.011989.log
These are expected. The store is excluded from the store descriptor list when draining/dead/suspect -- which prevents the store rebalancer from running. This behavior is usually helpful to prevent poor decisions.
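For context, here is a rough sketch of the exclusion described above. The types and the `rebalanceCandidates` helper are hypothetical illustrations, not CockroachDB's actual store pool code; the point is just that draining/dead/suspect stores drop out of the candidate list the rebalancer works from.

```go
package main

import "fmt"

// storeStatus is a hypothetical stand-in for the liveness status the
// allocator consults when building its candidate list.
type storeStatus int

const (
	statusAvailable storeStatus = iota
	statusDraining
	statusDead
	statusSuspect
)

type storeDescriptor struct {
	StoreID int
	Status  storeStatus
}

// rebalanceCandidates mirrors the idea described above: draining, dead, and
// suspect stores are dropped from the descriptor list, so the store
// rebalancer never considers them as rebalance targets.
func rebalanceCandidates(stores []storeDescriptor) []storeDescriptor {
	var out []storeDescriptor
	for _, s := range stores {
		switch s.Status {
		case statusDraining, statusDead, statusSuspect:
			continue // excluded, as described above
		default:
			out = append(out, s)
		}
	}
	return out
}

func main() {
	stores := []storeDescriptor{
		{StoreID: 1, Status: statusDraining}, // e.g. a node that is shutting down
		{StoreID: 2, Status: statusAvailable},
		{StoreID: 3, Status: statusAvailable},
	}
	fmt.Println(rebalanceCandidates(stores)) // only stores 2 and 3 remain
}
```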
I'll take a look to see why `n1` was unable to transfer away `r426`.
The lease didn't transfer because the other replicas for `r426`, on `n2` and `n3`, were in `StateProbe`:
``` { "span": { "start_key": "/Table/192", "end_key": "/Table/193" }, "raft_state": { "replica_id": 1, "hard_state": { "term": 9, "vote": 1, "commit": 30 }, "lead": 1, "state": "StateLeader", "applied": 30, "progress": { "1": { "match": 30, "next": 31, "state": "StateReplicate" }, "2": { "match": 30, "next": 31, "state": "StateProbe" }, "3": { "match": 30, "next": 31, "state": "StateProbe" } } }, "state": { "state": { "raft_applied_index": 30, "lease_applied_index": 10, "desc": { "range_id": 426, "start_key": "9sA=", "end_key": "9sE=", "internal_replicas": [ { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, { "node_id": 2, "store_id": 2, "replica_id": 2, "type": 0 }, { "node_id": 3, "store_id": 3, "replica_id": 3, "type": 0 } ], "next_replica_id": 4, "generation": 39, "sticky_bit": {} }, "lease": { "start": { "wall_time": 1700092018354238817 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1700092018357644666 }, "epoch": 34, "sequence": 26, "acquisition_type": 2 }, "truncated_state": { "index": 10, "term": 5 }, "gc_threshold": {}, "stats": { "contains_estimates": 0, "last_update_nanos": 1700091926198272055, "lock_age": 0, "gc_bytes_age": 0, "live_bytes": 0, "live_count": 0, "key_bytes": 0, "key_count": 0, "val_bytes": 0, "val_count": 0, "intent_bytes": 0, "intent_count": 0, "lock_bytes": 0, "lock_count": 0, "range_key_count": 0, "range_key_bytes": 0, "range_val_count": 0, "range_val_bytes": 0, "sys_bytes": 520, "sys_count": 8, "abort_span_bytes": 81 }, "version": { "major": 21, "minor": 2, "patch": 0, "internal": 56 }, "raft_closed_timestamp": { "wall_time": 1700092026759330669 }, "raft_applied_index_term": 9, "gc_hint": { "latest_range_delete_timestamp": { "wall_time": 1700092029759064237 }, "gc_timestamp": {}, "gc_timestamp_next": {} } }, "last_index": 30, "raft_log_size": 4792, "raft_log_size_trusted": true, "approximate_proposal_quota": 8388608, "proposal_quota_base_index": 30, "range_max_bytes": 67108864, "active_closed_timestamp": { "wall_time": 1700092171224602074 }, "tenant_id": 1, "closed_timestamp_sidetransport_info": { "replica_closed": { "wall_time": 1700092171224602074 }, "replica_lai": 10, "central_closed": {} } }, "source_node_id": 1, "source_store_id": 1, "lease_history": [ { "start": { "wall_time": 1700092000095188671 }, "expiration": { "wall_time": 1700092006095132297 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 2, "type": 0 }, "proposed_ts": { "wall_time": 1700092000095132297 }, "sequence": 23, "acquisition_type": 1 }, { "start": { "wall_time": 1700092000095188671 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 2, "type": 0 }, "proposed_ts": { "wall_time": 1700092000100002640 }, "epoch": 31, "sequence": 24, "acquisition_type": 2 }, { "start": { "wall_time": 1700092018354238817 }, "expiration": { "wall_time": 1700092024354184870 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1700092018354184870 }, "sequence": 25, "acquisition_type": 1 }, { "start": { "wall_time": 1700092018354238817 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1700092018357644666 }, "epoch": 34, "sequence": 26, "acquisition_type": 2 } ], "problems": {}, "stats": { "queries_per_second": 0.006411167572789356, "writes_per_second": 0.006411167561855939, "requests_per_second": 0.006411167572542738, "write_bytes_per_second": 1.301467013404659, "cpu_time_per_second": 87085.81222829696 }, "lease_status": { 
"lease": { "start": { "wall_time": 1700092018354238817 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1700092018357644666 }, "epoch": 34, "sequence": 26, "acquisition_type": 2 }, "now": { "wall_time": 1700092174335964515 }, "request_time": { "wall_time": 1700092174335964515 }, "state": 1, "liveness": { "node_id": 1, "epoch": 34, "expiration": { "wall_time": 1700092179192615356, "logical": 0 }, "draining": true }, "min_valid_observed_timestamp": { "wall_time": 1700092018354238817 } }, "ticking": true, "top_k_locks_by_wait_queue_waiters": null, "locality": { "tiers": [ { "key": "cloud", "value": "gce" }, { "key": "region", "value": "us-east1" }, { "key": "zone", "value": "us-east1-b" } ] }, "is_leaseholder": true, "lease_valid": true }, ```
Both replicas match the leader's commit index; I couldn't find any indication of why they'd be in `StateProbe`.
roachtest.acceptance/version-upgrade failed with artifacts on master @ caff15394fcbe37208b46b2973714c27cc3a1417:
```
(mixedversion.go:540).Run: mixed-version test failure while running step 15 (run "test features"): pq: internal error: deadline below read timestamp is nonsensical; txn has would have no chance to commit. Deadline: 1700894239.099954982,1. Read timestamp: 1700894239.126264174,0 Previous Deadline: 1700894533.225568165,0.
test artifacts and logs in: /artifacts/acceptance/version-upgrade/run_1
```
Parameters: `ROACHTEST_arch=amd64`, `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=4`, `ROACHTEST_encrypted=false`, `ROACHTEST_metamorphicBuild=false`, `ROACHTEST_ssd=0`
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12832189/acceptance-version-upgrade/1700894118646/1700894378092)
> Both replicas match the leader's commit index; I couldn't find any indication of why they'd be in `StateProbe`.
Yeah, something's off about this. The leader should move them back to `StateReplicate` after a heartbeat/MsgApp. The range wasn't quiesced, and neither of the replicas was paused (in the `MsgAppFlowPaused` sense). We had RPC connections to all nodes. If the replicas weren't receiving or responding to heartbeats, they should campaign and the leader should also step down.
Don't have any immediate ideas, but will follow up later.
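To make that expectation concrete, here is a simplified model of the leader-side rule (a toy sketch, not the etcd-io/raft tracker code): a probing follower is only promoted back to `StateReplicate` when the leader processes a `MsgAppResp` that advances the follower's match index, so if no `MsgAppResp` ever arrives -- which is what the metrics below suggest -- the follower simply stays in `StateProbe`.

```go
package main

import "fmt"

type progressState int

const (
	stateProbe progressState = iota
	stateReplicate
)

func (s progressState) String() string {
	return [...]string{"StateProbe", "StateReplicate"}[s]
}

// progress is a toy version of the per-follower tracker kept by the leader.
type progress struct {
	match uint64
	next  uint64
	state progressState
}

// onMsgAppResp mirrors the leader-side handling: if the acked index advances
// the follower's match, a probing follower is promoted back to replicating.
// If no MsgAppResp is ever processed, the state never changes.
func (p *progress) onMsgAppResp(ackedIndex uint64) {
	if ackedIndex > p.match {
		p.match = ackedIndex
		p.next = ackedIndex + 1
		if p.state == stateProbe {
			p.state = stateReplicate
		}
	}
}

func main() {
	p := &progress{match: 30, next: 31, state: stateProbe}
	fmt.Println(p.state) // StateProbe: stuck here if no MsgAppResp arrives

	// A MsgAppResp acking a higher index would promote the follower again.
	p.onMsgAppResp(31)
	fmt.Println(p.state) // StateReplicate
}
```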
cc @cockroachdb/replication
Please disregard the failure above. It's a flake that is being addressed separately.
Tentatively marking as GA blocker until I get a chance to look at this.
Assigning as P1 because of the GA blocker.
Ignore the incorrect issue references from #115559.
The stuck `StateProbe` should have been fixed by https://github.com/etcd-io/raft/pull/52, which includes a (passing) test case. Looking closer.
Looked over the code, which seems reasonable. However, it seems like we're not receiving `MsgAppResp` from the followers here. `n1` only has a single Raft leader:

The leader is receiving 2 `MsgHeartbeatResp` per second, i.e. one from each follower:

But it's not receiving any `MsgAppResp`:

It's unclear whether we're even sending the `MsgApp` in the first place.
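For reference, counters of this kind can be pulled straight off a node's Prometheus endpoint. A minimal sketch, assuming an insecure test cluster with the default HTTP port (8080) and the usual Prometheus rendering of CockroachDB's `raft.rcvd.*` metrics (`raft_rcvd_appresp`, `raft_rcvd_heartbeatresp`, ...):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Dumps the raft receive-side counters from a node's Prometheus endpoint.
// Assumes an insecure cluster and the default HTTP port; the metric names
// are the Prometheus renderings of CockroachDB's raft.rcvd.* metrics.
func main() {
	resp, err := http.Get("http://localhost:8080/_status/vars")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "raft_rcvd_") {
			fmt.Println(line) // e.g. raft_rcvd_appresp{store="1"} ...
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```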
Raft transport isn't dropping any messages, so we should be connected:
This would be explained by having a functional `system` RPC connection but not a functional `default` RPC connection, since heartbeats go across the `system` class. 🤔 However, RPC logs show that all connection classes are successfully established between `n1` and `n2`/`n3`.
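For readers unfamiliar with connection classes, here is a toy sketch of the idea (hypothetical code, not CockroachDB's `pkg/rpc` API): traffic is split across per-class connections, so a wedged `default`-class connection could in principle leave `system`-class heartbeats flowing while `MsgApp` traffic stalls -- which is why this was a plausible theory before the RPC logs ruled it out.

```go
package main

import "fmt"

// connectionClass is a toy version of the idea above: separate connections
// per traffic class, so a wedged "default" connection does not take down
// "system" traffic (or vice versa). Not CockroachDB's actual API.
type connectionClass int

const (
	defaultClass connectionClass = iota
	systemClass
)

type message struct {
	kind string // e.g. "MsgApp", "coalesced-heartbeat"
}

// classForMessage routes heartbeat traffic over the system class and bulk
// replication traffic over the default class, matching the theory above.
func classForMessage(m message) connectionClass {
	if m.kind == "coalesced-heartbeat" {
		return systemClass
	}
	return defaultClass
}

func main() {
	for _, m := range []message{{"coalesced-heartbeat"}, {"MsgApp"}} {
		fmt.Printf("%s -> class %d\n", m.kind, classForMessage(m))
	}
}
```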
> The stuck `StateProbe` should have been fixed by etcd-io/raft#52, which includes a (passing) test case. Looking closer.
This is a mixed-version test though. Have we ported that fix to 23.1?
`n1` is on 23.1 during this last bit:
```
I231115 23:47:19.238330 12481 util/log/file_sync_buffer.go:238 ⋮ [T1,config] binary: CockroachDB CCL v23.1.3 (x86_64-pc-linux-gnu, built 2023/06/08 22:36:13, go1.19.4)
```
There is a backport, but was it after 23.1.3? Most certainly looks like it, given the dates.
Oh ffs. 🤦 Thanks, that saved me a couple of hours chasing my own tail. The backport isn't in 23.1.3; it landed in 23.1.9.
Known issue, closing.
@renatolabs -- based on the above, would it make sense to use `AlwaysUseLatestPredecessors` for this mixed-version test?
I believe we already do; this is an old failure.
roachtest.acceptance/version-upgrade failed with artifacts on master @ e19c24fb62d24595e74c0bae0aaad0a736c2bdc7:
Parameters: `ROACHTEST_arch=amd64`, `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=4`, `ROACHTEST_encrypted=false`, `ROACHTEST_metamorphicBuild=false`, `ROACHTEST_ssd=0`
Help
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12702441/acceptance-version-upgrade/1700091656339/1700092295246)
CRDB logs from node 1 show this being repeatedly logged during the shutdown:
Warnings from KV distribution logs cockroach-kv-distribution.teamcity-12702441-1700091323-06-n4cpu4-0001.ubuntu.2023-11-15T23_47_19Z.011989.log
Jira issue: CRDB-33554