cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: acceptance/version-upgrade failed [shutdown error due to no suitable transfer target found] #114549

Closed by rafiss 10 months ago

rafiss commented 11 months ago

roachtest.acceptance/version-upgrade failed with artifacts on master @ e19c24fb62d24595e74c0bae0aaad0a736c2bdc7:

(mixedversion.go:547).Run: mixed-version test failure while running step 34 (restart node 1 with binary version master): COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/acceptance/version-upgrade/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12702441/acceptance-version-upgrade/1700091656339/1700092295246)

[mixed-version-test/34_restart-node-1-with-binary-version-master] 23:49:16 runner.go:336: mixed-version test failure while running step 34 (restart node 1 with binary version master): COMMAND_PROBLEM: exit status 1

CRDB logs from node 1 show this being repeatedly logged during the shutdown:

I231115 23:49:17.685544 55860 kv/kvserver/store.go:1759 ⋮ [T1,drain,n1,s1,r426/1:‹/Table/19{2-3}›] 1453  failed to transfer lease repl=(n1,s1):1 seq=26 start=1700092018.354238817,0 epo=34 pro=1700092018.357644666,0 for range r426:‹/Table/19{2-3}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=39] when draining: ‹no suitable transfer target found›
I231115 23:49:17.685626 55860 kv/kvserver/store.go:1760 ⋮ [T1,drain,n1,s1,r426/1:‹/Table/19{2-3}›] 1454  blocked for 93 microseconds on transfer attempt
W231115 23:49:18.556118 16550 kv/kvserver/store.go:1834 ⋮ [drain] 1455  unable to drain cleanly within 5s (cluster setting ‹server.shutdown.lease_transfer_wait›), service might briefly deteriorate if the node is terminated: waiting for 1 replicas to transfer their lease away

Warnings from KV distribution logs cockroach-kv-distribution.teamcity-12702441-1700091323-06-n4cpu4-0001.ubuntu.2023-11-15T23_47_19Z.011989.log

W231115 23:47:51.614476 287 13@kv/kvserver/store_rebalancer.go:323 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 157  StorePool missing descriptor for local store with ID 1, store list ‹  candidate: avg-ranges=161.50 avg-leases=108.50 avg-disk-usage=19 MiB avg-queries-per-second=10.44 avg-store-cpu-per-second=4ms›
W231115 23:47:51.614476 287 13@kv/kvserver/store_rebalancer.go:323 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 157 +‹  4: ranges=158 leases=154 disk-usage=12 MiB queries-per-second=19.74 store-cpu-per-second=6ms io-overload=0.00›
W231115 23:47:51.614476 287 13@kv/kvserver/store_rebalancer.go:323 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 157 +‹  2: ranges=165 leases=63 disk-usage=25 MiB queries-per-second=1.14 store-cpu-per-second=2ms io-overload=0.00›
W231115 23:47:51.614618 287 13@kv/kvserver/store_rebalancer.go:431 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 158  no rebalance context given, bailing out of rebalancing store, will try again later
W231115 23:48:39.194253 287 13@kv/kvserver/store_rebalancer.go:323 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 159  StorePool missing descriptor for local store with ID 1, store list ‹  candidate: avg-ranges=163.67 avg-leases=73.67 avg-disk-usage=26 MiB avg-queries-per-second=14.23 avg-store-cpu-per-second=8ms›
W231115 23:48:39.194253 287 13@kv/kvserver/store_rebalancer.go:323 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 159 +‹  4: ranges=159 leases=138 disk-usage=12 MiB queries-per-second=24.22 store-cpu-per-second=6ms io-overload=0.00›
W231115 23:48:39.194253 287 13@kv/kvserver/store_rebalancer.go:323 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 159 +‹  3: ranges=167 leases=12 disk-usage=38 MiB queries-per-second=0.83 store-cpu-per-second=7ms io-overload=0.00›
W231115 23:48:39.194253 287 13@kv/kvserver/store_rebalancer.go:323 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 159 +‹  2: ranges=165 leases=71 disk-usage=27 MiB queries-per-second=17.65 store-cpu-per-second=12ms io-overload=0.00›
W231115 23:48:39.194389 287 13@kv/kvserver/store_rebalancer.go:431 ⋮ [T1,n1,s1,store-rebalancer,obj=‹cpu›] 160  no rebalance context given, bailing out of rebalancing store, will try again later


Jira issue: CRDB-33554

blathers-crl[bot] commented 11 months ago

cc @cockroachdb/replication

kvoli commented 11 months ago

Warnings from KV distribution logs cockroach-kv-distribution.teamcity-12702441-1700091323-06-n4cpu4-0001.ubuntu.2023-11-15T23_47_19Z.011989.log

These are expected. The store is excluded from the store descriptor list while it is draining (or dead, or suspect), which prevents the store rebalancer from running. This is usually desirable, since it prevents poor rebalancing decisions.

https://github.com/cockroachdb/cockroach/blob/6d4a8274e9fede7e2f65f69698d30b7c3cd9dbaa/pkg/kv/kvserver/allocator/storepool/store_pool.go#L1146-L1147
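
For illustration, a minimal Go sketch of the filtering described above; the names (`storeStatus`, `filterStores`) are hypothetical and simplified, not the actual storepool API. The point is that draining, dead, and suspect stores are dropped from the candidate list, so a draining local store has no descriptor and the rebalancer bails out, as in the warnings.

```go
package main

import "fmt"

// storeStatus is a simplified stand-in for the store pool's notion of
// store health; the real implementation tracks more states.
type storeStatus int

const (
	statusAvailable storeStatus = iota
	statusDraining
	statusDead
	statusSuspect
)

type storeDescriptor struct {
	StoreID int
	Status  storeStatus
}

// filterStores drops draining, dead, and suspect stores from the
// candidate list, mirroring the behavior described above: a draining
// local store is excluded, so the store rebalancer finds no descriptor
// for it and bails out of rebalancing.
func filterStores(stores []storeDescriptor) []storeDescriptor {
	var out []storeDescriptor
	for _, s := range stores {
		switch s.Status {
		case statusDraining, statusDead, statusSuspect:
			continue // excluded from rebalancing decisions
		default:
			out = append(out, s)
		}
	}
	return out
}

func main() {
	stores := []storeDescriptor{
		{StoreID: 1, Status: statusDraining}, // the local, draining store
		{StoreID: 2, Status: statusAvailable},
		{StoreID: 3, Status: statusAvailable},
		{StoreID: 4, Status: statusAvailable},
	}
	// Store 1 is missing from the result, matching the "missing descriptor
	// for local store" warning above.
	fmt.Println(filterStores(stores))
}
```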

I'll take a look to see why n1 was unable to transfer away r426.

kvoli commented 10 months ago

The lease didn't transfer because the other replicas for r426 on n2 and n3 were in StateProbe:

https://github.com/cockroachdb/cockroach/blob/64dacab2808c5420e3fd99d2562e8ecb97b7742b/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go#L2886-L2886
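
Roughly, the allocator treats candidates that are not in StateReplicate as unsafe lease-transfer targets. Below is a simplified sketch of that filtering using the raft tracker's progress states; `excludeNonReplicating` is a hypothetical name, not the actual allocator function.

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3/tracker"
)

// replicaID identifies a candidate lease-transfer target.
type replicaID uint64

// excludeNonReplicating is a simplified sketch (not the actual allocator
// code): any candidate whose raft progress state is not StateReplicate may
// be behind on the log, so it is dropped from the lease-transfer candidates.
func excludeNonReplicating(
	candidates []replicaID, progress map[uint64]tracker.Progress,
) []replicaID {
	var out []replicaID
	for _, id := range candidates {
		pr, ok := progress[uint64(id)]
		if !ok || pr.State != tracker.StateReplicate {
			continue
		}
		out = append(out, id)
	}
	return out
}

func main() {
	// Mirrors the raft status dump below: replicas 2 and 3 are in StateProbe.
	progress := map[uint64]tracker.Progress{
		2: {State: tracker.StateProbe},
		3: {State: tracker.StateProbe},
	}
	// Prints an empty list: with both followers probing there is
	// "no suitable transfer target found", as in the drain logs.
	fmt.Println(excludeNonReplicating([]replicaID{2, 3}, progress))
}
```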

Details

``` { "span": { "start_key": "/Table/192", "end_key": "/Table/193" }, "raft_state": { "replica_id": 1, "hard_state": { "term": 9, "vote": 1, "commit": 30 }, "lead": 1, "state": "StateLeader", "applied": 30, "progress": { "1": { "match": 30, "next": 31, "state": "StateReplicate" }, "2": { "match": 30, "next": 31, "state": "StateProbe" }, "3": { "match": 30, "next": 31, "state": "StateProbe" } } }, "state": { "state": { "raft_applied_index": 30, "lease_applied_index": 10, "desc": { "range_id": 426, "start_key": "9sA=", "end_key": "9sE=", "internal_replicas": [ { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, { "node_id": 2, "store_id": 2, "replica_id": 2, "type": 0 }, { "node_id": 3, "store_id": 3, "replica_id": 3, "type": 0 } ], "next_replica_id": 4, "generation": 39, "sticky_bit": {} }, "lease": { "start": { "wall_time": 1700092018354238817 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1700092018357644666 }, "epoch": 34, "sequence": 26, "acquisition_type": 2 }, "truncated_state": { "index": 10, "term": 5 }, "gc_threshold": {}, "stats": { "contains_estimates": 0, "last_update_nanos": 1700091926198272055, "lock_age": 0, "gc_bytes_age": 0, "live_bytes": 0, "live_count": 0, "key_bytes": 0, "key_count": 0, "val_bytes": 0, "val_count": 0, "intent_bytes": 0, "intent_count": 0, "lock_bytes": 0, "lock_count": 0, "range_key_count": 0, "range_key_bytes": 0, "range_val_count": 0, "range_val_bytes": 0, "sys_bytes": 520, "sys_count": 8, "abort_span_bytes": 81 }, "version": { "major": 21, "minor": 2, "patch": 0, "internal": 56 }, "raft_closed_timestamp": { "wall_time": 1700092026759330669 }, "raft_applied_index_term": 9, "gc_hint": { "latest_range_delete_timestamp": { "wall_time": 1700092029759064237 }, "gc_timestamp": {}, "gc_timestamp_next": {} } }, "last_index": 30, "raft_log_size": 4792, "raft_log_size_trusted": true, "approximate_proposal_quota": 8388608, "proposal_quota_base_index": 30, "range_max_bytes": 67108864, "active_closed_timestamp": { "wall_time": 1700092171224602074 }, "tenant_id": 1, "closed_timestamp_sidetransport_info": { "replica_closed": { "wall_time": 1700092171224602074 }, "replica_lai": 10, "central_closed": {} } }, "source_node_id": 1, "source_store_id": 1, "lease_history": [ { "start": { "wall_time": 1700092000095188671 }, "expiration": { "wall_time": 1700092006095132297 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 2, "type": 0 }, "proposed_ts": { "wall_time": 1700092000095132297 }, "sequence": 23, "acquisition_type": 1 }, { "start": { "wall_time": 1700092000095188671 }, "replica": { "node_id": 2, "store_id": 2, "replica_id": 2, "type": 0 }, "proposed_ts": { "wall_time": 1700092000100002640 }, "epoch": 31, "sequence": 24, "acquisition_type": 2 }, { "start": { "wall_time": 1700092018354238817 }, "expiration": { "wall_time": 1700092024354184870 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1700092018354184870 }, "sequence": 25, "acquisition_type": 1 }, { "start": { "wall_time": 1700092018354238817 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1700092018357644666 }, "epoch": 34, "sequence": 26, "acquisition_type": 2 } ], "problems": {}, "stats": { "queries_per_second": 0.006411167572789356, "writes_per_second": 0.006411167561855939, "requests_per_second": 0.006411167572542738, "write_bytes_per_second": 1.301467013404659, "cpu_time_per_second": 87085.81222829696 }, "lease_status": { 
"lease": { "start": { "wall_time": 1700092018354238817 }, "replica": { "node_id": 1, "store_id": 1, "replica_id": 1, "type": 0 }, "proposed_ts": { "wall_time": 1700092018357644666 }, "epoch": 34, "sequence": 26, "acquisition_type": 2 }, "now": { "wall_time": 1700092174335964515 }, "request_time": { "wall_time": 1700092174335964515 }, "state": 1, "liveness": { "node_id": 1, "epoch": 34, "expiration": { "wall_time": 1700092179192615356, "logical": 0 }, "draining": true }, "min_valid_observed_timestamp": { "wall_time": 1700092018354238817 } }, "ticking": true, "top_k_locks_by_wait_queue_waiters": null, "locality": { "tiers": [ { "key": "cloud", "value": "gce" }, { "key": "region", "value": "us-east1" }, { "key": "zone", "value": "us-east1-b" } ] }, "is_leaseholder": true, "lease_valid": true }, ```

The two replicas match the leader's commit index; I couldn't find any indication of why they'd be in StateProbe.

cockroach-teamcity commented 10 months ago

roachtest.acceptance/version-upgrade failed with artifacts on master @ caff15394fcbe37208b46b2973714c27cc3a1417:

(mixedversion.go:540).Run: mixed-version test failure while running step 15 (run "test features"): pq: internal error: deadline below read timestamp is nonsensical; txn has would have no chance to commit. Deadline: 1700894239.099954982,1. Read timestamp: 1700894239.126264174,0 Previous Deadline: 1700894533.225568165,0.
test artifacts and logs in: /artifacts/acceptance/version-upgrade/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7) See: [Grafana](https://go.crdb.dev/roachtest-grafana/teamcity-12832189/acceptance-version-upgrade/1700894118646/1700894378092)


erikgrinaker commented 10 months ago

The two replicas match the leader's commit index; I couldn't find any indication of why they'd be in StateProbe.

Yeah, something's off about this. The leader should move them back to StateReplicate after a heartbeat/MsgApp. The range wasn't quiesced, and neither of the replicas was paused (in the MsgAppFlowPaused sense). We had RPC connections to all nodes. If the replicas weren't receiving or responding to heartbeats, they should campaign and the leader should also step down.
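
For reference, the expected leader-side behavior is roughly the following: a probing follower that acknowledges an append via MsgAppResp should be moved back to StateReplicate; without any MsgAppResp it stays in StateProbe. This is a simplified sketch of raft's progress handling, not the etcd-io/raft code.

```go
package main

import "fmt"

// progressState is a simplified stand-in for raft's follower progress
// states; the real tracker lives in go.etcd.io/raft/v3/tracker.
type progressState int

const (
	stateProbe progressState = iota
	stateReplicate
)

// follower models the leader's view of a single follower.
type follower struct {
	state progressState
	match uint64 // highest log index known to be replicated on the follower
}

// onMsgAppResp sketches the expected transition: once a probing follower
// acknowledges an append at some index, the leader switches it back to
// StateReplicate and resumes pipelined appends. If no MsgAppResp ever
// arrives (as the metrics below suggest), the follower stays in StateProbe.
func (f *follower) onMsgAppResp(index uint64) {
	if index > f.match {
		f.match = index
	}
	if f.state == stateProbe {
		f.state = stateReplicate
	}
}

func main() {
	f := &follower{state: stateProbe}
	f.onMsgAppResp(30) // ack for index 30, as in the status dump above
	fmt.Println(f.state == stateReplicate) // true
}
```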

Don't have any immediate ideas, but will follow up later.

blathers-crl[bot] commented 10 months ago

cc @cockroachdb/replication

renatolabs commented 10 months ago

Please disregard the failure above. It's a flake that is being addressed separately.

erikgrinaker commented 10 months ago

Tentatively marking as GA blocker until I get a chance to look at this.

kvoli commented 10 months ago

Assigning as P1 because of the GA blocker.

kvoli commented 10 months ago

Ignore the incorrect issue references from #115559.

erikgrinaker commented 10 months ago

The stuck StateProbe should have been fixed by https://github.com/etcd-io/raft/pull/52, which includes a (passing) test case. Looking closer.

erikgrinaker commented 10 months ago

Looked over the code, which seems reasonable. However, it seems like we're not receiving MsgAppResp from the followers here. n1 is the Raft leader for only a single range:

[Screenshot 2023-12-11 at 15 26 28: graph showing n1 holding a single Raft leadership]

The leader is receiving 2 MsgHeartbeatResp per second, i.e. one from each follower:

[Screenshot 2023-12-11 at 15 27 39: graph of MsgHeartbeatResp received by the leader, 2/sec]

But it's not receiving any MsgAppResp:

[Screenshot 2023-12-11 at 15 28 34: graph showing no MsgAppResp received]

It's unclear whether we're even sending the MsgApp in the first place.

erikgrinaker commented 10 months ago

Raft transport isn't dropping any messages, so we should be connected:

[Screenshot 2023-12-11 at 15 49 59: graph showing no dropped Raft transport messages]

erikgrinaker commented 10 months ago

This would be explained by having a functional system RPC connection but not a functional default RPC connection, since heartbeats go across the system class. 🤔 However, the RPC logs show that all connection classes were successfully established between n1 and both n2 and n3.
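
For context, a minimal sketch of the connection-class split being referred to, with hypothetical names (CockroachDB's real classes live in pkg/rpc): system-range traffic uses a separate connection class from regular traffic, so one class could in principle be healthy while the other is not.

```go
package main

import "fmt"

// connClass is a simplified stand-in for CockroachDB's RPC connection
// classes: system-range traffic travels on a dedicated connection,
// separate from default-class traffic.
type connClass int

const (
	defaultClass connClass = iota
	systemClass
)

// classForRange picks a connection class for traffic addressed to a range,
// mirroring the idea that system ranges get the dedicated system class
// while user ranges use the default class.
func classForRange(isSystemRange bool) connClass {
	if isSystemRange {
		return systemClass
	}
	return defaultClass
}

func main() {
	// Heartbeats go across the system class (as noted above), so they could
	// keep flowing even if default-class traffic were stalled -- the
	// hypothesis considered (and ruled out) here.
	fmt.Println(classForRange(true), classForRange(false))
}
```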

pav-kv commented 10 months ago

The stuck StateProbe should have been fixed by etcd-io/raft#52, which includes a (passing) test case. Looking closer.

This is a mixed-version test though. Have we ported that fix to 23.1?

n1 is on 23.1 during this last bit:

I231115 23:47:19.238330 12481 util/log/file_sync_buffer.go:238 ⋮ [T1,config]   binary: CockroachDB CCL v23.1.3 (x86_64-pc-linux-gnu, built 2023/06/08 22:36:13, go1.19.4)

There is a backport, but was it after 23.1.3? Most certainly looks like it, given the dates.

erikgrinaker commented 10 months ago

Oh ffs. 🤦 Thanks, that saved me a couple of hours chasing my own tail. The backport isn't in 23.1.3; it landed in 23.1.9.

Known issue, closing.

rafiss commented 10 months ago

@renatolabs -- based on the above, would it make sense to use AlwaysUseLatestPredecessors for this mixed-version test?

erikgrinaker commented 10 months ago

I believe we already do; this is an old failure.