@jbowens Can you take a look at this failure this morning? We entered the loop waiting for the number of bank ranges to fall to 1 at 8:19:25. And the cockroach node logs show lots of activity up until 8:25. It is quite unfortunate that we're not printing the number of bank ranges on each loop iteration.
Cc @nvanbenschoten because this might require some KV help.
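To make the gap concrete: a sketch (hypothetical helper, not the actual roachtest code) of the wait loop with the missing per-iteration log. Printing the current range count each time around means a hang like this one leaves a trail showing which count we were stuck at:

```go
package main

import (
	"fmt"
	"time"
)

// waitForSingleRange polls countRanges until the bank table is down to a
// single range, logging the count on every iteration so a stall is
// diagnosable from the test logs alone.
func waitForSingleRange(countRanges func() int, interval time.Duration, maxIters int) bool {
	for i := 0; i < maxIters; i++ {
		n := countRanges()
		fmt.Printf("iter %d: %d bank ranges remaining\n", i, n)
		if n <= 1 {
			return true
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	// Simulated range counts; the real loop would query crdb_internal.ranges.
	counts := []int{4, 2, 1}
	i := -1
	ok := waitForSingleRange(func() int { i++; return counts[i] }, 0, 10)
	fmt.Println(ok) // true
}
```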
👀
There is a debug.zip with some additional information from a single node: n3. It looks like there was a single remaining range, r1933.

From `problemranges.json`:
"3": {
"no_raft_leader_range_ids": [
1933
]
},
From `nodes/3/ranges1933.json`:

```json
"start_key": "/Table/53/1/33789752",
"end_key": "/Table/53/1/33840088"
```
@nvanbenschoten Definitely seems like something for KV to take a look at or sign-off on.
In node 3's logs, we repeatedly see:
```
E200925 08:25:14.320388 447 kv/kvserver/store_rebalancer.go:315 ⋮ [n3,s3,store-rebalancer] unable to relocate range to ‹[n7,s7 n5,s5 n8,s8]›: ‹operation "relocate range" timed out after 1m0s›: looking up lease: ‹failed to send RPC: sending to all replicas failed; last error: r1933 was not found on s5›
(1) ‹operation "relocate range" timed out after 1m0s›
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).relocateOne
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:2341
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).AdminRelocateRange
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:2143
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore.func2
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_rebalancer.go:313
  | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:135
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_rebalancer.go:312
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).Start.func1
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_rebalancer.go:200
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1
  | 	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:222
  | runtime.goexit
  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
Wraps: (3) looking up lease
Wraps: (4) ‹failed to send RPC: sending to all replicas failed; last error: r1933 was not found on s5›
Error types: (1) *contextutil.TimeoutError (2) *withstack.withStack (3) *errutil.withPrefix (4) *roachpb.internalError
```
We've done around 60 successful runs. We should proceed with the release since we have nothing to go on here...
I ran 50 more runs last night, all of which passed. So I do think we should proceed with the RC.
That said, I dug a little more into what we have in the artifacts here and it's pretty interesting. We don't have a lot, except from node 3. But on node 4, we have the entire range report:
```json
{ "span": { "start_key": "/Table/53/1/33789752", "end_key": "/Table/53/1/33840088" }, "raft_state": { "replica_id": 1, "hard_state": { "term": 9, "vote": 2, "commit": 50 }, "state": "StatePreCandidate", "applied": 50, "progress": null }, "state": { "state": { "raft_applied_index": 50, "lease_applied_index": 27, "desc": { "range_id": 1933, "start_key": "vYn5AgOXOA==", "end_key": "vYn5AgRb2A==", "internal_replicas": [ { "node_id": 3, "store_id": 3, "replica_id": 1 }, { "node_id": 5, "store_id": 5, "replica_id": 2 }, { "node_id": 2, "store_id": 2, "replica_id": 3 } ], "next_replica_id": 4, "generation": 282 }, "lease": { "start": { "wall_time": 1601022109608583568 }, "replica": { "node_id": 5, "store_id": 5, "replica_id": 2 }, "proposed_ts": { "wall_time": 1601022109608585055 }, "epoch": 3, "sequence": 134 }, "truncated_state": { "index": 28, "term": 7 }, "gc_threshold": {}, "stats": { "contains_estimates": 0, "last_update_nanos": 1601022098930487996, "intent_age": 0, "gc_bytes_age": 0, "live_bytes": 0, "live_count": 0, "key_bytes": 0, "key_count": 0, "val_bytes": 0, "val_count": 0, "intent_bytes": 0, "intent_count": 0, "sys_bytes": 3915, "sys_count": 14, "abort_span_bytes": 83 }, "using_applied_state_key": true }, "last_index": 50, "raft_log_size": 4343, "range_max_bytes": 536870912, "newest_closed_timestamp": { "node_id": 3, "closed_timestamp": { "wall_time": 1601042493577751505 }, "mlai": 27, "epoch": 2 }, "active_closed_timestamp": { "wall_time": 1601022109608583568 }, "tenant_id": 1, "lock_table": "global: num=0\nlocal: num=0\n" }, "source_node_id": 3, "source_store_id": 3, "lease_history": [ { "start": { "wall_time": 1601021930595652885 }, "replica": { "node_id": 3, "store_id": 3, "replica_id": 1 }, "proposed_ts": { "wall_time": 1601021930595658404 }, "epoch": 2, "sequence": 133 }, { "start": { "wall_time": 1601022109608583568 }, "replica": { "node_id": 5, "store_id": 5, "replica_id": 2 }, "proposed_ts": { "wall_time": 1601022109608585055 }, "epoch": 3, "sequence": 134 } ], "problems": { "no_raft_leader": true }, "stats": { "queries_per_second": 4.216150118004028 }, "latches_local": {}, "latches_global": {}, "lease_status": { "lease": { "start": { "wall_time": 1601022109608583568 }, "replica": { "node_id": 5, "store_id": 5, "replica_id": 2 }, "proposed_ts": { "wall_time": 1601022109608585055 }, "epoch": 3, "sequence": 134 }, "timestamp": { "wall_time": 1601042676320069494 }, "state": 3, "liveness": { "node_id": 5, "epoch": 3, "expiration": { "wall_time": 1601042493847235041, "logical": 0 } } }, "ticking": true }
```
Notice that the replica is in a `StatePreCandidate` state, which implies that it can't communicate with either of the other replicas in the Raft group. We also see that node 5 is the current leaseholder.
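Why `StatePreCandidate` implies isolation: with pre-vote enabled, a replica only becomes a real candidate (and bumps its term) after a quorum of peers answers its pre-vote probe. A minimal model of that rule (plain Go, not etcd/raft; names are illustrative):

```go
package main

import "fmt"

type state string

const (
	follower     state = "StateFollower"
	preCandidate state = "StatePreCandidate"
	candidate    state = "StateCandidate"
)

// campaign models the pre-vote phase: the replica counts its own implicit
// pre-vote plus responses from reachable peers, and only a quorum lets it
// proceed to a real election. An isolated replica stays a pre-candidate
// forever, which is what "state": "StatePreCandidate" indicates.
func campaign(reachablePeers, totalReplicas int) state {
	votes := 1 + reachablePeers
	if votes > totalReplicas/2 {
		return candidate // quorum of pre-votes: start a real election
	}
	return preCandidate // stuck probing; the term is never incremented
}

func main() {
	fmt.Println(campaign(0, 3)) // n3 alone in a 3-replica group: StatePreCandidate
	fmt.Println(campaign(1, 3)) // one peer reachable: StateCandidate
}
```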
But if we look at node 5, we see a similar log to what we saw on node 3:
```
W200925 08:22:50.783090 117463 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n5] slow range RPC: ‹have been waiting 60.26s (60 attempts) for RPC ResolveIntent [/Local/Range/Table/53/1/33789752/RangeDescriptor,/Min) to r1933:/Table/53/1/33{789752-840088} [(n3,s3):1, (n5,s5):2, (n2,s2):3, next=4, gen=282]; resp: failed to send RPC: sending to all replicas failed; last error: r1933 was not found on s5›
```
But in this case, r1933 on n5 can't route to itself to resolve an intent on its own local range descriptor. So it looks like nodes 2 and 5 removed their replicas of r1933 and node 3 is now confused. Based on this one log, my best guess is that we're doing something wrong with the speculative range descriptor (the one pulled from an intent in the meta range) in the range descriptor cache during a merge, and that this is tripping up intent resolution. I'm going to try to remember how this is supposed to work while continuing to fire off these tests.
50 more successful runs.
100 more successful runs. This one's tough to repro.
@nvanbenschoten, I'm going to close this one out because it's old. Feel free to re-open it. This test has been running on master for ~5 months now, so it's definitely not very reproducible.
(roachtest).clearrange/checks=false failed on release-20.2@445481640c3b9b1b34e537e6da9c853e9a83cc75:
Artifacts: [/clearrange/checks=false](https://teamcity.cockroachdb.com/viewLog.html?buildId=2322277&tab=artifacts#/clearrange/checks=false) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Aclearrange%2Fchecks%3Dfalse.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)