
roachtest: clearrange/checks=false failed #54790

Closed: cockroach-teamcity closed this issue 3 years ago

cockroach-teamcity commented 4 years ago

(roachtest).clearrange/checks=false failed on release-20.2@445481640c3b9b1b34e537e6da9c853e9a83cc75:

The test failed on branch=release-20.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/clearrange/checks=false/run_1
    test_runner.go:802: test timed out (6h30m0s)

    cluster.go:2651,clearrange.go:179,clearrange.go:34,test_runner.go:755: monitor failure: monitor task failed: context canceled
        (1) attached stack trace
          -- stack trace:
          | main.(*monitor).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2639
          | main.(*monitor).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2647
          | main.runClearRange
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:179
          | main.registerClearRange.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:34
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitor).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2695
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1357
        Wraps: (4) monitor task failed
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Artifacts: [/clearrange/checks=false](https://teamcity.cockroachdb.com/viewLog.html?buildId=2322277&tab=artifacts#/clearrange/checks=false) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Aclearrange%2Fchecks%3Dfalse.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

petermattis commented 4 years ago

@jbowens Can you take a look at this failure this morning? At 8:19:25 we entered the loop that waits for the number of bank ranges to fall to 1, and the cockroach node logs show lots of activity up until 8:25. It is quite unfortunate that we're not printing the number of bank ranges on each loop iteration.
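
For reference, the wait is conceptually something like the sketch below. This is not the actual roachtest code: the `SHOW RANGES` query, the poll interval, and the helper name are illustrative assumptions, but it shows the kind of per-iteration logging that would have told us how far the merges got.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // any pgwire driver works against CockroachDB
)

// waitForSingleBankRange is a hypothetical helper: poll how many ranges the
// bank table currently spans and log the count on every iteration, so a hung
// run at least records how far the merges progressed before stalling.
func waitForSingleBankRange(db *sql.DB) error {
	for {
		var n int
		if err := db.QueryRow(
			`SELECT count(*) FROM [SHOW RANGES FROM TABLE bank.bank]`,
		).Scan(&n); err != nil {
			return err
		}
		log.Printf("bank table currently spans %d range(s)", n)
		if n == 1 {
			return nil
		}
		time.Sleep(10 * time.Second) // poll interval chosen arbitrarily here
	}
}

func main() {
	// Assumed connection string for an insecure local node.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/bank?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := waitForSingleBankRange(db); err != nil {
		log.Fatal(err)
	}
}
```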

Cc @nvanbenschoten because this might require some KV help.

jbowens commented 4 years ago

👀

jbowens commented 4 years ago

There is a debug.zip with some additional information, though only from a single node, n3. It looks like there was a single remaining range, r1933.

problemranges.json:

```json
"3": {
  "no_raft_leader_range_ids": [
    1933
  ]
},
```
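
If you want the same report from a live cluster instead of a debug.zip, the status server exposes it over HTTP. A rough sketch follows; the `/_status/problemranges` path, the 8080 HTTP port, and the insecure setup are assumptions to double-check against the actual deployment.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed: an insecure local node with its HTTP port on 8080, where
	// /_status/problemranges serves the same report that lands in the
	// debug.zip as problemranges.json.
	resp, err := http.Get("http://localhost:8080/_status/problemranges")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```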

nodes/3/ranges/1933.json:

```json
"start_key": "/Table/53/1/33789752",
"end_key": "/Table/53/1/33840088"
```

petermattis commented 4 years ago

@nvanbenschoten Definitely seems like something for KV to take a look at or sign-off on.

nvanbenschoten commented 4 years ago

In node 3's logs, we repeatedly see:

```
E200925 08:25:14.320388 447 kv/kvserver/store_rebalancer.go:315 ⋮ [n3,s3,store-rebalancer] unable to relocate range to ‹[n7,s7 n5,s5 n8,s8]›: ‹operation "relocate range" timed out after 1m0s›: looking up lease: ‹failed to send RPC: sending to all replicas failed; last error: r1933 was not found on s5›
(1) ‹operation "relocate range" timed out after 1m0s›
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).relocateOne
  |     /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:2341
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).AdminRelocateRange
  |     /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:2143
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore.func2
  |     /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_rebalancer.go:313
  | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
  |     /go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:135
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).rebalanceStore
  |     /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_rebalancer.go:312
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreRebalancer).Start.func1
  |     /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_rebalancer.go:200
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1
  |     /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:222
  | runtime.goexit
  |     /usr/local/go/src/runtime/asm_amd64.s:1357
Wraps: (3) looking up lease
Wraps: (4) ‹failed to send RPC: sending to all replicas failed; last error: r1933 was not found on s5›
Error types: (1) *contextutil.TimeoutError (2) *withstack.withStack (3) *errutil.withPrefix (4) *roachpb.internalError
```
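
The `operation "relocate range" timed out after 1m0s` prefix appears to come from the RunWithTimeout frame in the stack above. A simplified, self-contained sketch of that pattern (not CockroachDB's actual contextutil implementation) is:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithTimeout is a stand-in for the wrapping visible in the stack trace:
// run fn under a derived deadline and, if that deadline fires, prefix the
// error the way the log line above reads.
func runWithTimeout(ctx context.Context, op string, timeout time.Duration,
	fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	err := fn(ctx)
	if err != nil && errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return fmt.Errorf("operation %q timed out after %s: %w", op, timeout, err)
	}
	return err
}

func main() {
	// Stand-in for a relocation that never makes progress: it just waits for
	// the deadline, like the repeated 1m0s relocation attempts in the log.
	err := runWithTimeout(context.Background(), "relocate range", time.Second,
		func(ctx context.Context) error {
			<-ctx.Done()
			return ctx.Err()
		})
	fmt.Println(err)
	// operation "relocate range" timed out after 1s: context deadline exceeded
}
```
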
andreimatei commented 4 years ago

We've done around 60 successful runs. We should proceed with the release since we have nothing to go on here...

nvanbenschoten commented 4 years ago

I ran 50 more runs last night, all of which passed, so I do think we should proceed with the release candidate.

That said, I dug a little more into what we have in the artifacts here, and it's pretty interesting. We don't have a lot, except from node 3. But for r1933, we do have node 3's entire range report:

debug/nodes/3/ranges/1933.json

```json
{
  "span": {
    "start_key": "/Table/53/1/33789752",
    "end_key": "/Table/53/1/33840088"
  },
  "raft_state": {
    "replica_id": 1,
    "hard_state": {
      "term": 9,
      "vote": 2,
      "commit": 50
    },
    "state": "StatePreCandidate",
    "applied": 50,
    "progress": null
  },
  "state": {
    "state": {
      "raft_applied_index": 50,
      "lease_applied_index": 27,
      "desc": {
        "range_id": 1933,
        "start_key": "vYn5AgOXOA==",
        "end_key": "vYn5AgRb2A==",
        "internal_replicas": [
          { "node_id": 3, "store_id": 3, "replica_id": 1 },
          { "node_id": 5, "store_id": 5, "replica_id": 2 },
          { "node_id": 2, "store_id": 2, "replica_id": 3 }
        ],
        "next_replica_id": 4,
        "generation": 282
      },
      "lease": {
        "start": { "wall_time": 1601022109608583568 },
        "replica": { "node_id": 5, "store_id": 5, "replica_id": 2 },
        "proposed_ts": { "wall_time": 1601022109608585055 },
        "epoch": 3,
        "sequence": 134
      },
      "truncated_state": { "index": 28, "term": 7 },
      "gc_threshold": {},
      "stats": {
        "contains_estimates": 0,
        "last_update_nanos": 1601022098930487996,
        "intent_age": 0,
        "gc_bytes_age": 0,
        "live_bytes": 0,
        "live_count": 0,
        "key_bytes": 0,
        "key_count": 0,
        "val_bytes": 0,
        "val_count": 0,
        "intent_bytes": 0,
        "intent_count": 0,
        "sys_bytes": 3915,
        "sys_count": 14,
        "abort_span_bytes": 83
      },
      "using_applied_state_key": true
    },
    "last_index": 50,
    "raft_log_size": 4343,
    "range_max_bytes": 536870912,
    "newest_closed_timestamp": {
      "node_id": 3,
      "closed_timestamp": { "wall_time": 1601042493577751505 },
      "mlai": 27,
      "epoch": 2
    },
    "active_closed_timestamp": { "wall_time": 1601022109608583568 },
    "tenant_id": 1,
    "lock_table": "global: num=0\nlocal: num=0\n"
  },
  "source_node_id": 3,
  "source_store_id": 3,
  "lease_history": [
    {
      "start": { "wall_time": 1601021930595652885 },
      "replica": { "node_id": 3, "store_id": 3, "replica_id": 1 },
      "proposed_ts": { "wall_time": 1601021930595658404 },
      "epoch": 2,
      "sequence": 133
    },
    {
      "start": { "wall_time": 1601022109608583568 },
      "replica": { "node_id": 5, "store_id": 5, "replica_id": 2 },
      "proposed_ts": { "wall_time": 1601022109608585055 },
      "epoch": 3,
      "sequence": 134
    }
  ],
  "problems": { "no_raft_leader": true },
  "stats": { "queries_per_second": 4.216150118004028 },
  "latches_local": {},
  "latches_global": {},
  "lease_status": {
    "lease": {
      "start": { "wall_time": 1601022109608583568 },
      "replica": { "node_id": 5, "store_id": 5, "replica_id": 2 },
      "proposed_ts": { "wall_time": 1601022109608585055 },
      "epoch": 3,
      "sequence": 134
    },
    "timestamp": { "wall_time": 1601042676320069494 },
    "state": 3,
    "liveness": {
      "node_id": 5,
      "epoch": 3,
      "expiration": { "wall_time": 1601042493847235041, "logical": 0 }
    }
  },
  "ticking": true
}
```

Notice that the replica is in a StatePreCandidate state, which implies that it can't communicate with either of the other replicas in the Raft group. We also see that node 5 is the current leaseholder.
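
This combination can also be spotted mechanically from the saved report. Below is a small sketch that decodes only the fields quoted above; the struct mirrors this 20.2-era JSON and is an assumption rather than a stable schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// rangeReport covers just the fields discussed here; the real per-range
// status JSON contains many more.
type rangeReport struct {
	RaftState struct {
		State string `json:"state"`
	} `json:"raft_state"`
	State struct {
		State struct {
			Lease struct {
				Replica struct {
					NodeID  int `json:"node_id"`
					StoreID int `json:"store_id"`
				} `json:"replica"`
			} `json:"lease"`
		} `json:"state"`
	} `json:"state"`
	Problems struct {
		NoRaftLeader bool `json:"no_raft_leader"`
	} `json:"problems"`
}

func main() {
	f, err := os.Open("debug/nodes/3/ranges/1933.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var r rangeReport
	if err := json.NewDecoder(f).Decode(&r); err != nil {
		panic(err)
	}
	if r.RaftState.State == "StatePreCandidate" && r.Problems.NoRaftLeader {
		fmt.Printf("wedged replica: no raft leader, lease recorded on n%d/s%d\n",
			r.State.State.Lease.Replica.NodeID, r.State.State.Lease.Replica.StoreID)
	}
}
```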

But if we look at node 5, we see a similar log to what we saw on node 3:

```
W200925 08:22:50.783090 117463 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n5] slow range RPC: ‹have been waiting 60.26s (60 attempts) for RPC ResolveIntent [/Local/Range/Table/53/1/33789752/RangeDescriptor,/Min) to r1933:/Table/53/1/33{789752-840088} [(n3,s3):1, (n5,s5):2, (n2,s2):3, next=4, gen=282]; resp: failed to send RPC: sending to all replicas failed; last error: r1933 was not found on s5›
```

But in this case, r1933 on n5 can't route to itself to resolve an intent on its local range descriptor. So it kind of looks like node 2 and node 5 removed their replicas for r1933 and now node 3 is confused. Based on this one log, my best guess is that we're doing something wrong with the speculative range descriptor (the one pulled from an intent in the meta range) in the range descriptor cache during a merge, and that's what is tripping up intent resolution. I'm going to try to remember how this is supposed to work while continuing to fire off these tests.

nvanbenschoten commented 4 years ago

50 more successful runs.

nvanbenschoten commented 4 years ago

100 more successful runs. This one's tough to repro.

irfansharif commented 3 years ago

@nvanbenschoten, I'm going to close this one out because it's old; feel free to re-open it. This test has been running on master for ~5 months now, so this failure is clearly not very reproducible.