cockroachdb / cockroach

Starting a cluster after all nodes go down #102401

Closed: gregory112 closed this issue 1 year ago

gregory112 commented 1 year ago

Describe the problem

Hi, I have a three-node CockroachDB cluster. For some reason, after another issue that I am still troubleshooting, all nodes went down. The nodes can no longer be recovered: every time a node comes up, it tries to contact the other nodes, fails, gets its circuit breaker tripped, and goes down again. This happens to all nodes.

To Reproduce

Restart a whole cluster.

Expected behavior

Each node should wait until the other nodes are up, rather than giving up.

Additional data / screenshots

I230427 03:53:56.912123 66 1@server/server.go:1551 ⋮ [n2] 53  starting https server at ‹[::]:8080› (use: ‹spmi-cockroach2.service.consul:8080›)
I230427 03:53:56.912285 66 1@server/server.go:1559 ⋮ [n2] 54  starting grpc/postgres server at ‹[::]:26257›
I230427 03:53:56.912631 66 1@server/server.go:1560 ⋮ [n2] 55  advertising CockroachDB node at ‹spmi-cockroach2.service.consul:26257›
E230427 03:53:56.931901 173 util/log/logcrash/crash_reporting.go:359 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 56  Queued as error ‹b950d558bd81480d8da813d40e3fbec5›
W230427 03:53:57.550406 157 kv/kvserver/raft_transport.go:602 ⋮ [n2] 57  while processing outgoing Raft queue to node 3: ‹EOF›:
I230427 03:53:57.769497 314 2@rpc/context.go:1774 ⋮ [n2,rnode=63,raddr=‹spmi-cockroach3.service.consul:26257›,class=system] 58  dialing
I230427 03:53:57.849785 280 2@rpc/context.go:1774 ⋮ [n2,rnode=3,raddr=‹spmi-cockroach1.service.consul:26257›,class=default] 59  dialing
I230427 03:53:57.849804 279 1@circuitbreaker/circuitbreaker.go:322 ⋮ [n2] 60  circuitbreaker: ‹rpc [::]:26257 [n1]› tripped: unable to look up descriptor for n1
I230427 03:53:57.850092 279 1@circuitbreaker/circuitbreaker.go:447 ⋮ [n2] 61  circuitbreaker: ‹rpc [::]:26257 [n1]› event: ‹BreakerTripped›
W230427 03:53:57.853251 282 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 62  ‹[core]›‹[Channel #17 SubChannel #18] grpc: addrConn.createTransport failed to connect to {›
W230427 03:53:57.853251 282 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 62 +‹  "Addr": "spmi-cockroach1.service.consul:26257",›
W230427 03:53:57.853251 282 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 62 +‹  "ServerName": "spmi-cockroach1.service.consul:26257",›
W230427 03:53:57.853251 282 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 62 +‹  "Attributes": null,›
W230427 03:53:57.853251 282 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 62 +‹  "BalancerAttributes": null,›
W230427 03:53:57.853251 282 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 62 +‹  "Type": 0,›
W230427 03:53:57.853251 282 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 62 +‹  "Metadata": null›
W230427 03:53:57.853251 282 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 62 +‹}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.151.247:26257: connect: connection refused"›
I230427 03:53:57.853420 280 1@circuitbreaker/circuitbreaker.go:322 ⋮ [n2] 63  circuitbreaker: ‹rpc [::]:26257 [n3]› tripped: failed to connect to n3 at ‹spmi-cockroach1.service.consul:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.151.247:26257: connect: connection refused"›
I230427 03:53:57.853846 280 1@circuitbreaker/circuitbreaker.go:447 ⋮ [n2] 64  circuitbreaker: ‹rpc [::]:26257 [n3]› event: ‹BreakerTripped›
W230427 03:53:57.853868 280 2@rpc/nodedialer/nodedialer.go:193 ⋮ [n2] 65  unable to connect to n3: failed to connect to n3 at ‹spmi-cockroach1.service.consul:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.151.247:26257: connect: connection refused"›
I230427 03:53:58.079525 173 1@kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 66  the server is terminating due to a fatal error (see the DEV channel for details)
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67  ???: raft closed timestamp regression in cmd: "[E\x14\xaa/\x8e\x10\xa1" (term: 36, index: 16138); batch state: 1682542323.637366844,0, command: 1682532069.800997084,0, lease: repl=(n1,s1):1 seq=13 start=1682517907.920121659,7 exp=1682542335.536880235,0 pro=1682542326.536880235,0, req: <unknown; not leaseholder>, applying at LAI: 5377.
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +(1) assertion failure
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +Wraps: (2) attached stack trace
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  -- stack trace:
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).assertNoCmdClosedTimestampRegression
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1156
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).Stage
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:498
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCmdIter
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:184
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:279
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:246
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1045
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:664
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:641
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:308
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | runtime.goexit
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |  GOROOT/src/runtime/asm_amd64.s:1594
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +Wraps: (3) raft closed timestamp regression in cmd: "[E\x14\xaa/\x8e\x10\xa1" (term: 36, index: 16138); batch state: 1682542323.637366844,0, command: 1682532069.800997084,0, lease: repl=(n1,s1):1 seq=13 start=1682517907.920121659,7 exp=1682542335.536880235,0 pro=1682542326.536880235,0, req: <unknown; not leaseholder>, applying at LAI: 5377.
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | Closed timestamp was set by req: <unknown; not leaseholder or not lease request> under lease: <nil>; applied at LAI: 0. Batch idx: 7.
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | This assertion will fire again on restart; to ignore run with env var COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | Raft log tail:
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | ‹0,0 /Local/RangeID/2/u/RaftLog/logIndex:16144 (0x01698a757266746c0000000000003f1000): Term:36 Index:16144 Type:EntryNormal  by lease #13›
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | ‹proposer_lease_sequence:13 max_lease_index:5383 closed_timestamp:<wall_time:1682532069995897541 > replicated_eval_result:<write_timestamp:<> delta:<sys_bytes:134 sys_count:1 > > logical_op_log:<> ›
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | ‹write batch:›
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | ‹Put: 0,0 /Local/Range/System/NodeLiveness/3/Transaction/"3b217172-5cfc-4c4c-b326-7cfe9c79d6cc" (0x016b120400ff6c6976656e6573732d8b000174786e2d3b2171725cfc4c4cb3267cfe9c79d6cc00): meta={id=3b217172 key=/System/NodeLiveness/3 pri=0.01351554 epo=0 ts=1682532063.523649223,0 min=1682532063.523649223,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0›
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  |
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | ‹0,0 /Local/RangeID/2/u/RaftLog/logIndex:16143 (0x01698a757266746c0000000000003f0f00): Term:36 Index:16143 Type:EntryNormal  by lease #13›
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | ‹proposer_lease_sequence:13 max_lease_index:5382 closed_timestamp:<wall_time:1682532069800997084 > replicated_eval_result:<write_timestamp:<> delta:<sys_bytes:134 sys_count:1 > > logical_op_log:<> ›
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | ‹write batch:›
F230427 03:53:58.079621 173 kv/kvserver/store_raft.go:642 ⋮ [n2,s2,r2/2:‹/System/NodeLiveness{-Max}›,raft] 67 +  | ‹Put: 0,0 /Local/Range/System/NodeLiveness/3/Transaction/"8584cd4f-81aa-44ee-8195-478ccd232ab4" (0x016b120400ff6c6976656e6573732d8b000174786e2d8584cd4f81aa44ee8195478ccd232ab400): meta={id=8584cd4f key=/System/NodeLiveness/3 pri=0.01974707 epo=0 ts=1682532050.091698408,0 min=1682532050.091698408,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0›



Additional context

The cluster goes down completely.

Jira issue: CRDB-27423

blathers-crl[bot] commented 1 year ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

knz commented 1 year ago

@erikgrinaker @aliher1911 is this the type of problem that the retry logic we just implemented is aiming to fix?

erikgrinaker commented 1 year ago

No, this is unrelated. This is an assertion failure because a closed timestamp regression has made its way into the Raft log. This could be due to clock/hardware issues, or it could be due to bugs around closed timestamps and reproposals that @tbg has been hunting down.

In any case, to get past the error, you can set COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false, as described in the error message. Because the assertion failure is in the liveness range, this should be harmless (the only effect would be on rangefeeds or follower reads, which are not relevant here).
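For example, here is a minimal sketch of applying the workaround when starting a node. Only the environment variable itself comes from the error message above; the hostnames, paths, and flag values are illustrative placeholders:

```sh
# Disable the closed-timestamp assertion so the node can apply the
# offending Raft log entries (variable name taken from the fatal error).
export COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false

# Hypothetical start invocation; adapt addresses, certs, and store paths
# to your own deployment.
cockroach start \
  --join=node1:26257,node2:26257,node3:26257 \
  --listen-addr=:26257 \
  --http-addr=:8080 \
  --store=/var/lib/cockroach \
  --certs-dir=/var/lib/cockroach/certs
```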

gregory112 commented 1 year ago

Thanks, adding that environment variable lets all nodes run again. This is related to another issue I reported: https://github.com/cockroachdb/cockroach/issues/102316. Is it going to be harmless if I let all nodes keep running with this environment variable?

erikgrinaker commented 1 year ago

Once you've gotten the cluster back up and running, you should be able to restart the nodes without the envvar again. There were a few pending commands in the Raft log that violated the assertion; once those commands have been applied, the envvar is no longer needed.
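A hedged sketch of that follow-up restart, assuming the nodes run under systemd with a unit named cockroach and that the variable was set via a drop-in file (both assumptions, not details from this thread):

```sh
# Remove the hypothetical drop-in that exported
# COCKROACH_RAFT_CLOSEDTS_ASSERTIONS_ENABLED=false, then restart the
# nodes one at a time, letting each rejoin before moving on.
sudo rm /etc/systemd/system/cockroach.service.d/closedts-override.conf
sudo systemctl daemon-reload
sudo systemctl restart cockroach
```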

gregory112 commented 1 year ago

Alright, thanks. But in my experience, killing the cluster and starting from scratch with a full cluster restore usually leads to the same problem again after a while, so I think I'll leave it as is.

Should this issue be closed or left open? Is there anything left to be done here?

erikgrinaker commented 1 year ago

> Alright, thanks. But in my experience, killing the cluster and starting from scratch with a full cluster restore usually leads to the same problem again after a while, so I think I'll leave it as is.

This, and the error in #102316, indicate problems with clock accuracy in this cluster. CockroachDB requires reasonably accurate clocks, with a worst-case clock skew well below 500 ms. Keeping this envvar set will only paper over the underlying clock issue, and may affect the correctness of rangefeeds if you use them (e.g. for changefeeds, follower reads, or internal system operations like the propagation of settings or zone configurations).
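As a first check of the clock hypothesis, it is worth verifying NTP synchronization on every node. A minimal sketch, assuming chrony or systemd-timesyncd is in use (neither is confirmed in this thread):

```sh
# With chrony: the "System time" line reports the current offset from
# NTP time, which should be far below the 500 ms skew budget.
chronyc tracking

# With systemd-timesyncd: confirm the clock is reported as synchronized.
timedatectl status | grep -i 'synchronized'
```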

erikgrinaker commented 1 year ago

I'll close this out since this seems to be an infrastructure issue rather than a CockroachDB issue, but feel free to reopen if you find indications otherwise.