cockroachdb / cockroach

rpc: repeated attempts to contact a decommissioned dead node unless nodes are restarted #85734

Open alistair-roacher opened 2 years ago

alistair-roacher commented 2 years ago

Describe the problem Repeated attempts are made by surviving nodes in a CRDB cluster to contact a dead node that has subsequently been tidied up using the cockroach node decommission command.

Errors similar to the following are written to the cockroach log stream every few seconds:

W220808 12:37:35.423356 313918 google.golang.org/grpc/grpclog/external/org_golang_google_grpc/grpclog/component.go:41 ⋮ [-] 10217 ‹[core]›‹grpc: addrConn.createTransport failed to connect to {ip-192-168-30-204.eu-west-2.compute.internal:26258 ip-192-168-30-204.eu-west-2.compute.internal:26258 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp: operation was canceled"›

These messages only stop being reported after the node reporting them has been restarted.
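
To get a sense of the rate, a rough way to see how noisy this gets on a surviving node is to grep its log for the dial error. The log path below is an assumption for a default deployment; adjust it for your environment:

  # Count the dial failures reported so far on this node.
  grep -c 'transport: Error while dialing' cockroach-data/logs/cockroach.log

  # Or watch them arrive in real time.
  tail -f cockroach-data/logs/cockroach.log | grep 'addrConn.createTransport failed'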

To Reproduce

If possible, provide steps to reproduce the behavior:

  1. Set up a 3-node CRDB cluster and check that all nodes are up and no ranges are under-replicated.
  2. Stop one of the nodes, wipe/replace its store disk and restart the node. For example, in K8s you could scale the statefulset to 2, delete the PVC/PV for the stopped node and scale the statefulset back up to 3 (see the command sketch after this list).
  3. A new node joins the cluster (it has a different node_id from the node that was stopped in step 2).
  4. Decommission the dead node by running: cockroach node decommission
  5. Check the cockroach.log file in the logs directory on the other 2 nodes. Messages containing the string "transport: Error while dialing dial tcp: operation was canceled" will be produced every few seconds.
  6. Restart one of the nodes that has been part of the cluster from the start - in K8s you can delete the pod and the node will automatically restart. No new log messages will be produced on the restarted node.
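
For a K8s deployment, steps 2-5 look roughly like the sketch below. The statefulset/PVC names, node ID and connection flags are assumptions for illustration; substitute your own:

  # Step 2: stop the highest-ordinal node and wipe its store (names are illustrative).
  kubectl scale statefulset cockroachdb --replicas=2
  kubectl delete pvc datadir-cockroachdb-2          # assumed PVC naming convention
  kubectl scale statefulset cockroachdb --replicas=3

  # Step 3: confirm a new node_id has joined and the old node is shown as dead.
  cockroach node status --insecure --host=cockroachdb-public:26257

  # Step 4: decommission the dead node (node ID 4 is a placeholder).
  cockroach node decommission 4 --insecure --host=cockroachdb-public:26257

  # Step 5: look for the transport errors on a surviving node.
  kubectl exec cockroachdb-0 -- \
    grep 'transport: Error while dialing' cockroach-data/logs/cockroach.log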

Expected behavior Once the dead node has been successfully decommissioned, all other nodes in the cluster should immediately stop attempting to contact it (i.e. the surviving nodes should not have to be restarted).
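
For what it's worth, the decommission itself reports as complete before these errors appear; the membership state of the old node can be checked with something like the following (connection flags are placeholders):

  # Confirm the dead node has reached the "decommissioned" membership state
  # before checking the surviving nodes' logs.
  cockroach node status --decommission --insecure --host=cockroachdb-public:26257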

Additional context What was the impact?

The impact is that all other nodes in the cluster need to be restarted to stop the connection attempts to the dead node and the spurious transport log messages. Having so many messages dumped to the log stream that appear to indicate node-to-node connectivity issues could easily mask genuine issues.
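
In K8s, one way to perform that restart and silence the messages is a rolling restart of the statefulset; a sketch, with the statefulset name assumed:

  # Rolling restart of the surviving nodes to clear the stale dial attempts
  # (statefulset name is illustrative).
  kubectl rollout restart statefulset cockroachdb
  kubectl rollout status statefulset cockroachdb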

This is part of a monthly repaving exercise where one region (3 nodes) of a 3-region (9-node) cluster is shut down, deleted and restarted clean. The remaining 6 nodes in the other 2 regions report the transport errors continuously until they themselves are restarted.
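
For reference, the dead node IDs from the repaved region can be passed to a single decommission command; a sketch with placeholder node IDs and connection flags:

  # Decommission all three dead nodes from the repaved region at once
  # (node IDs 7 8 9 and the host are placeholders).
  cockroach node decommission 7 8 9 --insecure --host=cockroachdb-public:26257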

Jira issue: CRDB-18408

knz commented 2 years ago

cc @shralex for triage

mwang1026 commented 2 years ago

Plan is to look at this in stability, and to assign it at that time as well.

a-robinson commented 2 years ago

I've been seeing this quite frequently as well, for what it's worth. I don't know whether it causes any real issues outside of wasting small amounts of resources attempting to open connections and spamming the logs, but it is a tad concerning.

github-actions[bot] commented 8 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!