alistair-roacher opened 2 years ago
cc @shralex for triage
Plan is to look at this in stability, and to assign it at that time as well.
I've been seeing this quite frequently as well, for what it's worth. I don't know whether it causes any real issues outside of wasting small amounts of resources attempting to open connections and spamming the logs, but it is a tad concerning.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Describe the problem
Repeated attempts are made by surviving nodes in a CRDB cluster to contact a dead node that has subsequently been tidied up using the cockroach node decommission command.
Errors similar to the following are written to the cockroach log stream every few seconds:
W220808 12:37:35.423356 313918 google.golang.org/grpc/grpclog/external/org_golang_google_grpc/grpclog/component.go:41 ⋮ [-] 10217 ‹[core]›‹grpc: addrConn.createTransport failed to connect to {ip-192-168-30-204.eu-west-2.compute.internal:26258 ip-192-168-30-204.eu-west-2.compute.internal:26258 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp: operation was canceled"›
These messages only stop being reported after the node reporting them has been restarted.
To Reproduce
Expected behavior
Once the dead node has been successfully decommissioned, all other nodes in the cluster should immediately stop attempting to contact it (i.e. the surviving nodes should not have to be restarted).
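The expected behavior above amounts to eagerly tearing down the cached connection when a peer is decommissioned. A minimal sketch of that idea, assuming a hypothetical peer set keyed by address where each peer's dial loop is tied to a cancellable context:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// peerSet is a hypothetical sketch of the desired behavior: each peer's
// redial loop runs under a per-peer context, and decommissioning a node
// cancels that context immediately, so no further dial attempts (or log
// spam) can occur. This is not CockroachDB's actual implementation.
type peerSet struct {
	mu     sync.Mutex
	cancel map[string]context.CancelFunc
}

func newPeerSet() *peerSet {
	return &peerSet{cancel: make(map[string]context.CancelFunc)}
}

// add registers a peer and returns the context its dial loop should use.
func (p *peerSet) add(addr string) context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	p.mu.Lock()
	p.cancel[addr] = cancel
	p.mu.Unlock()
	return ctx
}

// decommission cancels the peer's context and drops it from the set,
// terminating any redial loop bound to that context.
func (p *peerSet) decommission(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if cancel, ok := p.cancel[addr]; ok {
		cancel()
		delete(p.cancel, addr)
	}
}

func main() {
	ps := newPeerSet()
	ctx := ps.add("ip-192-168-30-204.eu-west-2.compute.internal:26258")
	ps.decommission("ip-192-168-30-204.eu-west-2.compute.internal:26258")
	// A cancelled context means the dial loop exits with no restart needed.
	fmt.Println(ctx.Err() != nil)
}
```

Any loop structured like `dialLoop` above would observe the cancellation on its next iteration and exit, without requiring a node restart.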
Additional data / screenshots
Environment:
Client app [e.g. cockroach sql, JDBC, ...]: No client required

Additional context
What was the impact?
The impact is that all other nodes in the cluster need to be restarted to stop the connection attempts to the dead node and the spurious transport log messages. Having so many messages dumped to the log stream that appear to indicate node-to-node connectivity issues could easily mask genuine issues.
This is part of a monthly repaving exercise where one region (3 nodes) of a 3-region (9-node) cluster is shut down, deleted, and restarted clean. The remaining 6 nodes in the other 2 regions report the transport errors continuously until they themselves are restarted.
Jira issue: CRDB-18408