tbg opened 2 years ago
cc @cockroachdb/replication
I think it would be helpful to reproduce this with the env var GRPC_GO_LOG_SEVERITY_LEVEL=info
(or changing this^1 to info).
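For reference, here is a rough sketch of the programmatic equivalent (the env var itself needs no code change); this is illustrative only and not necessarily how we wire up the logger at the location linked above:

```go
// Illustrative only: the programmatic equivalent of
// GRPC_GO_LOG_SEVERITY_LEVEL=info, using grpc-go's grpclog package.
package rpcsketch

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func enableGRPCInfoLogging() {
	// By default grpc-go discards info-level output; route info, warning,
	// and error to stderr so dial/redial activity becomes visible.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stderr, os.Stderr, os.Stderr))
}
```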
But I think this has to do with how the connection pool works. Not that it should work this way, but what I think happens is that we have a connection in the pool (on n1, the node that wasn't restarted, per the artifacts) that still looks "healthy" even though gRPC's first dial has already hit an error (guessing: connection refused). The conn is then handed out here and survives the error check:
but when we then actually try to create a FlowStream client and connect, we trigger a redial (or something like that) and run into the onlyOnceDialer
(which we use internally to make sure gRPC always lets our RPC wrappers establish a new connection themselves; this has to do with version checks etc.) here:
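Roughly, the shape of such a dial-once dialer looks something like the sketch below (names are made up here, this is not our actual onlyOnceDialer): the first dial goes through, and any transparent redial attempt by gRPC fails, forcing connections to be re-established through our own wrappers.

```go
// Sketch of a "dial once" dialer: the first dial succeeds, later redial
// attempts from gRPC fail, so reconnects must go back through our own
// connection setup (heartbeats, version checks, ...). Names are illustrative.
package rpcsketch

import (
	"context"
	"errors"
	"net"
	"sync"

	"google.golang.org/grpc"
)

type onceDialer struct {
	mu     sync.Mutex
	dialed bool
}

func (d *onceDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		// Second dial attempt: refuse, so gRPC can't silently reconnect.
		return nil, errors.New("cannot reuse client connection")
	}
	d.dialed = true
	var nd net.Dialer
	return nd.DialContext(ctx, "tcp", addr)
}

func dialOnce(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	d := &onceDialer{}
	return grpc.DialContext(ctx, addr, grpc.WithContextDialer(d.dial))
}
```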
This doesn't all obviously check out, since there is a health check when pulling the connection from the pool,
and I don't know why we're getting past it.
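To illustrate what I mean by "looks healthy": a checkout-time check along the following lines (our pool's real check is heartbeat-based, so this is only a sketch of the suspected race) can still hand out a conn whose transport has already failed, because the connectivity state machine lags behind the dial error.

```go
// Sketch of a connectivity-state based health check at conn checkout time;
// not cockroach's actual pool logic, just an illustration of the race.
package rpcsketch

import (
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

func checkoutConn(conn *grpc.ClientConn) (*grpc.ClientConn, error) {
	switch s := conn.GetState(); s {
	case connectivity.TransientFailure, connectivity.Shutdown:
		return nil, fmt.Errorf("refusing unhealthy connection in state %s", s)
	default:
		// Ready, Idle, or Connecting: the underlying transport may already
		// have hit "connection refused", but the state hasn't transitioned
		// yet, so the conn still passes this check and gets handed out.
		return conn, nil
	}
}
```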
It might be helpful to not try to reproduce this via the full version-upgrade test, but with a plain test that repeatedly drains and instantly restarts a node and then runs a distributed query against it.
Or maybe even better, add some ad-hoc code to a build that verifies an invariant we want: once the restart of the down node has been acked, we should immediately stop seeing any of these connection errors on the other nodes. (Needs some trickery to make it work: probably best to run in local mode, touch a magic file in the other nodes' store dirs right after the down node acks the restart, and check for that marker file when the error shows up on the other nodes; see the sketch below.)
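Something like the following sketch, where all names and hook points are invented for illustration:

```go
// Hypothetical ad-hoc invariant check: after the down node acks its restart,
// the test touches a marker file in every other node's store dir; if one of
// these connection errors then still shows up on another node, fail loudly.
package rpcsketch

import (
	"log"
	"os"
	"path/filepath"
)

const restartAckedMarker = "RESTART_ACKED" // hypothetical file name

// touchMarker would be called by the test right after the restarted node
// acks readiness.
func touchMarker(storeDir string) error {
	return os.WriteFile(filepath.Join(storeDir, restartAckedMarker), nil, 0644)
}

// assertNoConnErrAfterRestart would be called from the code path on the
// other nodes that observes the connection error.
func assertNoConnErrAfterRestart(storeDir string, connErr error) {
	if _, err := os.Stat(filepath.Join(storeDir, restartAckedMarker)); err == nil {
		log.Fatalf("saw %v after peer acked restart; invariant violated", connErr)
	}
}
```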
Btw, I think this may have gotten more frequent because we now gracefully drain nodes in acceptance/version-upgrade. This means there is no longer any temporary unavailability in the cluster that could have masked these kinds of failures. I think we can paper over this for now with a sleep; I don't think this is a new regression.
Is this the same as https://github.com/cockroachdb/cockroach/issues/44101 ?
Looks like it.
Describe the problem
Extracted from an acceptance/version-upgrade failure^1. We're likely deflaking this test by inserting a sleep but should investigate at some point.
To Reproduce
Potentially by stressing acceptance/version-upgrade at the SHA from ^1.
Expected behavior
When a node has signaled readiness, it ought to be able to serve the FlowStream RPC. It's unclear whether the problem is in the "other node's" RPC layer (is it holding on to, and using, a connection to the peer from before the restart that hasn't yet been torn down?) or whether the node is somehow serving requests before it is truly ready to do so, resulting in transient errors.
Artifacts attached.
debug.zip artifacts.zip
Note: putting this into Replication team because the KV area nominally owns RPC and I started looking into this.
Jira issue: CRDB-19450
Epic: CRDB-39899