tbg opened 2 years ago
cc @cockroachdb/replication
I think it would be helpful to reproduce this with the env var GRPC_GO_LOG_SEVERITY_LEVEL=info
(or changing this^1 to info).
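For reference, here is a rough sketch of the programmatic equivalent (the env var itself needs no code change); this is illustrative only and not necessarily how we wire up the logger at the location linked above:

```go
// Illustrative only: the programmatic equivalent of
// GRPC_GO_LOG_SEVERITY_LEVEL=info, using grpc-go's grpclog package.
package rpcsketch

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func enableGRPCInfoLogging() {
	// By default grpc-go discards info-level output; route info, warning,
	// and error to stderr so dial/redial activity becomes visible.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stderr, os.Stderr, os.Stderr))
}
```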
But I think this has to do with how the connection pool works. Not that it should work this way, but what I think happens is that we have a connection in the pool (on n1, the node that wasn't restarted, per the artifacts) that still looks "healthy" even though gRPC's first dial has already hit an error (guessing: connection refused). The conn is then handed out here and survives the error check:
but when we then actually try to create a FlowStream client and connect, we trigger a redial (or something like that) and run into the onlyOnceDialer
(which we use internally to make sure gRPC always lets our RPC wrappers establish a new connection themselves; this has to do with version checks etc.) here:
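Roughly, the shape of such a dial-once dialer looks something like the sketch below (names are made up here, this is not our actual onlyOnceDialer): the first dial goes through, and any transparent redial attempt by gRPC fails, forcing connections to be re-established through our own wrappers.

```go
// Sketch of a "dial once" dialer: the first dial succeeds, later redial
// attempts from gRPC fail, so reconnects must go back through our own
// connection setup (heartbeats, version checks, ...). Names are illustrative.
package rpcsketch

import (
	"context"
	"errors"
	"net"
	"sync"

	"google.golang.org/grpc"
)

type onceDialer struct {
	mu     sync.Mutex
	dialed bool
}

func (d *onceDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		// Second dial attempt: refuse, so gRPC can't silently reconnect.
		return nil, errors.New("cannot reuse client connection")
	}
	d.dialed = true
	var nd net.Dialer
	return nd.DialContext(ctx, "tcp", addr)
}

func dialOnce(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	d := &onceDialer{}
	return grpc.DialContext(ctx, addr, grpc.WithContextDialer(d.dial))
}
```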
This doesn't all obviously check out, since there is a health check when pulling the connection from the pool,
and I don't know why we're getting past it.
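To illustrate what I mean by "looks healthy": a checkout-time check along the following lines (our pool's real check is heartbeat-based, so this is only a sketch of the suspected race) can still hand out a conn whose transport has already failed, because the connectivity state machine lags behind the dial error.

```go
// Sketch of a connectivity-state based health check at conn checkout time;
// not cockroach's actual pool logic, just an illustration of the race.
package rpcsketch

import (
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

func checkoutConn(conn *grpc.ClientConn) (*grpc.ClientConn, error) {
	switch s := conn.GetState(); s {
	case connectivity.TransientFailure, connectivity.Shutdown:
		return nil, fmt.Errorf("refusing unhealthy connection in state %s", s)
	default:
		// Ready, Idle, or Connecting: the underlying transport may already
		// have hit "connection refused", but the state hasn't transitioned
		// yet, so the conn still passes this check and gets handed out.
		return conn, nil
	}
}
```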
It might be helpful to not try to reproduce this via the full version-upgrade test, but with a plain test that repeatedly drains and instantly restarts a node and then runs a distributed query against it.
Or maybe even better, add some ad-hoc code to a build that verifies an invariant we want: once the restart of the down node has been acked, we should immediately stop seeing any of these connection errors on the other nodes. (Needs some trickery to make it work: probably best to run in local mode, touch a magic file in the other nodes' store dirs right after the down node acks the restart, and check for that marker file when the error shows up on the other nodes; see the sketch below.)
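Something like the following sketch, where all names and hook points are invented for illustration:

```go
// Hypothetical ad-hoc invariant check: after the down node acks its restart,
// the test touches a marker file in every other node's store dir; if one of
// these connection errors then still shows up on another node, fail loudly.
package rpcsketch

import (
	"log"
	"os"
	"path/filepath"
)

const restartAckedMarker = "RESTART_ACKED" // hypothetical file name

// touchMarker would be called by the test right after the restarted node
// acks readiness.
func touchMarker(storeDir string) error {
	return os.WriteFile(filepath.Join(storeDir, restartAckedMarker), nil, 0644)
}

// assertNoConnErrAfterRestart would be called from the code path on the
// other nodes that observes the connection error.
func assertNoConnErrAfterRestart(storeDir string, connErr error) {
	if _, err := os.Stat(filepath.Join(storeDir, restartAckedMarker)); err == nil {
		log.Fatalf("saw %v after peer acked restart; invariant violated", connErr)
	}
}
```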
Btw, I think this may have gotten more frequent because we now gracefully drain nodes in acceptance/version-upgrade. This means there is no longer any temporary unavailability in the cluster that could have masked these kinds of failures. I think we can paper over this for now with a sleep; I don't think this is a new regression.
Is this the same as https://github.com/cockroachdb/cockroach/issues/44101 ?
Looks like it.
Describe the problem
Extracted from an acceptance/version-upgrade failure^1. We're likely deflaking this test by inserting a sleep but should investigate at some point.
To Reproduce
Potentially by stressing acceptance/version-upgrade at the SHA from ^1.
Expected behavior
When a node has signaled readiness, it ought to be able to serve the FlowStream RPC. It's unclear whether the problem is in the "other node's" RPC layer (is it holding on to, and using, a connection to the peer from before the restart that hasn't yet been torn down?) or whether the node is somehow serving requests before it is truly ready to do so, resulting in transient errors.
Artifacts attached.
debug.zip artifacts.zip
Note: putting this into Replication team because the KV area nominally owns RPC and I started looking into this.
Jira issue: CRDB-19450
Epic: CRDB-39899