This is related to #70111, in that we don't currently attempt to form a full mesh network -- a node will only connect to a different node if some component needs to reach it. I don't know exactly if or how this impacts the proposal here, but it's possible that we may find ourselves in a situation where a node connects to a peer but the peer does not attempt to connect back (unless we explicitly make it so).
> we may find ourselves in a situation where a node connects to a peer but the peer does not attempt to connect back (unless we explicitly make it so).
100% agree, hence the proposal above to:
> When receiving an incoming connection from another node, refuse to respond to heartbeats until we get a successful outgoing dial to the same node. In other words: when n2 connects to n1, have n1 refuse the conn until it can dial back to n2 successfully.
Do we have an ETA for this one? We recently experienced a very similar issue: a host's DNS was unavailable, but the one-direction connection/dial still worked, and the cluster rejected almost all incoming client connections, causing a production outage. Thanks!
As seen in https://github.com/cockroachlabs/support/issues/1875, 22.2 includes two additional protections that may help here, by preventing lease transfers to nodes that aren't able to apply the lease:
These are particularly relevant for the case where a lease is transferred to a node which is able to establish outbound connections (and thus can successfully heartbeat), but can't receive inbound connections (in which case it won't catch up on the Raft log and can't apply the lease). Previously, this would cause the range to become unavailable. However, this doesn't help when leases are already present on the semi-partitioned node, so the proposal here is still relevant (in particular by preventing the node from heartbeating, assuming the liveness leaseholder is also one-way partitioned from the node).
> Do we have an ETA for this one?
No concrete ETA I'm afraid, but we're very aware of this class of problem.
A simpler thing than what is described here, and which feels worth doing, is finding a way to at least observe and identify when one-way partitions do happen. Today we can't tell. The latency tracking code is happy to use just one direction of connection to update its view.
> The latency tracking code is happy to use just one direction of connection to update its view.
Technically both sides of the connection maintain the latency metric for "their side" and we can see both in the UI (across the diagonal). We'd lose information if we merged them together.
In addition to this, we should require that all connection classes can be established. We have seen cases where nodes were able to establish one but not the other connection class due to a bug in the RPC circuit breakers.
Looking at solutions for this (both the initial and ongoing partitions), there are two general ways it could be done:
1) Don't send a heartbeat `PingResponse` unless a reverse connection is set up. If the reverse connection cannot be established or is broken, stop sending `PingResponse` messages. (This is what is described in the original proposal.)
2) Add a field to `PingResponse` called `badReverseConnections` which specifies the list of reverse connections that are in an error state. A receiver of a `PingResponse` would treat a message with this field set as an error to the caller.
The second is preferable since it allows differentiating why the heartbeats are not making it back and carries more information, such as the status of each connection class. It is not exactly clear what we should do if some connection classes can connect backward and others can't, but at least we would have the information for logging/debugging purposes.
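For concreteness, here is a minimal sketch of what option 2 could look like on the calling side. The `ConnectionClass` type, the stand-in `PingResponse` struct, and the `checkPingResponse` helper are illustrative assumptions, not CockroachDB's actual protos or rpc package:

```go
package rpcsketch

import "fmt"

// ConnectionClass stands in for the real rpc connection class enum.
type ConnectionClass int

const (
	DefaultClass ConnectionClass = iota
	SystemClass
)

// PingResponse is a simplified stand-in for the heartbeat protobuf message,
// with the proposed field added.
type PingResponse struct {
	// BadReverseConnections lists the connection classes for which the
	// responder could not establish (or has lost) a connection back to
	// the caller.
	BadReverseConnections []ConnectionClass
}

// checkPingResponse turns a response with bad reverse connections into an
// error, so the caller's heartbeat loop treats it like a failed heartbeat.
func checkPingResponse(resp *PingResponse) error {
	if len(resp.BadReverseConnections) > 0 {
		return fmt.Errorf("remote node cannot dial back on connection class(es) %v",
			resp.BadReverseConnections)
	}
	return nil
}
```

Presumably the caller would then feed this error into the same failure handling as a timed-out heartbeat.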
Regarding the gossip of the data and its usage for blocking. I agree it is useful to put the data in gossip, but I'm less clear about exactly how it should be used. If we are blocking certain connections based on "hearsay" then it seems possible that a bad node can cause a larger outage than it should (in the example, what if X is connected to both Y and Z and both report the other through gossip)? I'm not exactly sure about the conditions where this would happen, but it seems possible. Unless anyone feels strongly I wasn't going to use this "proxy" information initially, but it will be available in the logs / gossip dump.
I think we can start simple with 1 and not send ping responses. We can log an error on the rejecting node, which I think should be sufficient and wouldn't need any protocol changes.
As for the gossip aspect, let's defer that. This is related to global failure detection, and there are multiple conversations going on about this, so let's not implement anything until we have a clear picture of how failure detection should work in general (in particular wrt. leases).
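As a rough illustration of that starting point, here is a naive sketch of the server-side gate, using stand-in types and a hypothetical `reverseConnHealthy` hook rather than the real rpc package; the next comment explains why the actual implementation turned out to be more involved:

```go
package rpcsketch

import (
	"context"
	"log"
	"time"
)

// NodeID and the Ping messages below are simplified stand-ins for the real
// CockroachDB types.
type NodeID int32

type PingRequest struct{ OriginNodeID NodeID }
type PingResponse struct{}

type heartbeatServer struct {
	// reverseConnHealthy is a hypothetical hook: "do we currently have a
	// healthy outgoing connection to the node that is pinging us?"
	reverseConnHealthy func(NodeID) bool
}

// Ping withholds its response until we have a healthy reverse connection to
// the pinging node, logging an error on the rejecting node; if the caller's
// context expires first, the heartbeat simply times out.
func (s *heartbeatServer) Ping(ctx context.Context, req *PingRequest) (*PingResponse, error) {
	for !s.reverseConnHealthy(req.OriginNodeID) {
		log.Printf("withholding heartbeat response to n%d: no healthy reverse connection",
			req.OriginNodeID)
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Second): // hypothetical retry interval
		}
	}
	return &PingResponse{}, nil
}
```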
The implementation of this was a little more complicated than I originally thought due to the behavior of the underlying connections. The original thought was that it would look something like this:
Time 0 - Node A connects to Node B
Time 1 - Node A calls `heartbeatClient.Ping` to Node B
Time 2 - Node B receives PingRequest from A - does not send PingResponse since no backward connection
Time 3 - Node B connects to Node A
Time 4 - Node B calls `heartbeatClient.Ping` to Node A
Time 5 - Node A receives PingRequest from B - notices that a connection attempt is in progress and waits for it
But this doesn't work because they are deadlocked: each node withholds its PingResponse until its own outgoing ping to the other succeeds, so both wait on each other forever. I was hoping to examine the state of the connection that is being used, since there are really three "important" states: 1) TCP connection set up, 2) PingRequest sent, 3) PingResponse received.
However, there does not appear to be a reliable way to tell when either 1 or 2 has completed, with the way our connections are set up today.
The approach I am looking at now is the following:
Time 0 - Node A connects to Node B
Time 1 - Node A calls `heartbeatClient.Ping` to Node B
Time 2 - Node B receives PingRequest from A - does not send PingResponse since no fully established backward connection
Time 3 - Node B connects to Node A on a new connection set up with `grpc.WithBlock()`
Time 4 - Node B waits until TCP connection is established
Time 5 - Node B sends a PingResponse to node A (on the original connection)
Time 6 - Node B closes the TCP connection it just established
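A minimal sketch of the dial-back step (Times 3 through 6 above), assuming a hypothetical `verifyDialback` helper and plain insecure credentials; real code would plug in the node's TLS configuration and address resolution:

```go
package rpcsketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// verifyDialback returns nil only if a transport-level connection back to
// addr can be established within the timeout.
func verifyDialback(ctx context.Context, addr string) error {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second) // hypothetical timeout
	defer cancel()

	// grpc.WithBlock() makes DialContext wait until the connection is
	// actually established (or the context expires) instead of returning
	// immediately and connecting lazily.
	conn, err := grpc.DialContext(ctx, addr,
		grpc.WithBlock(),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return err
	}
	// We only needed to prove that the reverse path works; close the
	// throwaway connection (Time 6) and answer on the original one.
	return conn.Close()
}
```

Node B would run something like this between Time 3 and Time 5, and only send the PingResponse on the original connection once it returns nil.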
The other alternative is to send an error back immediately on the first PingRequest, but that has the unfortunate impact of failing the first request sent and of not sending another PingResponse for a while.
Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this release-blocker affects.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Is your feature request related to a problem? Please describe.
Followup to #49220 / cockroachlabs/support#1690
We know of multiple situations where a cluster can find itself in an asymmetric partition, which causes all kinds of symptoms (#49220); these include at least:
It would be good if we had some automation to exclude nodes which appear to be partially partitioned away (and require operator attention).
Describe the solution you'd like
We could have a couple relatively simple mechanisms to protect a cluster:
Two point-to-point mechanisms, to protect against pairwise partitions:

- When receiving an incoming connection from another node, refuse to respond to heartbeats until we get a successful outgoing dial to the same node. In other words: when n2 connects to n1, have n1 refuse the conn until it can dial back to n2 successfully.
- When an outgoing dial fails, or a heartbeat fails, keep a timestamp of the failure for the remote node ID, and when receiving a heartbeat from that ID, refuse to respond to the heartbeat if there's a recent failure (and possibly actively close the connection). In other words, start refusing a heartbeat from n2 to n1 if n1 has failed to connect/heartbeat to n2 recently (see the sketch after this list).
Then a cluster-wide mechanism, to protect against global partitions (e.g. n1-n2 and n2-n3 can connect, but not n1-n3)
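To illustrate the second point-to-point mechanism, here is a minimal sketch of the recent-failure bookkeeping; the `dialFailures` type, the failure window, and the method names are assumptions for illustration only:

```go
package rpcsketch

import (
	"sync"
	"time"
)

// NodeID is a simplified stand-in for the real node ID type.
type NodeID int32

// dialFailures remembers, per remote node, when we last failed to dial or
// heartbeat it.
type dialFailures struct {
	mu     sync.Mutex
	window time.Duration // how long after a failure we keep refusing heartbeats
	last   map[NodeID]time.Time
}

// recordFailure is called whenever an outgoing dial or heartbeat to id fails.
func (d *dialFailures) recordFailure(id NodeID) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.last == nil {
		d.last = make(map[NodeID]time.Time)
	}
	d.last[id] = time.Now()
}

// shouldRefuseHeartbeat reports whether an incoming heartbeat from id should
// be refused because we recently failed to reach it ourselves.
func (d *dialFailures) shouldRefuseHeartbeat(id NodeID) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	t, ok := d.last[id]
	return ok && time.Since(t) < d.window
}
```

The heartbeat handler would consult `shouldRefuseHeartbeat` for the pinging node's ID and refuse (and possibly actively close) the connection while it returns true.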
Jira issue: CRDB-17572
gz#13169
Epic CRDB-2488