This is related to #70111, in that we don't currently attempt to form a full mesh network -- a node will only connect to a different node if some component needs to reach it. I don't know exactly if or how this impacts the proposal here, but it's possible that we may find ourselves in a situation where a node connects to a peer but the peer does not attempt to connect back (unless we explicitly make it so).
> we may find ourselves in a situation where a node connects to a peer but the peer does not attempt to connect back (unless we explicitly make it so).
100% agree, hence the proposal above to:
> When receiving an incoming connection from another node, refuse to respond to heartbeats until we get a successful outgoing dial to the same node. In other words: when n2 connects to n1, have n1 refuse the conn until it can dial back to n2 successfully.
Do we have an ETA for this one? We recently experienced a very similar issue: a host's DNS was unavailable, but the one-direction connection/dial still worked, and the cluster rejected almost all incoming client connections, causing a production outage. Thanks!
As seen in https://github.com/cockroachlabs/support/issues/1875, 22.2 includes two additional protections that may help here, by preventing lease transfers to nodes that aren't able to apply the lease:
These are particularly relevant for the case where a lease is transferred to a node which is able to establish outbound connections (and thus can successfully heartbeat), but can't receive inbound connections (in which case it won't catch up on the Raft log and can't apply the lease). Previously, this would cause the range to become unavailable. However, this doesn't help when leases are already present on the semi-partitioned node, so the proposal here is still relevant (in particular by preventing the node from heartbeating, assuming the liveness leaseholder is also one-way partitioned from the node).
> Do we have an ETA for this one?
No concrete ETA I'm afraid, but we're very aware of this class of problem.
A simpler thing than what is described here, and which feels worth doing, is finding a way to at least observe and identify when one-way partitions do happen. Today we can't tell. The latency tracking code is happy to use just one direction of connection to update its view.
> The latency tracking code is happy to use just one direction of connection to update its view.
Technically both sides of the connection maintain the latency metric for "their side" and we can see both in the UI (across the diagonal). We'd lose information if we merged them together.
In addition to this, we should require that all connection classes can be established. We have seen cases where nodes were able to establish one but not the other connection class due to a bug in the RPC circuit breakers.
Looking at solutions for this (both the initial and ongoing partitions), there are two general ways it could be done:
1) Don't send a heartbeat `PingResponse` unless a reverse connection is set up. If the reverse connection cannot be established or is broken, stop sending `PingResponse` messages. (This is what is described in the original proposal.)
2) Add a field to `PingResponse` called `badReverseConnections` which specifies the list of reverse connections that are in an error state. A receiver of a `PingResponse` would treat a message with this field set as an error to the caller.
The second is preferable since it allows differentiating why the heartbeats are not making it back and carries more information, such as the status of each connection class. It is not exactly clear what we should do if some connection classes can connect backward and others can't, but at least we would have the information for logging/debugging purposes.
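For concreteness, here is a minimal sketch of what option 2 could look like on the calling side. The `ConnectionClass` type, the stand-in `PingResponse` struct, and the `checkPingResponse` helper are illustrative assumptions, not CockroachDB's actual protos or rpc package:

```go
package rpcsketch

import "fmt"

// ConnectionClass stands in for the real rpc connection class enum.
type ConnectionClass int

const (
	DefaultClass ConnectionClass = iota
	SystemClass
)

// PingResponse is a simplified stand-in for the heartbeat protobuf message,
// with the proposed field added.
type PingResponse struct {
	// BadReverseConnections lists the connection classes for which the
	// responder could not establish (or has lost) a connection back to
	// the caller.
	BadReverseConnections []ConnectionClass
}

// checkPingResponse turns a response with bad reverse connections into an
// error, so the caller's heartbeat loop treats it like a failed heartbeat.
func checkPingResponse(resp *PingResponse) error {
	if len(resp.BadReverseConnections) > 0 {
		return fmt.Errorf("remote node cannot dial back on connection class(es) %v",
			resp.BadReverseConnections)
	}
	return nil
}
```

Presumably the caller would then feed this error into the same failure handling as a timed-out heartbeat.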
Regarding the gossip of the data and its usage for blocking. I agree it is useful to put the data in gossip, but I'm less clear about exactly how it should be used. If we are blocking certain connections based on "hearsay" then it seems possible that a bad node can cause a larger outage than it should (in the example, what if X is connected to both Y and Z and both report the other through gossip)? I'm not exactly sure about the conditions where this would happen, but it seems possible. Unless anyone feels strongly I wasn't going to use this "proxy" information initially, but it will be available in the logs / gossip dump.
I think we can start simple with 1 and not send ping responses. We can log an error on the rejecting node, which I think should be sufficient and wouldn't need any protocol changes.
As for the gossip aspect, let's defer that. This is related to global failure detection, and there are multiple conversations going on about this, so let's not implement anything until we have a clear picture of how failure detection should work in general (in particular wrt. leases).
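As a rough illustration of that starting point, here is a naive sketch of the server-side gate, using stand-in types and a hypothetical `reverseConnHealthy` hook rather than the real rpc package; the next comment explains why the actual implementation turned out to be more involved:

```go
package rpcsketch

import (
	"context"
	"log"
	"time"
)

// NodeID and the Ping messages below are simplified stand-ins for the real
// CockroachDB types.
type NodeID int32

type PingRequest struct{ OriginNodeID NodeID }
type PingResponse struct{}

type heartbeatServer struct {
	// reverseConnHealthy is a hypothetical hook: "do we currently have a
	// healthy outgoing connection to the node that is pinging us?"
	reverseConnHealthy func(NodeID) bool
}

// Ping withholds its response until we have a healthy reverse connection to
// the pinging node, logging an error on the rejecting node; if the caller's
// context expires first, the heartbeat simply times out.
func (s *heartbeatServer) Ping(ctx context.Context, req *PingRequest) (*PingResponse, error) {
	for !s.reverseConnHealthy(req.OriginNodeID) {
		log.Printf("withholding heartbeat response to n%d: no healthy reverse connection",
			req.OriginNodeID)
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Second): // hypothetical retry interval
		}
	}
	return &PingResponse{}, nil
}
```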
The implementation of this was a little more complicated than I originally thought due to the behavior of the underlying connections. The original thought was that it would look something like this:
Time 0 - Node A connects to Node B
Time 1 - Node A calls `heartbeatClient.Ping` to Node B
Time 2 - Node B receives PingRequest from A - does not send PingResponse since no backward connection
Time 3 - Node B connects to Node A
Time 4 - Node B calls `heartbeatClient.Ping` to Node A
Time 5 - Node A receives PingRequest from B - notices that a connection attempt is in progress and waits for it
But this doesn't work because they are deadlocked: each node withholds its PingResponse until its own outgoing ping to the other succeeds, so both wait on each other forever. I was hoping to examine the state of the connection that is being used, since there are really three "important" states: 1) TCP connection set up, 2) PingRequest sent, 3) PingResponse received.
However, there does not appear to be a reliable way to tell when either 1 or 2 has completed, with the way our connections are set up today.
The approach I am looking at now is the following:
Time 0 - Node A connects to Node B
Time 1 - Node A calls `heartbeatClient.Ping` to Node B
Time 2 - Node B receives PingRequest from A - does not send PingResponse since no fully established backward connection
Time 3 - Node B connects to Node A on a new connection set up with `grpc.WithBlock()`
Time 4 - Node B waits until TCP connection is established
Time 5 - Node B sends a PingResponse to node A (on the original connection)
Time 6 - Node B closes the TCP connection it just established
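A minimal sketch of the dial-back step (Times 3 through 6 above), assuming a hypothetical `verifyDialback` helper and plain insecure credentials; real code would plug in the node's TLS configuration and address resolution:

```go
package rpcsketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// verifyDialback returns nil only if a transport-level connection back to
// addr can be established within the timeout.
func verifyDialback(ctx context.Context, addr string) error {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second) // hypothetical timeout
	defer cancel()

	// grpc.WithBlock() makes DialContext wait until the connection is
	// actually established (or the context expires) instead of returning
	// immediately and connecting lazily.
	conn, err := grpc.DialContext(ctx, addr,
		grpc.WithBlock(),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return err
	}
	// We only needed to prove that the reverse path works; close the
	// throwaway connection (Time 6) and answer on the original one.
	return conn.Close()
}
```

Node B would run something like this between Time 3 and Time 5, and only send the PingResponse on the original connection once it returns nil.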
The other alternative is to send an error back immediately on the first PingRequest, but that has the unfortunate impact of failing the first request sent and of not sending another PingResponse for a while.
Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this release-blocker affects.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Is your feature request related to a problem? Please describe.
Followup to #49220 / cockroachlabs/support#1690
We know of multiple situations where a cluster can find itself in an asymmetric partition, which causes all kinds of symptoms (#49220); these include at least:
It would be good if we had some automation to exclude nodes which appear to be partially partitioned away (and require operator attention).
Describe the solution you'd like
We could have a couple relatively simple mechanisms to protect a cluster:
Two point-to-point mechanisms, to protect against pairwise partitions:

- When receiving an incoming connection from another node, refuse to respond to heartbeats until we get a successful outgoing dial to the same node. In other words: when n2 connects to n1, have n1 refuse the conn until it can dial back to n2 successfully.
- When an outgoing dial fails, or a heartbeat fails, keep a timestamp of the failure for the remote node ID, and when receiving a heartbeat from that ID, refuse to respond to the heartbeat if there's a recent failure (and possibly actively close the connection). In other words, start refusing a heartbeat from n2 to n1 if n1 has failed to connect/heartbeat to n2 recently (see the sketch after this list).
Then a cluster-wide mechanism, to protect against global partitions (e.g. n1-n2 and n2-n3 can connect, but not n1-n3)
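To illustrate the second point-to-point mechanism, here is a minimal sketch of the recent-failure bookkeeping; the `dialFailures` type, the failure window, and the method names are assumptions for illustration only:

```go
package rpcsketch

import (
	"sync"
	"time"
)

// NodeID is a simplified stand-in for the real node ID type.
type NodeID int32

// dialFailures remembers, per remote node, when we last failed to dial or
// heartbeat it.
type dialFailures struct {
	mu     sync.Mutex
	window time.Duration // how long after a failure we keep refusing heartbeats
	last   map[NodeID]time.Time
}

// recordFailure is called whenever an outgoing dial or heartbeat to id fails.
func (d *dialFailures) recordFailure(id NodeID) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.last == nil {
		d.last = make(map[NodeID]time.Time)
	}
	d.last[id] = time.Now()
}

// shouldRefuseHeartbeat reports whether an incoming heartbeat from id should
// be refused because we recently failed to reach it ourselves.
func (d *dialFailures) shouldRefuseHeartbeat(id NodeID) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	t, ok := d.last[id]
	return ok && time.Since(t) < d.window
}
```

The heartbeat handler would consult `shouldRefuseHeartbeat` for the pinging node's ID and refuse (and possibly actively close) the connection while it returns true.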
Jira issue: CRDB-17572
gz#13169
Epic CRDB-2488