rpc: strenghthen the behavior of CockroachDB in "fail slow" TCP connect timeouts

knz commented 4 years ago

The RPC layer relies on the network reporting when a TCP connection fails, to decide that the other side is unreachable.

For example, this logic is used to determine when connectivity to a node is lost, or to skip over a replica while discovering a leaseholder.

TLDR: Today, the CockroachDB logic is optimized for the "fail fast" scenario, and we have virtually no testing for "fail slow" scenarios.

Background

Depending on the network configuration connections can fail in two ways:

they can fail fast, with a TCP RST sent immediately in response to the TCP SYN.

This results in the well-known "Connection refused" error. This is the default network configuration in most OSes when the target IP address is valid, but there is no service listening on the desired port.

(they can also fail fast if there is a non-crdb network service at the remote address, in which case the TLS handshake fails quickly.)
however, they will fail slowly if there is no host at the target IP address, or if a firewall rule indicates to DROP traffic to the target address/port pair.

This results in a TCP handshake that lingers for multiple seconds, while the client network stack waits for a TCP packet in response to SYN requests.

Today, the CockroachDB logic is optimized for the "fail fast" scenario, and we have virtually no testing for "fail slow" scenarios.

"Fail slow" in practice

In practice, we have anecdotal reports of customers/users who encounter performance blips and transient cluster unavailability because they encounter a "fail slow" situation.

The reason for this is that CockroachDB internally uses a timeout to detect connection errors; the timeout is set to multiple seconds, because we cannot use a small timeout (a small timeout would create spurious errors when there is a legitimate network blip, which are common in Clouds).

"Fail slow" situations arise "naturally", for example in the following circumstances:

the operator installs a firewall and mistakenly sets it to filter node-node traffic.
the operator moves a CockroachDB to a new IP address, with no host server left at the old IP address.
the k8s orchestration configuration is changed to use a new network prefix.

Strategy

We should document this difference in behavior and invite operators to actively set up their network to achieve "fail fast" in the common case.

In particular, a callout should be added to docs when migrating a node to a new machine, to keep a server listening at the previous IP address until the cluster learns of the new topology.
We should add additional testing for "fail slow" scenarios, and inventory the cases where CockroachDB currently misbehaves.
We should document the particular symptoms of encountering this issue.

gz#8203

gz#8949

Epic: CRDB-8500

Jira issue: CRDB-3869

knz commented 4 years ago

@johnrk this is the situation we talked about yesterday.

knz commented 4 years ago

@tbg I'd like to add this to "KV problems to solve" too thanks

bdarnell commented 4 years ago

the timeout is set to multiple seconds, because we cannot use a small timeout (a small timeout would create spurious errors when there is a legitimate network blip, which are common in Clouds).

The current 3s timeout is pretty conservative and could be reduced pretty safely (maybe to 1s).

Are "legitimate network blips" really that common? And even if they are, routing traffic away from the link that experienced a blip may not be a bad thing (although if we make the timeout too aggressive, we may want to make the response weaker - pick a different connection if available instead of tearing down the connection and forcing a new TCP+TLS handshake next time).

knz commented 3 years ago

A customer is reporting that they see 5-minute long latency increases when they put one of their nodes into a black hole (by removing the VM from orchestration).

knz commented 3 years ago

@bdarnell explained that we have the circuitbreaker abstraction in the source code, from this package: https://github.com/cockroachdb/circuitbreaker

However as per that package's docs:

When a circuit breaker is tripped any future calls will avoid making the remote call and return an error to the caller. In the meantime, the circuit breaker will periodically allow some calls to be tried again and will close the circuit if those are successful.

(emphasis mine)

So it is possible for some calls to fail-slow if the network takes a while to refuse the connection. We'd need to investigate whether these occasional circuitbreaker fail-slow errors can impact KV latency.

knz commented 3 years ago

Also we do not have tests about latency impact upon "network fail-slow" scenarios.

To implement such a test, we'd need to create firewall rules to simulate a network partition / black hole. Some of our Jepsen tests use similar logic internally, so we can perhaps take inspiration from that. (Note: the jepsen tests do not exercise latency, only correctness).

rafiss commented 3 years ago

I found https://github.com/cockroachdb/cockroach/pull/33282 -- it seems like we could use the work from it it to implement testing for slow networks and network black holes.

Also the discussion in https://github.com/cockroachdb/cockroach/issues/21536 is relevant and pointed me to these gRPC docs:

WaitForReady configures the action to take when an RPC is attempted on broken connections or unreachable servers. If waitForReady is false, the RPC will fail immediately. Otherwise, the RPC client will block the call until a connection is available (or the call is canceled or times out) and will retry the call if it fails due to a transient error. gRPC will not retry if data was written to the wire unless the server indicates it did not process the data. Please refer to https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md.

By default, RPCs don't "wait for ready".

tbg commented 3 years ago

related internal incidence report where this likely made things worse: https://cockroachlabs.atlassian.net/browse/SREOPS-2934

erikgrinaker commented 3 years ago

In an internal support investigation, the following issues were found to be the primary causes of prolonged unavailability in the presence of unresponsive peers:

cockroachdb / cockroach