
rpc: strengthen the behavior of CockroachDB in "fail slow" TCP connect timeouts #53410

Open knz opened 4 years ago

knz commented 4 years ago

The RPC layer relies on the network reporting when a TCP connection fails, to decide that the other side is unreachable.

For example, this logic is used to determine when connectivity to a node is lost, or to skip over a replica while discovering a leaseholder.

TLDR: Today, the CockroachDB logic is optimized for the "fail fast" scenario, and we have virtually no testing for "fail slow" scenarios.

Background

Depending on the network configuration, connections can fail in two ways:

- "fail fast": the peer actively refuses the connection (e.g. a TCP RST, surfacing as ECONNREFUSED), and the error is reported to the application immediately.

- "fail slow": packets are silently dropped (a network "black hole"), so the connection attempt hangs until a timeout expires.

Today, the CockroachDB logic is optimized for the "fail fast" scenario, and we have virtually no testing for "fail slow" scenarios.
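A minimal Go sketch of the two modes (the addresses are illustrative: 127.0.0.1:1 usually has no listener, and 192.0.2.1 is a reserved TEST-NET-1 address that is normally unrouted, so packets to it are typically dropped):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Fail fast: a reachable host with no listener on the port sends a
	// TCP RST, so the dial returns ECONNREFUSED almost immediately.
	start := time.Now()
	_, err := net.DialTimeout("tcp", "127.0.0.1:1", 3*time.Second)
	fmt.Printf("fail fast: err=%v after %s\n", err, time.Since(start))

	// Fail slow: a black-holed address drops packets silently, so the
	// dial blocks for the full timeout before reporting an error.
	start = time.Now()
	_, err = net.DialTimeout("tcp", "192.0.2.1:26257", 3*time.Second)
	fmt.Printf("fail slow: err=%v after %s\n", err, time.Since(start))
}
```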

"Fail slow" in practice

In practice, we have anecdotal reports of customers/users who experience performance blips and transient cluster unavailability because they run into a "fail slow" situation.

The reason for this is that CockroachDB internally uses a timeout to detect connection errors; the timeout is set to multiple seconds, because we cannot use a small timeout (a small timeout would create spurious errors when there are legitimate network blips, which are common in clouds).
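As a hedged illustration of where such a timeout sits (the dialNode helper and its 3-second budget are hypothetical, not the actual CockroachDB dial path):

```go
package rpc

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialNode sketches the tradeoff: a multi-second budget rides out
// transient blips, but a black-holed peer costs the caller the full
// timeout before any error surfaces.
func dialNode(addr string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// WithBlock makes DialContext wait until the connection is
	// established or ctx expires, so a "fail slow" peer surfaces as a
	// context deadline error only after the full 3s.
	return grpc.DialContext(ctx, addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
}
```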

"Fail slow" situations arise "naturally", for example in the following circumstances:

Strategy

gz#8203

gz#8949

Epic: CRDB-8500

Jira issue: CRDB-3869

knz commented 4 years ago

@johnrk this is the situation we talked about yesterday.

knz commented 4 years ago

@tbg I'd like to add this to "KV problems to solve" too, thanks.

bdarnell commented 4 years ago

the timeout is set to multiple seconds, because we cannot use a small timeout (a small timeout would create spurious errors when there are legitimate network blips, which are common in clouds).

The current 3s timeout is pretty conservative and could be reduced pretty safely (maybe to 1s).

Are "legitimate network blips" really that common? And even if they are, routing traffic away from the link that experienced a blip may not be a bad thing (although if we make the timeout too aggressive, we may want to make the response weaker - pick a different connection if available instead of tearing down the connection and forcing a new TCP+TLS handshake next time).

knz commented 3 years ago

A customer is reporting that they see 5-minute-long latency increases when they put one of their nodes into a black hole (by removing the VM from orchestration).

knz commented 3 years ago

@bdarnell explained that we have the circuitbreaker abstraction in the source code, from this package: https://github.com/cockroachdb/circuitbreaker

However as per that package's docs:

When a circuit breaker is tripped any future calls will avoid making the remote call and return an error to the caller. In the meantime, the circuit breaker will **periodically allow some calls to be tried again** and will close the circuit if those are successful.

(emphasis mine)

So it is possible for some calls to fail slow if the network takes a while to refuse the connection. We'd need to investigate whether these occasional circuitbreaker fail-slow errors can impact KV latency.
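A minimal sketch against that package, assuming the rubyist/circuitbreaker-style API the fork derives from (NewConsecutiveBreaker, Call with a per-call timeout); the black-holed address is illustrative:

```go
package main

import (
	"fmt"
	"net"
	"time"

	circuit "github.com/cockroachdb/circuitbreaker"
)

func main() {
	// Trip the breaker after 3 consecutive failures.
	cb := circuit.NewConsecutiveBreaker(3)

	for i := 0; i < 5; i++ {
		err := cb.Call(func() error {
			// Once tripped, most iterations skip this dial and fail
			// fast; the periodic trial calls can still "fail slow"
			// for up to the per-call timeout below.
			conn, err := net.DialTimeout("tcp", "192.0.2.1:26257", time.Second)
			if err == nil {
				conn.Close()
			}
			return err
		}, 2*time.Second)
		fmt.Printf("call %d: tripped=%v err=%v\n", i, cb.Tripped(), err)
	}
}
```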

knz commented 3 years ago

Also, we do not have tests covering the latency impact of "network fail-slow" scenarios.

To implement such a test, we'd need to create firewall rules to simulate a network partition / black hole. Some of our Jepsen tests use similar logic internally, so we can perhaps take inspiration from that. (Note: the Jepsen tests exercise only correctness, not latency.)
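As a lighter-weight alternative to firewall rules, a hypothetical Go test (not part of the CockroachDB test suite) can simulate an unresponsive peer in-process with a listener that accepts connections but never responds:

```go
package rpc_test

import (
	"net"
	"testing"
	"time"
)

// TestFailSlowLatency is a sketch: the listener below stands in for a
// peer behind a packet-dropping firewall by accepting connections and
// then never sending a byte.
func TestFailSlowLatency(t *testing.T) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	defer ln.Close()

	go func() {
		var held []net.Conn // hold connections open; never respond
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			held = append(held, c)
		}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()

	// The caller should observe "fail slow" latency: the read errors
	// only once the deadline expires, not immediately.
	if err := conn.SetReadDeadline(time.Now().Add(time.Second)); err != nil {
		t.Fatal(err)
	}
	start := time.Now()
	_, err = conn.Read(make([]byte, 1))
	if elapsed := time.Since(start); err == nil || elapsed < 900*time.Millisecond {
		t.Fatalf("expected slow failure, got err=%v after %s", err, elapsed)
	}
}
```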

rafiss commented 3 years ago

I found https://github.com/cockroachdb/cockroach/pull/33282 -- it seems like we could use the work from it to implement testing for slow networks and network black holes.

Also the discussion in https://github.com/cockroachdb/cockroach/issues/21536 is relevant and pointed me to these gRPC docs:

WaitForReady configures the action to take when an RPC is attempted on broken connections or unreachable servers. If waitForReady is false, the RPC will fail immediately. Otherwise, the RPC client will block the call until a connection is available (or the call is canceled or times out) and will retry the call if it fails due to a transient error. gRPC will not retry if data was written to the wire unless the server indicates it did not process the data. Please refer to https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md.

By default, RPCs don't "wait for ready".
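For illustration, a hedged sketch of opting in per call, using the standard gRPC health service as a stand-in for a real RPC (the checkHealth helper is hypothetical):

```go
package rpc

import (
	"context"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkHealth overrides the fail-fast default: with WaitForReady(true),
// the call blocks until the connection becomes usable or the context
// deadline expires, instead of erroring immediately on a broken
// connection.
func checkHealth(conn *grpc.ClientConn) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := healthpb.NewHealthClient(conn).Check(ctx,
		&healthpb.HealthCheckRequest{}, grpc.WaitForReady(true))
	return err
}
```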

tbg commented 3 years ago

Related internal incident report where this likely made things worse: https://cockroachlabs.atlassian.net/browse/SREOPS-2934

erikgrinaker commented 3 years ago

In an internal support investigation, the following issues were found to be the primary causes of prolonged unavailability in the presence of unresponsive peers: