Open erikgrinaker opened 1 year ago
QUIC or anything with FEC would be good. KCP is worth considering for heartbeat messages, too.
@sean- can I take this up?
Hi @ameya-deshmukh, thanks for offering! This change looks quite large and it's not clear exactly what we want to do here yet. Can I suggest https://github.com/cockroachdb/cockroach/issues/103839 as a good first issue for getting familiar with the CockroachDB code base?
For sure! Let me get started on it.
CockroachDB is fairly vulnerable to packet loss. To a large extent, we inherit this vulnerability from TCP: its in-order delivery guarantee means that a single lost packet stalls the entire stream until the packet is retransmitted. This is amplified by gRPC and HTTP/2 multiplexing several logical streams onto the same TCP connection, such that all streams stall if any one of them experiences packet loss.
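The head-of-line blocking above can be sketched in miniature. This is a hypothetical in-memory model (not CockroachDB or gRPC code): two logical streams share one strictly-ordered transport, and a single gap in the sequence stalls frames for both streams until the retransmit arrives.

```go
package main

import "fmt"

// frame is a transport-level frame carrying data for one logical stream.
type frame struct {
	seq    int    // transport sequence number (TCP byte-stream analogue)
	stream string // logical stream ID (HTTP/2 stream analogue)
}

// receiver delivers frames strictly in sequence order, like TCP: a gap in the
// sequence stalls everything behind it, regardless of which logical stream
// the buffered frames belong to.
type receiver struct {
	next   int
	buffer map[int]frame
}

func (r *receiver) receive(f frame, delivered *[]frame) {
	r.buffer[f.seq] = f
	for {
		f, ok := r.buffer[r.next]
		if !ok {
			return // gap: later frames of *all* streams wait here
		}
		delete(r.buffer, r.next)
		r.next++
		*delivered = append(*delivered, f)
	}
}

func main() {
	r := &receiver{buffer: map[int]frame{}}
	var delivered []frame
	// Frame 1 (stream A) is lost in transit; frames 2-3 (stream B) arrive fine.
	r.receive(frame{0, "A"}, &delivered)
	r.receive(frame{2, "B"}, &delivered)
	r.receive(frame{3, "B"}, &delivered)
	fmt.Println("before retransmit:", delivered) // only frame 0; stream B is stalled
	r.receive(frame{1, "A"}, &delivered)         // retransmit fills the gap
	fmt.Println("after retransmit:", delivered)  // everything flows again
}
```

QUIC avoids this by retransmitting per-stream, which is why a move to QUIC (discussed below) would confine the stall to the stream that actually lost data.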
A related problem is that under severe packet loss (e.g. 30%), RPC connections will keep flapping: with sufficient packet loss, the RPC heartbeats will close the connection after a 6-second timeout, failing over to other nodes. However, if the packet loss is not severe enough to prevent future dials from succeeding, we'll shortly re-establish the connection only to hit more instability and eventually another 6-second timeout. Rinse and repeat, causing continued unavailability.
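A back-of-envelope model makes the flapping plausible. The numbers below are assumptions for illustration (packets per heartbeat, attempts per timeout window), not measured CockroachDB behavior, and the model ignores TCP retransmission (which helps) and correlated loss bursts (which hurt):

```go
package main

import (
	"fmt"
	"math"
)

// surviveProb returns the probability that at least one of n heartbeat
// attempts within the timeout window completes, when each attempt needs k
// packets to round-trip and each packet is independently lost with
// probability p: 1 - (1 - (1-p)^k)^n.
func surviveProb(p float64, k, n int) float64 {
	attemptOK := math.Pow(1-p, float64(k))
	return 1 - math.Pow(1-attemptOK, float64(n))
}

func main() {
	// Assume 2 packets per heartbeat and 3 attempts per 6-second window.
	for _, p := range []float64{0.05, 0.30, 0.50} {
		fmt.Printf("loss=%.0f%%: window survival=%.3f\n", p*100, surviveProb(p, 2, 3))
	}
}
```

Even a modest per-window failure probability means an unstable connection will trip a 6-second timeout every few minutes, then redial successfully, repeating indefinitely.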
We should:
- Add test suites for varying degrees of packet loss, with appropriate pass criteria (these will currently always fail with intermediate amounts of loss).
- Consider hedging reads across multiple replicas: https://github.com/cockroachdb/cockroach/issues/109320
- Consider moving to QUIC once gRPC supports it, which would limit the impact of packet loss to the affected streams. Requires upstream gRPC support, see https://github.com/grpc/grpc/issues/19126.
- Monitor RPC connections for flapping, and hard-fail them until the flapping resolves. The circuit breaker infrastructure needed for this was mostly added for 23.2 in #99191. However, the heuristics here can be tricky: in small clusters that require the unstable connection to maintain quorum, this may make a bad problem worse by effectively taking the entire cluster offline.
Somewhat related to #93397.
Jira issue: CRDB-31268