cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: improve severe packet loss handling #110099

Open erikgrinaker opened 1 year ago

erikgrinaker commented 1 year ago

CockroachDB is fairly vulnerable to packet loss. To a large extent, we inherit this vulnerability from TCP: its in-order delivery guarantee means that a single lost packet will stall the entire stream. This is amplified by gRPC and HTTP/2 multiplexing several logical streams onto the same TCP connection, such that all streams stall if a single stream experiences packet loss.
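To make the head-of-line blocking concrete, here is a minimal sketch of in-order delivery over one shared connection. It is a toy model, not CockroachDB or gRPC code: the `packet` type, the `deliverInOrder` helper, and all timings are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// packet is one segment on a shared connection (hypothetical type).
// stream is the logical stream it belongs to; arrival is when it reaches
// the receiver, in milliseconds. A lost packet arrives late, after it has
// been retransmitted.
type packet struct {
	stream  int
	arrival int
}

// deliverInOrder models TCP-style in-order delivery: a packet is handed to
// the application only once every earlier packet on the connection has
// arrived, so each delivery time is the running maximum of arrival times.
func deliverInOrder(pkts []packet) []int {
	delivered := make([]int, len(pkts))
	latest := 0
	for i, p := range pkts {
		if p.arrival > latest {
			latest = p.arrival
		}
		delivered[i] = latest
	}
	return delivered
}

func main() {
	// Two logical streams multiplexed onto one connection. The second
	// packet (stream 1) is lost and only arrives at 200ms (assumed value).
	pkts := []packet{
		{stream: 1, arrival: 10},
		{stream: 1, arrival: 200}, // lost, then retransmitted
		{stream: 2, arrival: 12},
		{stream: 2, arrival: 13},
	}
	for i, t := range deliverInOrder(pkts) {
		fmt.Printf("stream %d: arrived %dms, delivered %dms\n",
			pkts[i].stream, pkts[i].arrival, t)
	}
}
```

In this model, stream 2's packets arrive at 12ms and 13ms but are not delivered until 200ms, even though stream 2 itself lost nothing.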

A related problem is that under severe packet loss (e.g. 30%), RPC connections will keep flapping -- with sufficient packet loss, the RPC heartbeats will close the connection after a 6-second timeout, failing over to other nodes. However, if the packet loss is not severe enough to prevent future dials from succeeding, we'll shortly re-establish the connection, only to hit more instability and eventually another 6-second timeout. Rinse and repeat, causing continued unavailability.

We should:

Somewhat related to #93397.

Jira issue: CRDB-31268

sean- commented 1 year ago

QUIC or anything with FEC would be good. KCP is worth considering for heartbeat messages, too.

ameya-deshmukh commented 1 year ago

@sean- can I take this up?

nvanbenschoten commented 1 year ago

Hi @ameya-deshmukh, thanks for offering! This change looks quite large and it's not clear exactly what we want to do here yet. Can I suggest https://github.com/cockroachdb/cockroach/issues/103839 as a good first issue for getting familiar with the CockroachDB code base?

ameya-deshmukh commented 1 year ago

For sure! Let me get started on it.