Open erikgrinaker opened 1 year ago
One alternative here could be to use a lower heartbeat timeout for the SystemClass connection, which is less susceptible to network congestion, and terminate all RPC connections to the node when the system connection fails.
RPC heartbeats are important for detecting peer failures and failing over to other nodes. However, they currently need very high timeouts (6 seconds) because they can be head-of-line blocked by other RPC traffic. For example, in an experiment with a 500ms RTT cluster running a TPCC import, heartbeats would frequently hit the 6-second timeout, even though the network latency was a fraction of that.
Furthermore, on idle clusters heartbeats were occasionally seen to take 3 RTTs rather than 1, long after the connection had initially been established (which takes 3 RTTs for the handshake). The cause is unclear: packet dumps showed that the TCP connection was intact throughout, so further analysis is needed.
We should avoid head-of-line blocking and other interference with RPC heartbeats, bringing their latency closer to the basic network RTT so that we can reduce the heartbeat timeout further. This may require switching the gRPC transport to e.g. QUIC.
Jira issue: CRDB-22311