erikgrinaker opened this issue 2 years ago
cc @cockroachdb/replication
> a non-cooperative lease transfer can happen without disrupting other leases
Can you explain a bit more about how things should work and how this problem may manifest? We had an issue in one of our clusters where the liveness leaseholder lost network connectivity, which seemed to cause the entire cluster to lock up and stop accepting any DB connections until the node that held the liveness lease was restarted. The entire cluster was unusable for over 24 hours (this was a dev cluster, so it was not immediately noticed).
Practically all leases in the system are what's called epoch-based leases, and these are tied to node heartbeats. If a node fails to heartbeat for some time, it loses all of its leases. Node heartbeats are essentially a write to the liveness range. If the liveness range leaseholder is lost, the liveness range will be unavailable until the lease interval (~10s) expires. During this time, all node heartbeats will fail, which may cause other nodes to lose their leases as well.
This will manifest as many or most leases in the system being invalidated (causing range unavailability) following loss of the liveness leaseholder. The unavailability lasts until the liveness lease is reacquired, at which point other nodes can acquire the remaining invalid leases -- ideally about 10 seconds, but it can take longer due to various other interactions.
This issue is specifically about avoiding this unavailability blip, by ensuring the liveness lease can be reacquired fast enough that other nodes won't lose their leases in the meanwhile. The problem you're describing sounds different, in that the outage persisted. It may, for example, be related to partial network partitions or node unresponsiveness, which we've seen cause these symptoms:
I see that we have an RCA in progress for this outage. That should shed some light on the specific failure mode here.
Thank you for the links. It does seem like the symptoms of our outage are more closely aligned with what you've posted. We did get a couple log messages on the node that was the liveness lease holder about disk stall problems right before this issue manifested so maybe that's what we've hit.
We did some experiments over in #93073, and this probably isn't viable because we'd have to drop the Raft election timeout extremely low -- so low that it'd likely destabilize multiregion clusters with high latencies.
We should consider other approaches to avoiding the impact of liveness range unavailability, e.g. storing node liveness info (or rather, coalesced lease extensions) somewhere else, possibly sharded.
If the liveness range leaseholder is lost, the range may be unavailable for long enough that all other leaseholders also lose their epoch-based lease, since they all have the same lease expiration time. We should use a shorter lease expiration interval for the liveness range, to ensure that in the typical case, a non-cooperative lease transfer can happen without disrupting other leases.
Relates to #41162. Jira issue: CRDB-19826
Epic CRDB-40200