cockroachdb / cockroach


kvserver: use shorter lease expiration for liveness range #88443

Open · erikgrinaker opened 2 years ago

erikgrinaker commented 2 years ago

If the liveness range leaseholder is lost, the range may be unavailable for long enough that all other leaseholders also lose their epoch-based leases, since those leases all share the same expiration time. We should use a shorter lease expiration interval for the liveness range to ensure that, in the typical case, a non-cooperative lease transfer can happen without disrupting other leases.

Relates to #41162. Jira issue: CRDB-19826

Epic CRDB-40200
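As a rough illustration of the proposal (not actual kvserver code; the constants and names below are assumptions), the change amounts to picking a shorter lease duration whenever the range in question is the liveness range:

```go
// Hypothetical sketch of the idea above: use a shorter lease duration for
// the liveness range so its lease can be re-acquired before the epoch-based
// leases that depend on node heartbeats start to expire. The constants and
// function are illustrative, not CockroachDB's actual settings or API.
package main

import (
	"fmt"
	"time"
)

const (
	defaultRangeLeaseDuration = 9 * time.Second // assumed default lease duration
	livenessLeaseFraction     = 0.5             // hypothetical reduction applied only to the liveness range
)

// leaseDurationFor returns the lease duration to use for a range, applying
// the shorter interval only when the range is the liveness range.
func leaseDurationFor(isLivenessRange bool) time.Duration {
	if isLivenessRange {
		return time.Duration(float64(defaultRangeLeaseDuration) * livenessLeaseFraction)
	}
	return defaultRangeLeaseDuration
}

func main() {
	fmt.Println("liveness range lease duration:", leaseDurationFor(true))
	fmt.Println("ordinary range lease duration:", leaseDurationFor(false))
}
```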

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/replication

jjathman commented 1 year ago

> a non-cooperative lease transfer can happen without disrupting other leases

Can you explain a bit more about how this is supposed to work and how the problem can manifest? We had an issue in one of our clusters where the liveness leaseholder lost network connectivity, which seemed to cause the entire cluster to lock up and stop accepting any DB connections until the node that held the liveness lease was restarted. The entire cluster was unusable for over 24 hours (this was a dev cluster, so it was not immediately noticed).

erikgrinaker commented 1 year ago

Practically all leases in the system are what's called epoch-based leases, and these are tied to node heartbeats. If a node fails to heartbeat for some time, it loses all of its leases. Node heartbeats are essentially a write to the liveness range. If the liveness range leaseholder is lost, the liveness range will be unavailable until the lease interval (~10s) expires. During this time, all node heartbeats will fail, which may cause other nodes to lose their leases as well.

This manifests as many or most leases in the system being invalidated (causing range unavailability) following loss of the liveness leaseholder. The unavailability lasts until the liveness lease is reacquired, at which point other nodes can acquire the remaining invalid leases. Ideally this takes about 10 seconds, but it can take longer due to various other interactions.
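To make the timing concrete, here is a back-of-the-envelope sketch in Go. The constants (liveness record TTL, heartbeat interval, liveness lease expiration) are assumptions for illustration, not CockroachDB's exact settings:

```go
// Rough sketch of the cascade described above: if the liveness range is
// unavailable for longer than the remaining validity of a node's liveness
// record, that node's epoch-based leases can expire before it manages to
// heartbeat again.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		livenessRecordTTL   = 9 * time.Second        // how long a node heartbeat stays valid (assumed)
		heartbeatInterval   = livenessRecordTTL / 2  // nodes re-heartbeat at roughly half the TTL (assumed)
		livenessLeaseExpiry = 10 * time.Second       // time until the liveness range's lease expires (assumed)
	)

	// Worst case: a node heartbeated just before the liveness leaseholder was
	// lost, so its record is valid for the full TTL. Best case: it was about
	// to heartbeat, so only ~(TTL - heartbeatInterval) remains.
	worstRemaining := livenessRecordTTL
	bestRemaining := livenessRecordTTL - heartbeatInterval

	fmt.Println("liveness range unavailable for ~", livenessLeaseExpiry)
	fmt.Println("node liveness records remain valid for", bestRemaining, "to", worstRemaining)

	if livenessLeaseExpiry > bestRemaining {
		fmt.Println("=> some epoch-based leases expire before heartbeats resume,")
		fmt.Println("   invalidating leases across the cluster until they are re-acquired")
	}
}
```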

This issue is specifically about avoiding that unavailability blip by ensuring the liveness lease can be reacquired fast enough that other nodes won't lose their leases in the meantime. The problem you're describing sounds different, in that the outage persisted. It may, for example, be related to partial network partitions or node unresponsiveness, which we've seen cause these symptoms:

I see that we have an RCA in progress for this outage. That should shed some light on the specific failure mode here.

jjathman commented 1 year ago

Thank you for the links. It does seem like the symptoms of our outage are more closely aligned with what you've posted. We did get a couple of log messages on the node that held the liveness lease about disk stall problems right before the issue manifested, so maybe that's what we hit.

erikgrinaker commented 1 year ago

We did some experiments over in #93073, and this probably isn't viable because we'd have to drop the Raft election timeout extremely low -- so low that it'd likely destabilize multiregion clusters with high latencies.
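For context on the tradeoff, here is a small sketch of the constraint as I understand it: a shorter liveness lease only helps if a new Raft leader (and hence leaseholder) can be elected within that window, but the election timeout has to stay comfortably above cross-region round-trip times. All numbers below are illustrative assumptions, not CockroachDB defaults:

```go
// Sketch of the constraint discussed above. The election timeout that would
// fit inside a much shorter liveness lease ends up too aggressive for
// clusters with high inter-region latency.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		targetLivenessLeaseExpiry = 3 * time.Second                 // hypothetical shortened lease
		electionTimeoutBudget     = targetLivenessLeaseExpiry / 2   // leave room for the lease request itself
		multiRegionRTT            = 300 * time.Millisecond          // worst-case cross-region RTT (assumed)
		safetyMultiple            = 10                              // election timeout should span many RTTs (rule of thumb)
	)

	minSafeElectionTimeout := time.Duration(safetyMultiple) * multiRegionRTT
	fmt.Println("election timeout budget:", electionTimeoutBudget)
	fmt.Println("minimum safe election timeout:", minSafeElectionTimeout)

	if electionTimeoutBudget < minSafeElectionTimeout {
		fmt.Println("=> the required election timeout is too low for high-latency multiregion clusters")
	}
}
```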

We should consider other approaches to avoiding the impact of liveness range unavailability, e.g. storing node liveness info (or rather, coalesced lease extensions) somewhere else, possibly sharded.
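As a very rough sketch of what "sharded" could mean here (entirely hypothetical, not an existing CockroachDB mechanism), node liveness records could be spread across several ranges so that losing a single leaseholder only delays heartbeats for a subset of nodes:

```go
// Minimal sketch of the sharded-liveness idea floated above: map each node's
// liveness record to one of N shards (each backed by its own range), so a
// single lost leaseholder stalls heartbeats for only ~1/N of the nodes.
package main

import "fmt"

const numLivenessShards = 4 // assumed shard count

// livenessShardFor maps a node ID to the shard that stores its liveness record.
func livenessShardFor(nodeID int) int {
	return nodeID % numLivenessShards
}

func main() {
	for nodeID := 1; nodeID <= 8; nodeID++ {
		fmt.Printf("node %d heartbeats shard %d\n", nodeID, livenessShardFor(nodeID))
	}
}
```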