cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.01k stars 3.79k forks source link

catalog/lease: avoid updating sqlliveness cannot be written or deleted #119654

Closed fqazi closed 7 months ago

fqazi commented 7 months ago

In https://github.com/cockroachdb/cockroach/issues/119347 we noticed that if writes / deletes to the system.lease keep hitting retryable replica errors we should eventually stop renewing the sqlliveness once enough time has gone by. Its possible for the system.lease table to have partition failure where the sqlliveness can be emitted, but leases cannot be modified / deleted. This could cause the entire cluster from being able to execute new schema changes, since this node will be stuck holding leases.

In the expiry leasing model this condition would clear once the lease expires, but in the session based model the lease could linger forever.

Jira issue: CRDB-36217

Epic CRDB-35665

andrewbaptist commented 7 months ago

I don't think the node should be killed as this has larger cluster impacts. Instead, it should only update sqlliveness if it is able to update all the system.lease records. If any of the updates to system.lease fail, then it should not update sqlliveness. The right model is that the liveness depends on its ability to do "everything necessary" and if any of them fail then it should also fail liveness.

For a SQL only node it might be OK to also fail the node, but preferably it should disable its sql endpoint so it is removed from the load balancer. I'm looking at Instance.extendSession and its not clear if this code is correct today. It attempts to extend the session "forever" even though it might have passed the TTL. I think it would be better to eventually "give up" if it hasn't been able to extend its session and the previous one has already expired.

With these two changes 1) Don't extend the sql session unless "everything else" succeeds. 2) Once the sql session timeout has passed, turn off the SQL interface.

It will better handle some of the partition cases.

blathers-crl[bot] commented 7 months ago

Hi @rimadeodhar, please add branch-* labels to identify which branch(es) this release-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.