Open nvanbenschoten opened 2 years ago
cc @cockroachdb/replication
Hi @nvanbenschoten! I am Anuj Diwan, a Computer Science PhD student at UT Austin. @arjunrs1 (Arjun Somayazulu) and I are taking a graduate Distributed Systems course, and for our course project we are interested in contributing to CockroachDB. This issue is related to our course material. Could we work on it? Any pointers to help us get started would be appreciated as well.
Thanks and regards, Anuj.
In https://github.com/cockroachdb/cockroach/commit/8aa1c140eef574869dc70076987a3f12e19b7c3d, we added protection against replicas that are behind on their log attempting to acquire the lease and then stalling. That commit did this by requiring lease requestors to be the Raft leader, which ensured that they were up-to-date on their log.
In https://github.com/cockroachdb/cockroach/commit/a767cdda788abbb7dd6a2e3b52df0d867f0b9bf8, we weakened this protection to work around cases where the current Raft leader could not acquire the lease. This resolved a few deadlocks, which are described in that commit message.
The deadlocks are real issues, but resolving them by ignoring the Raft leadership status in these cases is problematic. At worst, it undermines the protection granted by https://github.com/cockroachdb/cockroach/commit/8aa1c140eef574869dc70076987a3f12e19b7c3d and permits risky lease requests.
A safer alternative: when a replica determines that the Raft leader is unsuitable to hold the lease, it calls a Raft (pre-vote) election and tries to take leadership. If it succeeds, it can acquire the lease. If it fails, it wasn't a good candidate to hold the lease. This is the same strategy we employed in https://github.com/cockroachdb/cockroach/pull/87244.
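To make the proposed alternative concrete, here is a minimal Go sketch of the decision. It is not CockroachDB code: `logPos`, `wouldWinPreVote`, and `mayAcquireLease` are hypothetical names, and the election is reduced to the standard Raft log comparison (last term, then last index). It ignores other conditions a real implementation may apply before granting a pre-vote.

```go
package main

import "fmt"

// logPos is a hypothetical summary of a replica's Raft log:
// the term and index of its last entry.
type logPos struct{ term, index uint64 }

// atLeastAsUpToDate applies the standard Raft comparison:
// a higher last term wins; on equal terms, the longer log wins.
func atLeastAsUpToDate(a, b logPos) bool {
	if a.term != b.term {
		return a.term > b.term
	}
	return a.index >= b.index
}

// wouldWinPreVote is a simplified stand-in for a pre-vote election.
// A peer grants its vote only if the candidate's log is at least as
// up-to-date as its own; the candidate needs a quorum, counting itself.
func wouldWinPreVote(candidate logPos, peers []logPos) bool {
	votes := 1 // the candidate votes for itself
	for _, p := range peers {
		if atLeastAsUpToDate(candidate, p) {
			votes++
		}
	}
	quorum := (len(peers)+1)/2 + 1
	return votes >= quorum
}

// mayAcquireLease sketches the proposed rule (an assumption, not the
// actual implementation): the leader may request the lease; a follower
// may do so only if the leader is unsuitable AND the follower can win
// an election, which shows that it is not behind on its log.
func mayAcquireLease(isLeader, leaderCanHoldLease bool, self logPos, peers []logPos) bool {
	if isLeader {
		return true
	}
	if leaderCanHoldLease {
		return false // defer to the leader instead of bypassing it
	}
	return wouldWinPreVote(self, peers)
}

func main() {
	peers := []logPos{{term: 5, index: 12}, {term: 5, index: 8}}
	caughtUp := logPos{term: 5, index: 10}
	lagging := logPos{term: 5, index: 3}

	fmt.Println("caught-up follower:", mayAcquireLease(false, false, caughtUp, peers))
	fmt.Println("lagging follower:", mayAcquireLease(false, false, lagging, peers))
}
```

The point of the design is that the election itself acts as the up-to-date check: a replica that cannot gather a quorum of votes is exactly the kind of lagging replica that the original leader-only restriction was meant to keep away from the lease.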
Jira issue: CRDB-20174