tbg opened this issue 2 years ago
cc @mwang1026 heads up. We originally attempted to address this problem with circuit breakers too, but refocused on loss of quorum because that allows us to do a lot better for follower reads (not blocking them when replication is down). Having typed out this issue, it also doesn't seem obvious what the solution to the general problem of moving traffic off a "stuck replica" is; at the least, deadlock mitigation seems really thorny and wouldn't be handled satisfactorily by a circuit breaker.
The current behavior, in which a mutex deadlock spreads until it stalls processing across an entire node or even an entire cluster, all while remaining entirely opaque to a would-be debugger, is terrible. And yet gracefully living with, or cordoning off, mutex deadlocks seems like a very hard problem. Depending on which mutex deadlocks, it's difficult to understand the full scope of the operations that will also transitively get caught up in it. It's also not clear what the best recourse is for each of these operations to recover, meaning that generalizing this would be a lot of work and the solution would still likely be limited to specific mutexes.
Have we considered a less graceful means of detecting and limiting the blast radius of mutex deadlocks? For instance, assuming we could detect mutex deadlocks without false positives, crashing the node that hit the deadlock (with sufficient debug information to help engineers diagnose the situation after the fact) would be a step in the right direction.
https://github.com/cockroachdb/cockroach/issues/66765 comes to mind. If we tracked our mutexes and periodically checked that each can be acquired within a reasonable delay (for example, 10s), we would get very close.
I share your concerns about a generalized solution via the circuit breaker.
Instead of requiring active cooperation from a faulty node, the DistSender and lease protocol should be robust to faulty replicas. This will be handled by expiration-based leases and DistSender lease detection and request redirection (https://github.com/cockroachdb/cockroach/issues/105168).
I'll leave this open in case that doesn't pan out, or we need this for other reasons.
**Is your feature request related to a problem? Please describe.**
The work in #33007 has given us good blast radius mitigations should a replica be unable to serve requests as a result of a loss of quorum. However, a replica can also become unavailable for other reasons, the most drastic of them being an inability to acquire a given mutex (e.g. a deadlock), but there could be others too.
**Describe the solution you'd like**
We could add a circuit breaker at the top of `Replica.Send` and trip it appropriately when the replica is "stuck" (which we would also need a way to detect).

**Additional context**
https://github.com/cockroachdb/cockroach/pull/72092 could be helpful to determine when to trip. Also, I want to point out that if any replica mutex actually deadlocks, this will likely deadlock the entire store and then the node, so doing this project to specifically address deadlocks is likely not a good use of our time. However, the circuit breaker proposed here could play a part in https://github.com/cockroachdb/cockroach/issues/75944.
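For concreteness, a toy version of a breaker gating `Replica.Send`; the `Breaker` type and the trip reason are invented for illustration and bear no relation to CockroachDB's actual breaker implementation:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Breaker is a trivial trip/reset flag. A real breaker would also
// probe for recovery and carry structured error details.
type Breaker struct {
	tripped atomic.Bool
	reason  atomic.Value // string
}

func (b *Breaker) Trip(reason string) {
	b.reason.Store(reason)
	b.tripped.Store(true)
}

func (b *Breaker) Err() error {
	if b.tripped.Load() {
		return fmt.Errorf("replica unavailable: %v", b.reason.Load())
	}
	return nil
}

// Replica holds the breaker that gates its request path.
type Replica struct{ breaker Breaker }

// Send fails fast when the breaker is tripped, instead of letting
// the request queue behind a mutex that may never be released.
func (r *Replica) Send(req string) (string, error) {
	if err := r.breaker.Err(); err != nil {
		return "", err
	}
	return "applied: " + req, nil // stand-in for the real request path
}

func main() {
	var r Replica
	resp, _ := r.Send("put k=v")
	fmt.Println(resp)

	r.breaker.Trip("mutex held > 10s")
	_, err := r.Send("put k=v2")
	fmt.Println(err)
}
```

The fast failure only helps requests that haven't yet entered the stuck code path, which is exactly the limitation noted above: requests already queued on a deadlocked mutex stay stuck, so this is a containment measure, not a deadlock fix.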
Jira issue: CRDB-13552