cockroachdb / cockroach


syncutil: watch mutexes for deadlock #66765

Open tbg opened 3 years ago

tbg commented 3 years ago

Is your feature request related to a problem? Please describe.

We just dealt with a deadlock due to a re-entrant RLock acquisition (https://github.com/cockroachdb/cockroach/pull/66760). Deadlocks should be weeded out by testing, no question. But we should also expose them when they happen, as restarting the node on which they occur can be an effective mitigation tool.

Describe the solution you'd like

Bare bones

Enhanced

We could give each mutex a static name (i.e. one name shared across all Replica.mu instances, for example) and maintain, in addition to the above, a family of histograms of Lock() latency, one per name. The semantics are somewhat tricky: we want to make sure that the histogram still gets populated even if Lock() deadlocks; we can do something here, details TBD. When a mutex deadlocks, metrics would then show us which one. We probably don't want to hard-code the mutex names into timeseries names, which is what our internal timeseries system would require, so this would be a prometheus-only metric. There is precedent for that in the tenant metrics, so this isn't new.
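
To make the shape concrete, here is a minimal sketch of such a named-mutex wrapper. This is not the actual syncutil API; the NamedMutex type, the metric name, the buckets, and the 10-second threshold are all made up, and it assumes the prometheus client library. The time.AfterFunc watcher is one way to get the "populate the histogram even if Lock() deadlocks" semantics:

package syncutil

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// lockWait is a prometheus-only histogram family keyed by mutex name.
var lockWait = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "syncutil_mutex_lock_wait_seconds",
	Help:    "Time spent waiting in Lock(), by mutex name.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 16), // ~1ms up to ~32s
}, []string{"name"})

func init() { prometheus.MustRegister(lockWait) }

// NamedMutex shares one static name across all instances of a kind,
// e.g. "replica_mu" for every Replica.mu.
type NamedMutex struct {
	name string
	mu   sync.Mutex
}

func MakeNamedMutex(name string) NamedMutex { return NamedMutex{name: name} }

func (m *NamedMutex) Lock() {
	start := time.Now()
	// If Lock() doesn't return within the threshold, record the wait anyway so
	// that a deadlocked mutex still shows up when the endpoint is scraped.
	watcher := time.AfterFunc(10*time.Second, func() {
		lockWait.WithLabelValues(m.name).Observe(time.Since(start).Seconds())
	})
	m.mu.Lock()
	if watcher.Stop() {
		// The watcher didn't fire; record the normal acquisition latency.
		lockWait.WithLabelValues(m.name).Observe(time.Since(start).Seconds())
	}
}

func (m *NamedMutex) Unlock() { m.mu.Unlock() }

Arming a timer on every Lock() is presumably too expensive for hot mutexes like Replica.mu, so a real implementation would have to amortize that; the sketch only shows where the latency observation and the deadlock-time observation would hook in.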

Describe alternatives you've considered

There are many variations on the above. We just need to pick one and do it.

Additional context

See #66760

Jira issue: CRDB-8219

joshimhoff commented 3 years ago

This is a great idea for a time-series metric!!

We just dealt with a deadlock due to a re-entrant RLock acquisition (#66760). Deadlocks should be weeded out by testing, no question. But we should also expose them when they happen, as restarting the node on which they occur can be an effective mitigation tool.

Let's say the long-held mutex is making a single range unavailable. Can we find a way to page on the range being unavailable? I'd rather page on a bad symptom than on a mutex being held too long (we could have a lower-priority alert for the mutex being held too long). kvprober would eventually hit the problem range, but that takes a long time depending on cluster size, as you know. Although I sometimes wonder if we should just probe a lot, esp. in the serverless setting where CRL owns the host cluster rather than the customer.

Can we use a mutex being held too long as a hint to initiate probing? Sounds hard / wrong, as then you need to be able to tie the mutex to a range... Would the stuck applied index alert fire in case of https://github.com/cockroachdb/cockroach/pull/66760? Or would the watch-for-stuck-applied-index code also be stuck due to the deadlock?

tbg commented 3 years ago

Would the stuck applied index alert fire in case of #66760?

It should, if we write the code in the right way. Mutex deadlocks are particularly pernicious. They are a low-level failure that should basically never occur, their fallout is really hard to control, and they are not recoverable (there isn't even a way to "try to acquire the mutex, but stop trying if it takes too long"). So whatever control mechanism we have, if it at any point touches the mutex in a blocking way, it itself gets stuck and won't report. You always need to offload the mutex-touching to a goroutine and then have a controller goroutine wait for that goroutine to indicate success, which is also the pattern I suggested above. But then a deadlock in the right mutex will prevent you from scraping the prometheus endpoint (populating the prometheus metrics touches the replica mutex, oops, though we can fix that too using the same pattern), so the whole idea may not be too useful in practice, at least not if we're primarily doing it for the replica mutexes.
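
For concreteness, a minimal sketch of that offloading pattern (names are made up, not actual CockroachDB code): only the probe goroutine blocks on the mutex, so the controller can still report a timeout even when the probe is stuck forever.

package main

import (
	"fmt"
	"sync"
	"time"
)

// canLock reports whether mu could be acquired within timeout. If it can't,
// the probe goroutine stays blocked (and leaks) until the mutex is released;
// there is no way in Go to abandon a pending Lock().
func canLock(mu *sync.RWMutex, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		mu.Lock()
		mu.Unlock()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	var mu sync.RWMutex
	mu.RLock() // stand-in for a reader that never releases, as in the deadlock above
	if !canLock(&mu, time.Second) {
		fmt.Println("mutex looks stuck: pull stacks, consider cycling the node")
	}
}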

I tend to think that it's not worth trying to let these metrics delve into debugging too much. This creates lots of engineering problems that don't pay off given how unique each deadlock scenario is. The key point is realizing quickly that there is one so that we can react by pulling stacks & cycling the node. RCA comes later, and it seldom matters which replica was hit by the deadlock as it can usually occur on any replica in general.

joshimhoff commented 3 years ago

It should, if we write the code in the right way.

Ack & thanks for the color. Also, maybe the proposed slow-mutex metric will have a high enough signal-to-noise ratio that we can page on it.

I tend to think that it's not worth trying to let these metrics delve into debugging too much.

Mhm +1.

tbg commented 3 years ago

We had another incident of a known Replica.mu deadlock today (user on 21.1.3, fixed in 21.1.4). These are really bad. The node will limp along in unpredictable ways. In the instance today, all raft scheduler goroutines got glued to the mutex so the node was not meaningfully participating in replication any more. On top of detecting such issues, nodes should consider crashing, at least when some mutexes seem to be deadlocked.
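
As a rough sketch of what "crash when a mutex seems deadlocked" could look like (the watchedMutex wrapper, the one-second poll, and the bare log.Fatalf are placeholders; a real version would need to avoid false positives on mutexes that are legitimately held for a long time, e.g. raftMu across IO):

package watchdog

import (
	"log"
	"sync"
	"sync/atomic"
	"time"
)

type watchedMutex struct {
	mu         sync.Mutex
	lockedAtNs atomic.Int64 // unix nanos of acquisition; 0 while unlocked
}

func (m *watchedMutex) Lock() {
	m.mu.Lock()
	m.lockedAtNs.Store(time.Now().UnixNano())
}

func (m *watchedMutex) Unlock() {
	m.lockedAtNs.Store(0)
	m.mu.Unlock()
}

// watch terminates the process if the mutex has been held continuously for
// longer than maxHold. Run it in a background goroutine per watched mutex.
func watch(m *watchedMutex, maxHold time.Duration) {
	for range time.Tick(time.Second) {
		if at := m.lockedAtNs.Load(); at != 0 && time.Since(time.Unix(0, at)) > maxHold {
			log.Fatalf("mutex held for over %s; assuming deadlock, crashing node", maxHold)
		}
	}
}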

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/test-eng

tbg commented 1 year ago

Some related POC code, which I used in a repro attempt for a deadlock (the deadlock never reproed), is here. The important bit of the approach taken in that code is that we need to know the stack of the caller when we detect the slow mutex, which isn't always the case:

r.mu.Lock()
if foo() {
	return // oops: leaked mutex; the holder's goroutine is gone, and its stack with it
}
r.mu.Unlock()

For lots of hot mutexes the approach in #106254 is probably too costly, but maybe not, and it certainly isn't too costly for all of them. For example, for raftMu it's likely fine, since raftMu is held across IO anyway in the common case.
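
One way to have the holder's stack available at detection time (a guess at the shape of the idea, not necessarily what #106254 does) is to capture the caller's stack on every Lock() and stash it in the wrapper; that per-acquisition capture is exactly the cost that matters for hot mutexes but not for raftMu:

package syncutil

import (
	"runtime/debug"
	"sync"
	"sync/atomic"
)

// TracedMutex remembers the stack of its current holder so that a slow-mutex
// watcher has something useful to log when it suspects a deadlock.
type TracedMutex struct {
	mu     sync.Mutex
	holder atomic.Pointer[[]byte] // stack of the current holder; nil while unlocked
}

func (m *TracedMutex) Lock() {
	m.mu.Lock()
	stack := debug.Stack() // captured on every acquisition; this is the hot-path cost
	m.holder.Store(&stack)
}

func (m *TracedMutex) Unlock() {
	m.holder.Store(nil)
	m.mu.Unlock()
}

// HolderStack returns the stack of the goroutine currently holding the mutex,
// or nil if it isn't held. A watcher that sees Lock() being slow would log this.
func (m *TracedMutex) HolderStack() []byte {
	if s := m.holder.Load(); s != nil {
		return *s
	}
	return nil
}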