Open irfansharif opened 1 year ago
cc @cockroachdb/replication
Looks like a couple of clear problems here: don't apply circuit breakers to destroyed replicas, and make replica GC more aggressive.
@nicktrav for visibility / prioritization of this o-support issue
Update: we're not going to get a chance to work on this in 23.2. Leaving open and in our backlog.
Hey @erikgrinaker, I'd like to help with this issue. I have a change for "don't apply circuit breakers to destroyed replicas", with some questions inside the PR. Could you help me understand "make replica GC more aggressive" in more detail? No rush on this, and thanks in advance.
Describe the problem
In an internal support case (https://github.com/cockroachlabs/support/issues/2346) we observed the following pattern:
That is, it was merged away about 13 days before its replica circuit breaker started firing. It stopped firing after a very delayed GC of the replica. It had started firing after an attempt to request a lease, which in turn happened because it was enqueued in another KV queue that required leases in order to be processed. This makes for a benign but "false positive" circuit breaker tripped error. It also increments metrics that operators use to indicate CRDB health.
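To make the first proposed fix concrete, here is a minimal sketch of "don't apply circuit breakers to destroyed replicas". It is a toy model, not CockroachDB code: the `replica` type, its fields, `reportFailure`, `checkRequest`, and both error values are hypothetical names invented for illustration. The idea is only that a replica already marked destroyed (for example because its range was merged away and it is waiting for GC) should neither trip its breaker nor bump the metric that operators read as a health signal.

```go
package main

import (
	"errors"
	"fmt"
)

// All names in this file are hypothetical and exist only for illustration;
// they are not CockroachDB's actual types or functions.

var (
	errBreakerOpen      = errors.New("replica unavailable: circuit breaker tripped")
	errReplicaDestroyed = errors.New("replica destroyed")
)

// replica is a toy stand-in for a range replica.
type replica struct {
	destroyed      bool // set once the range was merged away or removed; replica awaits GC
	breakerTripped bool
	tripCount      int // stands in for the health metric that operators watch
}

// reportFailure records a failed attempt (e.g. a lease request that could not
// complete). A destroyed replica is skipped, so it never trips the breaker
// or increments the metric.
func (r *replica) reportFailure() {
	if r.destroyed {
		return
	}
	if !r.breakerTripped {
		r.breakerTripped = true
		r.tripCount++
	}
}

// checkRequest returns the error a request to this replica would see.
// The destroyed check comes first so callers get the more accurate error.
func (r *replica) checkRequest() error {
	if r.destroyed {
		return errReplicaDestroyed
	}
	if r.breakerTripped {
		return errBreakerOpen
	}
	return nil
}

func main() {
	live := &replica{}
	merged := &replica{destroyed: true}

	// Simulate a queue asking both replicas for a lease that cannot be acquired.
	live.reportFailure()
	merged.reportFailure()

	fmt.Println("live:", live.checkRequest(), "trips:", live.tripCount)
	fmt.Println("merged:", merged.checkRequest(), "trips:", merged.tripCount)
}
```

In this sketch the live replica still trips its breaker as before, while the merged-away replica reports that it is destroyed and leaves the trip count untouched. It says nothing about the second proposal, making replica GC more aggressive.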
Jira issue: CRDB-28604
Epic CRDB-39898