cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.02k stars 3.79k forks source link

kvserver: false positive "circuit breaker tripped" errors/metrics #104567

Open irfansharif opened 1 year ago

irfansharif commented 1 year ago

Describe the problem

In an internal support case (https://github.com/cockroachlabs/support/issues/2346) we observed the following pattern:

I230427 00:08:10.977333 1234062151 kv/kvserver/replica_command.go:698 ⋮ [n3,merge,s3,r1531/2:‹› [(n6,s6):8, (n3,s3):7, (n1,s1):3, next=9, gen=126] into this range (‹lhs+rhs has (size=0 B+0 B=0 B qps=0.90+0.26=1.16qps) below threshold (size=128 MiB, qps=1250.00)›)
I230427 00:08:11.030688 241 kv/kvserver/store_remove_replica.go:150 ⋮ [n3,s3,r1531/2:‹/Table/143/1/16823{778…-809…}›,raft] 33557 removing replica r1528/7
I230427 00:08:11.034419 249 kv/kvserver/store_remove_replica.go:150 ⋮ [n1,s1,r1531/3:‹/Table/143/1/16823{778…-809…}›,raft] 47622 removing replica r1528/3
W230510 20:16:13.487416 291 kv/kvserver/replica_raft.go:1378 ⋮ [n2,s2,r1528/5:‹›,raft] 228 have been waiting 63.00s for slow proposal RequestLease [/Table/143/1/‹›/‹›/›,/Min)
I230510 20:16:13.488216 29265 kv/kvserver/replica_circuit_breaker.go:149 ⋮ [n2,s2,r1528/5:‹› [(n2,s2):5, (n4,s4):4, (n1,s1):3, next=6, gen=114, sticky=1682537079.026979847,0]: closed timestamp: 1682538135.232405540,0 (2023-04-26 19:42:15); raft status: {"id":"5","term":7,"vote":"5","commit":115,"lead":"0","raftState":"StatePreCandidate","applied":115,"progress":{},"leadtransferee":"0"}: have been waiting 63.00s for slow proposal RequestLease [/Table/143/1/,/Min)
I230510 20:17:13.489336 29314 kv/kvserver/replica_circuit_breaker.go:149 ⋮ [n2,s2,r1528/5:‹› [(n2,s2):5, (n4,s4):4, (n1,s1):3, next=6, gen=114, sticky=1682537079.026979847,0]: closed timestamp: 1682538135.232405540,0 (2023-04-26 19:42:15); raft status: {"id":"5","term":7,"vote":"5","commit":115,"lead":"0","raftState":"StatePreCandidate","applied":115,"progress":{},"leadtransferee":"0"}: have been waiting 63.00s for slow proposal RequestLease [/Table/143/1/›,/Min))
I230510 20:18:16.489259 75566 kv/kvserver/replica_circuit_breaker.go:149 ⋮ [n2,s2,r1528/5:‹› [(n2,s2):5, (n4,s4):4, (n1,s1):3, next=6, gen=114, sticky=1682537079.026979847,0]: closed timestamp: 1682538135.232405540,0 (2023-04-26 19:42:15); raft status: {"id":"5","term":7,"vote":"5","commit":115,"lead":"0","raftState":"StatePreCandidate","applied":115,"progress":{},"leadtransferee":"0"}: ‹operation "probe" timed out after 1m0.001s (given timeout 1m0s)›: result is ambiguous: after 60.00s of attempting command: context deadline exceeded
I230510 20:52:55.489074 3560374 kv/kvserver/replica_circuit_breaker.go:149 ⋮ [n2,s2,r1528/5:‹› [(n2,s2):5, (n4,s4):4, (n1,s1):3, next=6, gen=114, sticky=1682537079.026979847,0]: closed timestamp: 1682538135.232405540,0 (2023-04-26 19:42:15); raft status: {"id":"5","term":7,"vote":"5","commit":115,"lead":"0","raftState":"StatePreCandidate","applied":115,"progress":{},"leadtransferee":"0"}: ‹operation "probe" timed out after 1m0s (given timeout 1m0s)›: result is ambiguous: after 60.00s of attempting command: context deadline exceeded
I230510 20:53:55.491270 3695755 kv/kvserver/replica_circuit_breaker.go:149 ⋮ [n2,s2,r1528/5:‹› [(n2,s2):5, (n4,s4):4, (n1,s1):3, next=6, gen=114, sticky=1682537079.026979847,0]: closed timestamp: 1682538135.232405540,0 (2023-04-26 19:42:15); raft status: {"id":"5","term":7,"vote":"5","commit":115,"lead":"0","raftState":"StatePreCandidate","applied":115,"progress":{},"leadtransferee":"0"}: ‹operation "probe" timed out after 1m0s (given timeout 1m0s)›: result is ambiguous: after 60.00s of attempting command: context deadline exceeded
I230510 20:54:57.312268 3943163 kv/kvserver/store_remove_replica.go:150 ⋮ [n2,replicaGC,s2,r1528/5:‹›] 5611 removing replica r1528/5
I230510 20:54:57.323786 3821707 kv/kvserver/replica_circuit_breaker.go:149 ⋮ [n2,s2,r1528/5:‹› [(n2,s2):5, (n4,s4):4, (n1,s1):3, next=6, gen=114, sticky=1682537079.026979847,0]: closed timestamp: 1682538135.232405540,0 (2023-04-26 19:42:15); raft status: {"id":"5","term":7,"vote":"5","commit":115,"lead":"0","raftState":"StatePreCandidate","applied":115,"progress":{},"leadtransferee":"0"}: ‹operation "probe" timed out after 1m0s (given timeout 1m0s)›: result is ambiguous: after 60.00s of attempting command: context deadline exceeded)
I230510 20:54:57.347495 3943163 kv/kvserver/replica_destroy.go:177 ⋮ [n2,replicaGC,s2,r1528/5:‹›] 5613 removed 873073 (873065+8) keys in 33ms [clear=9ms commit=24ms]

That is, it was merged away about 13 days before its replica circuit breaker starting firing. It stopped firing after a very delayed GC of the replica. It had started firing after an attempt to request a lease, which in turn happened because it was enqueued in another KV queue that required leases in order to be processed. This makes for a benign but "false positive" circuit breaker tripped error. It also increments metrics that operators use to indicate CRDB health.

Jira issue: CRDB-28604

Epic CRDB-39898

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

erikgrinaker commented 1 year ago

Looks like a couple of clear problems here: don't apply circuit breakers to destroyed replicas, and make replica GC more aggressive.

shralex commented 1 year ago

@nicktrav for visibility / prioritization of this o-support issue

nicktrav commented 1 year ago

Update: we're not going to get a chance to work on this in 23.2. Leaving open and in our backlog.

lyang24 commented 7 months ago

Looks like a couple of clear problems here: don't apply circuit breakers to destroyed replicas, and make replica GC more aggressive.

Hey @erikgrinaker, I want to help on this issue. I have a change on don't apply circuit breakers to destroyed replicas with some questions inside the pr. I wonder if you can help me understand more about make replica GC more aggressive. with more details. No rush on this and thanks in advance.