cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

liveness: medium-probability global ~5s cluster unavailability if liveness range leaseholder dies #41162

Open knz opened 5 years ago

knz commented 5 years ago

Describe the problem

Take a 3+ node cluster subject to load (e.g. kv or tpcc).

While the workload is running, randomly kill (rather than gracefully quit) a node that the client is not connected to, then restart it.

Most of the time, the workload continues unaffected, or with a small throughput dip. Sometimes, however (with a likelihood that decreases with the number of nodes), QPS drops to zero for about 5 seconds, then resumes.

When this happens, it seems to be correlated with the following entries in the logs of other nodes:

I190911 12:40:08.053787 317 storage/node_liveness.go:836  [n3,liveness-hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (error=unable to dial n2: breaker open [exhausted])
W190911 12:40:08.053807 317 storage/node_liveness.go:559  [n3,liveness-hb] slow heartbeat took 4.5s
W190911 12:40:08.053832 317 storage/node_liveness.go:484  [n3,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s

The hypothesis is that if the leaseholder for the liveness range (rL) dies, and the lease on another, unrelated range (rX) expires on a different node nX, then nX needs a live liveness record to request a new lease; that request fails for about 5 seconds, until rL finds a new leaseholder.

If rX is the meta range or a system range (namespace, etc.), the unavailability can become global.
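
For illustration, here is the hypothesized timeline as a toy Go program. The durations and the simple "rL recovers once another replica takes the lease" model are assumptions made for this sketch only, not CockroachDB internals.

package main

import (
    "fmt"
    "time"
)

func main() {
    // Illustrative durations; t=0 is the moment the rL leaseholder is killed.
    rLRecovery := 5 * time.Second     // time until another replica of rL holds the liveness range lease
    rXNeedsLeaseAt := 1 * time.Second // time at which some unrelated range rX needs a new lease on node nX

    // nX cannot acquire the lease for rX without a live liveness record,
    // and it cannot update that record until rL has a leaseholder again.
    rXUnavailable := rLRecovery - rXNeedsLeaseAt
    fmt.Printf("rX unavailable for ~%s; if rX is a meta or system range, the outage is effectively global\n",
        rXUnavailable)
}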

Some example runs with attached artifacts:

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-g/20190911124637/index.html

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-g/20190911130744/index.html

http://shakespeare-artifacts.crdb.io/public/201909-report/tpcc-small-h/20190911140000/index.html

(many more available)

cc @tbg

Jira issue: CRDB-5485

knz commented 5 years ago

@tbg let me know if I remember this right: your proposed solution was to tweak the lease expiration delay in relation to the liveness record expiration delay. What was the ratio you were thinking of again?

knz commented 5 years ago

Another suspicious 5-second log interval:

I190911 13:59:07.475858 196 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to resolve n3: unable to look up descriptor for n3
I190911 13:59:07.475875 196 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190911 13:59:07.669786 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to resolve n3: unable to look up descriptor for n3
I190911 13:59:07.669800 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190911 13:59:07.669881 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc [::]:26257 [n4] tripped: failed to resolve n4: unable to look up descriptor for n4
I190911 13:59:07.669907 197 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n4] event: BreakerTripped
I190911 13:59:08.130072 54 storage/node_liveness.go:474  [n2,liveness-hb] heartbeat failed on epoch increment; retrying
I190911 13:59:08.263502 412 storage/node_liveness.go:790  [n2,s2,r3/2:/System/{NodeLive…-tsd}] incremented n3 liveness epoch to 6
I190911 13:59:09.585542 164 gossip/gossip.go:1531  [n2] node has connected to cluster via gossip
I190911 13:59:09.585655 164 storage/stores.go:259  [n2] wrote 1 node addresses to persistent storage
I190911 13:59:11.333594 86 storage/stores.go:259  [n2] wrote 2 node addresses to persistent storage
I190911 13:59:11.408464 198 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerReset
I190911 13:59:12.596335 263 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc [::]:26257 [n3] event: BreakerReset
I190911 13:59:13.279968 218 storage/queue.go:518  [n2,s2,r3/2:/System/{NodeLive…-tsd}] rate limited in MaybeAdd (raftlog): context canceled
irfansharif commented 4 years ago

@knz: your hypothesis sounds right, this is (unfortunately) a very possible blip that would occur in that situation. It's not clear to me what can immediately be done about it however. No matter what the ratio between lease expiration and liveness record expiration, the simultaneous expiration could still very well happen at ~ the same point in time, no?

knz commented 4 years ago

Yes, as long as the protocol is unchanged, the fault remains.

A different protocol might be an option; perhaps pre-elect the next leaseholder ahead of time.

tbg commented 3 years ago

> @knz: your hypothesis sounds right, this is (unfortunately) a very possible blip that would occur in that situation. It's not clear to me what can immediately be done about it however. No matter what the ratio between lease expiration and liveness record expiration, the simultaneous expiration could still very well happen at ~ the same point in time, no?

No, because the liveness record is heartbeated ahead of its expiration. I think the liveness expiration is 9s, and after 4.5s nodes will try to extend it. So let's say the lease duration on the liveness range is 3s. In the "worst case" (excluding network latencies, etc., which make this a little worse), a node wants to refresh its liveness record when it still has 4.5s left on the clock; the liveness range is down for the first 3s of that and then recovers. That leaves 1.5s for the node to heartbeat its liveness before it goes dark.
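
Spelled out as a tiny Go program (the 9s / 4.5s / 3s figures are the ones assumed in the paragraph above, not verified defaults):

package main

import (
    "fmt"
    "time"
)

func main() {
    livenessRecordTTL := 9 * time.Second
    renewWithLeft := livenessRecordTTL / 2 // nodes try to renew with half the TTL still left: 4.5s

    livenessRangeLease := 3 * time.Second // hypothetical, shorter lease on the liveness range itself

    // Worst case: the liveness range is unavailable for its entire lease
    // duration starting exactly when the node begins trying to renew.
    margin := renewWithLeft - livenessRangeLease
    fmt.Printf("renewal starts with %s left, liveness range down for %s, slack = %s\n",
        renewWithLeft, livenessRangeLease, margin)
}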

It does seem like a little tuning could get us a long way here. Right now, the expiration-based lease duration matches the liveness record duration:

https://github.com/cockroachdb/cockroach/blob/1d0597990de15fc7722d288a4a3ebd50cf06b50a/pkg/base/config.go#L436-L458

It doesn't seem unreasonable to do something like this:

// RangeLeaseActiveDuration is the duration of the active period of leader
// leases requested.
func (cfg RaftConfig) RangeLeaseActiveDuration() time.Duration {
    rangeLeaseActive, _ := cfg.RangeLeaseDurations()
    if rangeLeaseActive < time.Second {
        // Avoid being overly aggressive in tests; not sure this is needed, but probably.
        return rangeLeaseActive
    }
    // Hand out leases lasting only a third of the configured duration, so that
    // an expired lease (e.g. on the liveness range) can be reacquired well
    // before liveness records themselves expire.
    return rangeLeaseActive / 3
}
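
If the liveness record duration really is ~9s as assumed earlier in the thread, and the expiration-based lease duration currently matches it, this sketch would hand out ~3s leases, which is the figure used in the worst-case arithmetic above.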

There are some loose ends: if we set the lease expiration to 3s but the raft election timeout is 10s, it doesn't buy us anything. I think the election timeout is closer to 3s as well, though, so it should be fine.
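
A rough way to state that loose end, under the simplifying assumption that the liveness range becomes available again only once its old lease has expired and a new raft leader has been elected (both durations below are placeholders):

package main

import (
    "fmt"
    "time"
)

func main() {
    livenessRangeLease := 3 * time.Second
    raftElectionTimeout := 3 * time.Second // assumed; at 10s it would dominate and the shorter lease would buy nothing

    // Recovery is gated by whichever of the two takes longer.
    worstCase := livenessRangeLease
    if raftElectionTimeout > worstCase {
        worstCase = raftElectionTimeout
    }
    fmt.Printf("liveness range recovery takes up to ~%s after its leaseholder dies\n", worstCase)
}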

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!