kvserver: store-rebalancer can get blocked on load-based replica rebalances #79249

Open aayushshah15 opened 2 years ago

aayushshah15 commented 2 years ago

The StoreRebalancer goroutine synchronously executes load-based lease transfers and load-based replica rebalances of the hottest ranges in a loop.

This means that, when a cluster is under duress and load-based replica rebalances are taking a long time, the store rebalancer goroutine can be blocked (along with cheaper actions like load-based lease transfers) for an inordinate amount of time, until the AdminRelocateRange call for each "hot range" being processed either fails or hits its timeout. In other words, if the StoreRebalancer tries to rebalance away one replica each for 100 ranges, and those rebalances are bound to hit their timeout, we won't see any load-based rebalancing on this store for ~100 minutes at a minimum.
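To make the arithmetic concrete, here is a minimal Go sketch (not the actual StoreRebalancer code; the function names are placeholders and the timeout is scaled down) of a synchronous loop in which every relocation call hits its per-call timeout, so the goroutine is blocked for roughly the number of hot ranges times that timeout:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// relocateRange stands in for an AdminRelocateRange call that never makes
// progress: it simply blocks until the caller's timeout expires.
func relocateRange(ctx context.Context, rangeID int) error {
	<-ctx.Done()
	return errors.New("relocate timed out")
}

func main() {
	const (
		numHotRanges   = 100
		perCallTimeout = 10 * time.Millisecond // stands in for a much longer real timeout
	)

	start := time.Now()
	for r := 0; r < numHotRanges; r++ {
		ctx, cancel := context.WithTimeout(context.Background(), perCallTimeout)
		_ = relocateRange(ctx, r) // the whole loop waits on each call in turn
		cancel()
	}
	// With a real per-call timeout on the order of a minute, 100 such calls
	// block the goroutine (and any lease transfers it would do) for ~100 minutes.
	fmt.Printf("blocked for %s across %d ranges\n",
		time.Since(start).Round(time.Millisecond), numHotRanges)
}
```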

We noticed this during an escalation where a single store on a hot node couldn't shed its load away because of this. The logs indicated that the StoreRebalancer goroutine was simply blocked on a ton of AdminRelocateRange calls that were eventually timing out:

[screenshot: logs showing the StoreRebalancer goroutine blocked on AdminRelocateRange calls that eventually timed out]

Nodes 173 and 159 ^ were both nodes that had extremely high read amp during this incident.

@cockroachdb/kv-notifications

Jira issue: CRDB-14656

aayushshah15 commented 2 years ago

@nvanbenschoten suggested that we could split the StoreRebalancer goroutine into two: the first only concerned with lease transfers and the second only concerned with replica rebalances.

The rough proposal was that things would stay similar to how they are today, in that the store rebalancer will still first try to exhaust all lease transfer opportunities before resorting to replica rebalances. The idea is that, once all lease transfer opportunities are exhausted, the store-rebalancer's lease transfer goroutine will simply tell the replica rebalance goroutine (over a channel) to try to rebalance a given range. If there are any opportunities to reconfigure the range to a better set of stores, this second goroutine will go ahead and start executing this change. However, if the replica rebalance goroutine is already in the middle of executing a rebalance, the lease transfer goroutine will bounce off and wait for its next iteration. In other words, the lease transfer goroutine will never "enqueue" more than 1 replica rebalance to be executed at any given time, in order to avoid situations where rebalances are executed arbitrarily later than when they were deemed important.
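A minimal sketch of this handoff, assuming hypothetical names (`rebalanceOp`, `rebalanceWorker`) rather than the actual implementation: the lease-transfer loop uses a non-blocking send on an unbuffered channel, so it bounces off whenever the rebalance goroutine is still busy and retries on its next iteration.

```go
package main

import (
	"fmt"
	"time"
)

// rebalanceOp is a hypothetical stand-in for a fully decided replica
// rebalance that the lease-transfer goroutine hands to the worker.
type rebalanceOp struct {
	rangeID int
}

// rebalanceWorker executes rebalances one at a time; each execution stands
// in for a slow AdminRelocateRange call.
func rebalanceWorker(ops <-chan rebalanceOp, done chan<- struct{}) {
	defer close(done)
	for op := range ops {
		fmt.Printf("  worker: rebalancing r%d (slow)\n", op.rangeID)
		time.Sleep(250 * time.Millisecond)
	}
}

func main() {
	// Unbuffered: a send only succeeds if the worker is idle and waiting,
	// so at most one rebalance is ever handed off at a time.
	ops := make(chan rebalanceOp)
	done := make(chan struct{})
	go rebalanceWorker(ops, done)

	for iter := 0; iter < 6; iter++ {
		// ... exhaust lease transfer opportunities first (elided) ...
		op := rebalanceOp{rangeID: 100 + iter}

		select {
		case ops <- op:
			fmt.Printf("iter %d: handed off r%d to the rebalance goroutine\n", iter, op.rangeID)
		default:
			// Worker is mid-rebalance: bounce off and retry next iteration
			// instead of blocking lease transfers behind it.
			fmt.Printf("iter %d: worker busy, bouncing off\n", iter)
		}
		time.Sleep(100 * time.Millisecond)
	}

	close(ops)
	<-done
}
```

An unbuffered channel means the handoff only succeeds when the worker is idle; a one-slot buffered channel would instead permit exactly one pending rebalance while another executes, which is another plausible reading of "never enqueue more than 1".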

This rough proposal lets us avoid the main hazard outlined in this issue, which is that lease transfers can get blocked behind replica rebalances for an arbitrarily long time.

cc @kvoli and @lidorcarmel

kvoli commented 2 years ago

> The idea is that, once all lease transfer opportunities are exhausted, the store-rebalancer's lease transfer goroutine will simply tell the replica rebalance goroutine (over a channel) to try to rebalance a given range. If there are any opportunities to reconfigure the range to a better set of stores, this second goroutine will go ahead and start executing this change. However, if the replica rebalance goroutine is already in the middle of executing a rebalance, the lease transfer goroutine will bounce off and wait for its next iteration.

Would the range currently enqueued be excluded from consideration in the store rebalancer lease goroutine, while being processed by the replica goroutine?

lidorcarmel commented 2 years ago

> split the StoreRebalancer goroutine into two: the first only concerned with lease transfers and the second only concerned with replica rebalances

To clarify: the first goroutine is actually concerned with both lease transfers and deciding which replica rebalances are needed, meaning it does all the scanning (of the hottest ranges). The second goroutine doesn't scan the ranges at all; it only performs the actual replica rebalance (I guess we can say that long operations run asynchronously rather than in the main loop). Hopefully that's the intention.

> Would the range currently enqueued be excluded from consideration in the store rebalancer lease goroutine, while being processed by the replica goroutine?

I'm assuming yes.
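A minimal sketch, again with hypothetical names rather than the actual implementation, of the division of labor described above: the scanning goroutine decides which rebalances are needed (target stores included) and excludes any range currently being processed, while the executor only performs the slow relocation.

```go
package main

import "fmt"

// plannedRebalance is a hypothetical, fully decided change: the scanning
// goroutine has already picked the target stores, so the executor makes no
// further decisions.
type plannedRebalance struct {
	rangeID      int
	targetStores []int
}

// scanHottestRanges stands in for the existing hot-range scan. It returns
// planned changes rather than raw ranges for the executor to re-examine,
// and skips any range that is currently being processed.
func scanHottestRanges(skip map[int]bool) []plannedRebalance {
	candidates := []plannedRebalance{
		{rangeID: 42, targetStores: []int{1, 4, 7}},
		{rangeID: 99, targetStores: []int{2, 5, 8}},
	}
	var out []plannedRebalance
	for _, c := range candidates {
		if !skip[c.rangeID] {
			out = append(out, c)
		}
	}
	return out
}

// executeRelocation stands in for the slow AdminRelocateRange call.
func executeRelocation(p plannedRebalance) {
	fmt.Printf("relocating r%d to stores %v\n", p.rangeID, p.targetStores)
}

func main() {
	// A range currently handed off to the executor is excluded from the
	// scan, per the question answered above.
	inFlight := map[int]bool{42: true}

	for _, p := range scanHottestRanges(inFlight) {
		executeRelocation(p) // in the real proposal this runs on the other goroutine
	}
}
```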