aayushshah15 opened 2 years ago
@nvanbenschoten suggested that we could split the StoreRebalancer goroutine into two: the first concerned only with lease transfers, and the second concerned only with replica rebalances.
A rough proposal: things would stay largely as they are today, in that the store rebalancer will still first try to exhaust all lease transfer opportunities before resorting to replica rebalances. Once all lease transfer opportunities are exhausted, the store rebalancer's lease transfer goroutine will simply tell the replica rebalance goroutine (over a channel) to try to rebalance a given range. If there are any opportunities to reconfigure the range onto a better set of stores, this second goroutine will go ahead and start executing the change. However, if the replica rebalance goroutine is already in the middle of executing a rebalance, the lease transfer goroutine will bounce off and wait for its next iteration. In other words, the lease transfer goroutine will never "enqueue" more than 1 replica rebalance at any given time, in order to avoid situations where rebalances are executed arbitrarily later than when they were deemed important.
This rough proposal lets us avoid the main hazard outlined by this issue, which is that lease transfers can get blocked behind replica rebalances for an arbitrarily long amount of time.
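The "at most one enqueued rebalance, bounce off if busy" handoff described above maps naturally onto a capacity-1 channel with a non-blocking send. Here is a minimal, self-contained sketch of that pattern; the `rebalanceRequest` type and `offer` helper are hypothetical illustrations, not CockroachDB code:

```go
package main

import "fmt"

// rebalanceRequest is a hypothetical stand-in for the range that the
// lease-transfer goroutine asks the replica-rebalance goroutine to act on.
type rebalanceRequest struct{ rangeID int }

func main() {
	// Capacity-1 channel: at most one replica rebalance is ever "enqueued",
	// so a queued rebalance can never grow arbitrarily stale.
	rebalanceCh := make(chan rebalanceRequest, 1)

	// offer attempts a non-blocking handoff. If the slot is already taken
	// (the replica-rebalance goroutine hasn't picked up the previous
	// request), the lease-transfer loop bounces off and retries on its
	// next iteration instead of blocking.
	offer := func(r rebalanceRequest) bool {
		select {
		case rebalanceCh <- r:
			return true
		default:
			return false
		}
	}

	fmt.Println(offer(rebalanceRequest{rangeID: 1})) // true: slot was free
	fmt.Println(offer(rebalanceRequest{rangeID: 2})) // false: r1 still pending

	// The replica-rebalance goroutine would drain the channel and execute
	// the (potentially slow) rebalance; here we just receive once to show
	// the slot freeing up.
	req := <-rebalanceCh
	fmt.Printf("rebalancing r%d\n", req.rangeID)
	fmt.Println(offer(rebalanceRequest{rangeID: 2})) // true: slot free again
}
```

The key property is that the lease-transfer side never blocks on the slow path: a full channel is treated as "worker busy, try again next iteration" rather than a wait.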
cc @kvoli and @lidorcarmel
> The idea is that, once all lease transfer opportunities are exhausted, the store-rebalancer's lease transfer goroutine will simply tell the replica rebalance goroutine (over a channel) to try to rebalance a given range. If there are any opportunities to reconfigure the range to a better set of stores, this second goroutine will go ahead and start executing this change. However, if the replica rebalance goroutine is already in the middle of executing a rebalance, the lease transfer goroutine will bounce off and wait for its next iteration.
Would the range currently enqueued be excluded from consideration in the store rebalancer lease goroutine, while being processed by the replica goroutine?
> split the StoreRebalancer goroutine into two: the first that's only concerned with lease transfers and the second that's only concerned with replica rebalances
To clarify: the first is actually concerned with both lease transfers and deciding which replica rebalances are needed, meaning it does all the scanning (of the hottest ranges); the second doesn't scan the ranges at all, it only performs the actual replica rebalance (in other words, long operations run asynchronously and not in the main loop). Hopefully that's the intention.
> Would the range currently enqueued be excluded from consideration in the store rebalancer lease goroutine, while being processed by the replica goroutine?
I'm assuming yes.
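If the in-flight range is excluded, the lease-transfer goroutine's scan would simply skip it. A tiny sketch of what that exclusion could look like (all names here are hypothetical, not actual CockroachDB code):

```go
package main

import "fmt"

func main() {
	// Range IDs from the hottest-ranges scan, hottest first (made-up data).
	hottestRanges := []int{7, 3, 9}

	// The range currently being processed by the replica-rebalance
	// goroutine; the lease-transfer goroutine remembers it so it can
	// exclude it from consideration.
	inFlight := 3

	for _, rangeID := range hottestRanges {
		if rangeID == inFlight {
			// Skip the range until its replica rebalance completes.
			continue
		}
		fmt.Printf("considering r%d for lease transfer\n", rangeID)
	}
}
```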
The `StoreRebalancer` goroutine synchronously executes load-based lease transfers and load-based replica rebalances of the hottest ranges in a loop. This means that, when a cluster is under duress and load-based replica rebalancing is taking a ~large amount of time, this can block the store rebalancer goroutine (blocking cheaper actions like load-based lease transfers) for an inordinate amount of time, until the `AdminRelocateRange` call for each "hot range" to be processed either fails or hits its timeout. In other words, if the `StoreRebalancer` tries to rebalance away 1 replica each for 100 ranges, and those rebalances are bound to hit their timeout, we won't see any load-based rebalancing on this store for ~100 minutes at a minimum.

We noticed this during an escalation where a single store on a hot node couldn't shed its load away because of this. The logs indicated that the `StoreRebalancer` goroutine was simply blocked on a ton of `AdminRelocateRange` calls that were eventually timing out. Nodes 173 and 159 were both nodes that had extremely high read amp during this incident.

@cockroachdb/kv-notifications
Jira issue: CRDB-14656