Open leventov opened 5 years ago
Another instance where the same thing is split between the balancing and loading sides is BalancerSegmentHolder.lifetime (currently configurable, equals 15) and CoordinatorDynamicConfig.replicantLifetime (which affects the "lifetime" for loading).
If currentlyReplicating.removeSegment is called faster than currentlyReplicating.addSegment, maxReplicants in ReplicationThrottler can't restrict the number of replica segment assignments. @leventov
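For illustration only, here is a minimal, hypothetical counter-based throttle (not Druid's actual ReplicationThrottler internals, all names are illustrative) showing the effect described in this comment: if in-flight segments are removed as fast as they are added within a run, the size check never reaches the cap.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a size-capped "in flight" throttle, illustrating the
// reported problem. Names are illustrative, not Druid's actual internals.
public class InFlightThrottleSketch
{
  private final int maxReplicants = 10;  // assumed cap per burst
  private final Set<String> currentlyReplicating = new HashSet<>();

  public boolean canCreateReplicant()
  {
    return currentlyReplicating.size() < maxReplicants;
  }

  public void addSegment(String segmentId)
  {
    currentlyReplicating.add(segmentId);
  }

  public void removeSegment(String segmentId)
  {
    currentlyReplicating.remove(segmentId);
  }

  public static void main(String[] args)
  {
    InFlightThrottleSketch throttle = new InFlightThrottleSketch();
    int assigned = 0;
    for (int i = 0; i < 1000; i++) {
      if (throttle.canCreateReplicant()) {
        throttle.addSegment("segment-" + i);
        assigned++;
        // If loads complete (and are removed) as fast as they are assigned,
        // the size check never reaches the cap...
        throttle.removeSegment("segment-" + i);
      }
    }
    // ...so far more than maxReplicants assignments happen in a single run.
    System.out.println("assigned = " + assigned);  // prints 1000, not 10
  }
}
```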
Druid's segment balancing logic is split between two classes: DruidCoordinatorBalancer and LoadRule. They both skip segment loading and balancing until all their actions from the previous non-skipped run (which resulted in a balancing burst) have completely settled down (per tier), i.e. there are no loading segments on any of the historical nodes in a tier.

The "hard" reason for obeying this pattern in LoadRule is that SegmentReplicantLookup is effectively immutable and recreated at the beginning of a Coordinator's cycle (see CoordinatorHistoricalManagerRunnable).

The reason for obeying this pattern in DruidCoordinatorBalancer is not documented anywhere, but as far as I can tell, it's done to make balancing decisions more optimal: otherwise we would quickly fill up the loading queues of the "universally best" historical nodes (e.g. because they've just entered the cluster and are still almost empty) and be forced to move segments to less optimal nodes. (This is also probably the primary criterion for choosing the maxSegmentsToMove configuration value, although the documentation is silent about this.) This "soft" reason is implicitly in play in LoadRule, too.
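For concreteness, the balancing side's per-tier gate can be sketched roughly like this (a simplified paraphrase of the pattern described above; field and method names follow the issue text, but the bodies are illustrative, not Druid's actual source):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified paraphrase of the per-tier "burst - wait" gate on the balancing side.
public class BalancerBurstGateSketch
{
  // tier -> segments moved in the previous burst that are still being loaded
  private final Map<String, ConcurrentHashMap<String, Object>> currentlyMovingSegments =
      new ConcurrentHashMap<>();

  void balanceTier(String tier, int maxSegmentsToMove)
  {
    Map<String, Object> moving =
        currentlyMovingSegments.computeIfAbsent(tier, t -> new ConcurrentHashMap<>());

    // Skip this run entirely until the previous burst has settled for the tier.
    if (!moving.isEmpty()) {
      return;
    }

    // Otherwise issue a new burst of up to maxSegmentsToMove move requests,
    // registering each moved segment in `moving` and removing it once its load
    // on the destination historical completes.
  }
}
```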
The problem is that LoadRule and DruidCoordinatorBalancer use independent mechanisms to implement the "burst - wait" pattern: LoadRule uses ReplicationThrottler (implicitly, in the canCreateReplicant() method), while DruidCoordinatorBalancer uses its own currentlyMovingSegments field.
As you can notice from ReplicationThrottler's name, it is also responsible for capping the number of segments that can be loaded in LoadRule during a single burst, via the replicationThrottleLimit configuration. In other words, replicationThrottleLimit has exactly the same purpose as maxSegmentsToMove. But they don't mention each other, and they effectively add up to each other.
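As a hypothetical illustration: with replicationThrottleLimit = 10 and maxSegmentsToMove = 5, up to 15 segments may end up sitting in a tier's load queues during a single burst, because the two limits are applied independently.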
I see the solution in gathering all the balancing logic into a single class and deprecating either replicationThrottleLimit or maxSegmentsToMove in favor of the other as the unified limit for balancing bursts.

Also, there is a problem with waiting until exactly zero segments from the previous burst remain loading in a tier: struggling loaders might make the pauses between bursts longer and thus lower the effective balancing and loading throughput. Instead, we may initiate the next burst when only 10-20% of the segments from the previous burst remain unloaded. We don't want to forgo the "burst - wait" pattern completely and make only as many balancing and loading decisions on each Coordinator's run as add up to the unified limit, because this may keep the cluster always close to the threshold of suboptimality of balancing decisions and thus make balancing decisions less optimal overall.
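A minimal sketch of what such a gate could look like, assuming a single unified burst limit and a configurable "resume" fraction (all names here are hypothetical, not existing Druid configuration or code):

```java
// Hypothetical sketch of a unified burst gate: the next burst may start once
// the number of still-loading segments from the previous burst drops below a
// fraction of that burst's size. Names and structure are illustrative only.
public class UnifiedBurstGateSketch
{
  private final int unifiedBurstLimit;  // would replace replicationThrottleLimit / maxSegmentsToMove
  private final double resumeFraction;  // e.g. 0.1-0.2, per the 10-20% suggestion above

  private int previousBurstSize = 0;

  public UnifiedBurstGateSketch(int unifiedBurstLimit, double resumeFraction)
  {
    this.unifiedBurstLimit = unifiedBurstLimit;
    this.resumeFraction = resumeFraction;
  }

  /** @param stillLoading segments from the previous burst not yet loaded in this tier */
  public boolean canStartNextBurst(int stillLoading)
  {
    // Instead of waiting for exactly zero, start the next burst once only a small
    // tail of the previous burst remains, so struggling loaders don't stall the tier.
    return stillLoading <= previousBurstSize * resumeFraction;
  }

  public void recordBurst(int burstSize)
  {
    previousBurstSize = Math.min(burstSize, unifiedBurstLimit);
  }
}
```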
To be able to initiate the next burst while the previous burst is not yet complete, SegmentReplicantLookup should be made concurrently updatable: instead of being recreated before each Coordinator's run, it would be an ever-living concurrent data structure.
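A minimal sketch of such a concurrently updatable lookup, assuming per-tier replica counts are the main state (the real SegmentReplicantLookup tracks more than this; names and shape are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, ever-living replacement for the per-run SegmentReplicantLookup:
// replica counts are kept in concurrent maps and adjusted as load/drop actions
// complete, instead of being rebuilt at the start of each Coordinator cycle.
public class ConcurrentReplicantLookupSketch
{
  // segmentId -> (tier -> number of loaded replicas)
  private final ConcurrentHashMap<String, ConcurrentHashMap<String, Integer>> loadedReplicants =
      new ConcurrentHashMap<>();

  public void segmentLoaded(String segmentId, String tier)
  {
    loadedReplicants
        .computeIfAbsent(segmentId, id -> new ConcurrentHashMap<>())
        .merge(tier, 1, Integer::sum);
  }

  public void segmentDropped(String segmentId, String tier)
  {
    Map<String, Integer> perTier = loadedReplicants.get(segmentId);
    if (perTier != null) {
      // Decrement, removing the entry when the count reaches zero.
      perTier.computeIfPresent(tier, (t, count) -> count > 1 ? count - 1 : null);
    }
  }

  public int getLoadedReplicants(String segmentId, String tier)
  {
    Map<String, Integer> perTier = loadedReplicants.get(segmentId);
    return perTier == null ? 0 : perTier.getOrDefault(tier, 0);
  }
}
```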