apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

On "balancing burst, then wait total loading" pattern and replication/loading cap in Coordinator balancing #7159

Open leventov opened 5 years ago

leventov commented 5 years ago

Druid's segment balancing logic is split between two classes: DruidCoordinatorBalancer and LoadRule. Both of them skip segment loading and balancing until all actions from their previous non-skipped run (which produced a balancing burst) have completely settled down (per tier), i.e. until there are no loading segments on any of the historical nodes in the tier.
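
For illustration, here is a minimal Java sketch of that gate, assuming a simplified view where we only see per-server counts of in-flight loading segments (the names are hypothetical, not Druid's actual API):

```java
import java.util.List;

class BurstGateSketch {
    /** Returns true when the previous burst has fully settled in this tier. */
    static boolean tierSettled(List<Integer> loadingSegmentCountsPerServer) {
        // Both DruidCoordinatorBalancer and LoadRule wait for *zero*
        // in-flight segments before issuing a new burst of actions.
        return loadingSegmentCountsPerServer.stream().allMatch(n -> n == 0);
    }

    public static void main(String[] args) {
        System.out.println(tierSettled(List.of(0, 0, 0))); // true: next burst may start
        System.out.println(tierSettled(List.of(0, 3, 0))); // false: keep waiting
    }
}
```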

The "hard" reason for obeying this pattern in LoadRule is that SegmentReplicantLookup is effectively immutable and recreated at the beginning of a Coordinator's cycle (see CoordinatorHistoricalManagerRunnable).

The reason for obeying this pattern in DruidCoordinatorBalancer is not documented anywhere, but as far as I can tell, it is done to make balancing decisions more optimal: otherwise we would quickly fill up the loading queues of the "universally best" historical nodes (e.g. nodes that have just entered the cluster and are still almost empty) and be forced to move segments to less optimal nodes. (This is also probably the primary criterion for choosing the maxSegmentsToMove configuration value, although the documentation is silent about it.) This "soft" reason is implicitly in play in LoadRule, too.

The problem is that LoadRule and DruidCoordinatorBalancer use independent mechanisms to implement the "burst, then wait" pattern: LoadRule uses ReplicationThrottler (implicitly, via the canCreateReplicant() method), while DruidCoordinatorBalancer uses its own currentlyMovingSegments field.

As ReplicationThrottler's name suggests, it is also responsible for capping the number of segment replicas that LoadRule can assign during a single burst, via the replicationThrottleLimit configuration. In other words, replicationThrottleLimit has exactly the same purpose as maxSegmentsToMove, yet neither's documentation mentions the other, and the two limits effectively add up.
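
A rough illustration of that additive effect (the numbers here are only illustrative; check the coordinator dynamic config docs for the actual defaults):

```java
// Not Druid code: just shows that two separately tracked caps sum up per tier.
class CombinedInFlightCap {
    public static void main(String[] args) {
        int replicationThrottleLimit = 10; // caps LoadRule's replica assignments per burst
        int maxSegmentsToMove = 5;         // caps DruidCoordinatorBalancer's moves per burst

        // ReplicationThrottler and currentlyMovingSegments are independent,
        // so a tier can carry in-flight work from both paths at once:
        System.out.println("Effective per-tier in-flight cap: "
                           + (replicationThrottleLimit + maxSegmentsToMove)); // 15
    }
}
```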

I see the solution as gathering all balancing logic into a single class and deprecating either replicationThrottleLimit or maxSegmentsToMove in favor of the other, which would then serve as the unified limit for balancing bursts.

Also, there is a problem with waiting until exactly zero loading segments remain from the previous burst in a tier: struggling loaders can stretch the pauses between bursts, lowering the effective balancing and loading throughput. Instead, we could initiate the next burst when only 10-20% of the segments from the previous burst remain unloaded. We don't want to forgo the "burst, then wait" pattern entirely and simply make, on each Coordinator run, just enough balancing and loading decisions to add up to the unified limit, because that would keep the cluster permanently close to the threshold at which balancing decisions turn suboptimal, and thus make balancing decisions less optimal overall.
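
A minimal sketch of that relaxed gate, assuming a hypothetical 20% threshold (the fraction would presumably be configurable):

```java
class RelaxedBurstGate {
    private static final double REMAINING_FRACTION_THRESHOLD = 0.2;

    static boolean mayStartNextBurst(int previousBurstSize, int stillLoading) {
        if (previousBurstSize == 0) {
            return true; // nothing was issued last time; no reason to wait
        }
        // With the strict rule this would be `stillLoading == 0`; a few
        // straggling loaders would then stall the whole tier.
        return stillLoading <= previousBurstSize * REMAINING_FRACTION_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(mayStartNextBurst(100, 30)); // false: too much still in flight
        System.out.println(mayStartNextBurst(100, 15)); // true: stragglers don't stall us
    }
}
```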

To be able to initiate the next burst while the previous one is still incomplete, SegmentReplicantLookup should be made concurrently updatable: instead of being recreated before each Coordinator run, it should be an ever-living concurrent data structure.
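
One possible shape for such an ever-living structure, sketched with a ConcurrentHashMap of atomic counters (an assumption about the design, not Druid's actual SegmentReplicantLookup):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ConcurrentReplicantLookup {
    // segmentId -> (tier -> number of loaded + loading replicas)
    private final Map<String, Map<String, AtomicInteger>> replicants = new ConcurrentHashMap<>();

    void onReplicaAssigned(String segmentId, String tier) {
        replicants.computeIfAbsent(segmentId, id -> new ConcurrentHashMap<>())
                  .computeIfAbsent(tier, t -> new AtomicInteger())
                  .incrementAndGet();
    }

    void onReplicaDropped(String segmentId, String tier) {
        Map<String, AtomicInteger> byTier = replicants.get(segmentId);
        if (byTier != null) {
            AtomicInteger count = byTier.get(tier);
            if (count != null) {
                count.decrementAndGet();
            }
        }
    }

    int getTotalReplicants(String segmentId, String tier) {
        Map<String, AtomicInteger> byTier = replicants.get(segmentId);
        if (byTier == null) {
            return 0;
        }
        AtomicInteger count = byTier.get(tier);
        return count == null ? 0 : count.get();
    }

    public static void main(String[] args) {
        ConcurrentReplicantLookup lookup = new ConcurrentReplicantLookup();
        lookup.onReplicaAssigned("seg-1", "_default_tier");
        lookup.onReplicaAssigned("seg-1", "_default_tier");
        System.out.println(lookup.getTotalReplicants("seg-1", "_default_tier")); // 2
    }
}
```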

leventov commented 5 years ago

Another instance where the same concern is split between the balancing and loading sides is BalancerSegmentHolder.lifetime (currently hard-coded to 15) and CoordinatorDynamicConfig.replicantLifetime (which controls the analogous "lifetime" on the loading side).

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

stale[bot] commented 5 years ago

This issue is no longer marked as stale.

licl2014 commented 4 years ago

If currentlyReplicating.removeSegment is called faster than currentlyReplicating.addSegment, then maxReplicants in ReplicationThrottler can't restrict the number of replica segment assignments. @leventov
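
To make that race concrete, here is a simplified stand-in for ReplicationThrottler's check-then-act sequence (not the real class; the loop just simulates removals keeping pace with additions):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ThrottlerRaceSketch {
    private final int maxReplicants = 2;
    private final Set<String> currentlyReplicating = ConcurrentHashMap.newKeySet();

    boolean canCreateReplicant() {
        // Check...
        return currentlyReplicating.size() < maxReplicants;
    }

    void addSegment(String segmentId) {
        // ...then act: nothing ties this back to the check above.
        currentlyReplicating.add(segmentId);
    }

    void removeSegment(String segmentId) {
        currentlyReplicating.remove(segmentId);
    }

    public static void main(String[] args) {
        ThrottlerRaceSketch t = new ThrottlerRaceSketch();
        int assigned = 0;
        for (int i = 0; i < 5; i++) {
            if (t.canCreateReplicant()) {
                t.addSegment("segment-" + i);
                assigned++;
            }
            // If loads complete (removeSegment) as fast as they are registered,
            // the set never fills up and the cap never binds:
            t.removeSegment("segment-" + i);
        }
        System.out.println("Assigned " + assigned + " replicas despite maxReplicants=2");
    }
}
```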