Closed jihoonson closed 9 months ago
@jihoonson - DiskNormalizedCostBalancerStrategy has been working out quite well for us in general, whereas CostBalancerStrategy generally leads to a non-uniform distribution, at least when a new cluster is spun up. Perhaps the default load strategy should be switched to DiskNormalizedCostBalancerStrategy? Unless, of course, other folks have run into other issues with it.
That’s interesting. I think I’ve seen a similar skewed distribution with CostBalancerStrategy, which went away when we switched to DiskNormalizedCostBalancerStrategy. I think it’s a good idea to make it the default, but do you know what is causing the skewed distribution?
Motivation
One of the coordinator's responsibilities is balancing segments across historicals. We support several balancer strategies, including the cost balancer strategy implemented in https://github.com/apache/druid/pull/2972. The cost balancer strategy (and its variants) is probably the most popular strategy at the moment. It works well in most production cases, but it can sometimes lead to an imbalanced segment distribution. Because segment balancing happens over a long period, it is not easy to debug why the balancer sometimes makes a suboptimal decision.
Proposed changes
This proposal adds a new tool that emulates the coordinator's segment balancing and reports metrics. It would accept the following input configurations.
The result would be the following metrics, reported both across all datasources and per datasource.
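To make the idea concrete, here is a minimal sketch of what such an emulator could look like. Everything in it is an assumption made for illustration: the `Server` class, the `simulate` and `report` functions, the metric names, and the toy `least_used_strategy` (which simply picks the server with the lowest disk usage ratio) are hypothetical and are not Druid's actual cost function, inputs, or metrics. The point is only that a pluggable strategy can be replayed over a list of segments offline and that the resulting distribution can be summarized overall and per datasource.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import pstdev


@dataclass
class Server:
    """Hypothetical stand-in for a historical node."""
    name: str
    max_size: int
    segments: list = field(default_factory=list)  # (datasource, size) pairs

    @property
    def used(self):
        return sum(size for _, size in self.segments)


def least_used_strategy(servers, segment):
    """Toy strategy (not Druid's): pick the server with the lowest usage ratio."""
    _, size = segment
    candidates = [s for s in servers if s.used + size <= s.max_size]
    if not candidates:
        raise RuntimeError("no server has room for the segment")
    return min(candidates, key=lambda s: s.used / s.max_size)


def report(servers):
    """Summarize the distribution overall and per datasource."""
    usage = {s.name: s.used / s.max_size for s in servers}
    per_datasource = {}
    for s in servers:
        for datasource, _ in s.segments:
            per_datasource.setdefault(datasource, Counter())[s.name] += 1
    return {
        "segments_per_server": {s.name: len(s.segments) for s in servers},
        "usage_ratio": usage,
        "usage_ratio_stddev": pstdev(list(usage.values())),
        "segments_per_datasource": {d: dict(c) for d, c in per_datasource.items()},
    }


def simulate(servers, segments, strategy):
    """Assign each segment using the given strategy, then report metrics."""
    for segment in segments:
        strategy(servers, segment).segments.append(segment)
    return report(servers)


# Example run with made-up inputs: three equal servers, two datasources.
servers = [Server(f"historical-{i}", max_size=1000) for i in range(3)]
segments = [("wiki", 10)] * 20 + [("logs", 10)] * 10
metrics = simulate(servers, segments, least_used_strategy)
print(metrics)
```

Because the strategy is passed in as a plain function, a second strategy could be run over the same segment list and the two reports compared side by side, which is the kind of debugging the proposal is after.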
Rationale
Problems in segment balancing usually appear only after a production cluster has been running for a while, so they are hard to reproduce locally or in a separate test cluster. Soak testing is also difficult because it sometimes requires running the cluster for a fairly long time before the problem shows up.
Operational impact
There is no operational impact.
Future work