apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

A simulator for segment balancing by the coordinator #9087

Closed jihoonson closed 9 months ago

jihoonson commented 4 years ago

Motivation

One of the coordinator's responsibilities is balancing segments across historicals. We support several balancer strategies, including the cost balancer strategy implemented in https://github.com/apache/druid/pull/2972. The cost balancer strategy (and its variants) is probably the most popular strategy for now. It works well in most production cases, but it can sometimes lead to an imbalanced segment distribution. Because segment balancing happens over a long period, it is not easy to debug why the balancer sometimes makes a suboptimal decision.

Proposed changes

This proposal is to add a new tool which emulates the segment balancing of the coordinator and reports metrics. It would accept the below input configurations.

The result would be metrics as below across all datasources and per datasource.
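
To make the idea of an emulator concrete, here is a minimal Python sketch. It is not the proposed tool and does not use Druid's cost function. Everything in it is a hypothetical stand-in: the `Server` class, the greedy "move the smallest segment from the fullest server to the emptiest one" strategy, and the single reported metric (the standard deviation of bytes used per server). The inputs and the metric are illustrative assumptions, not the configurations and metrics this proposal has in mind.

```python
from dataclasses import dataclass, field
from statistics import pstdev


@dataclass
class Server:
    """Hypothetical stand-in for a historical; holds segment sizes in bytes."""
    name: str
    segments: list = field(default_factory=list)

    @property
    def used(self):
        return sum(self.segments)


def balance_once(servers):
    """One emulated coordinator run with a toy greedy strategy.

    Moves the smallest segment from the fullest server to the emptiest one,
    but only if that strictly narrows the gap between the two.
    This is an assumption for illustration, NOT Druid's cost balancer.
    Returns True if a segment was moved.
    """
    src = max(servers, key=lambda s: s.used)
    dst = min(servers, key=lambda s: s.used)
    if src is dst or not src.segments:
        return False
    seg = min(src.segments)
    if dst.used + seg >= src.used:
        return False  # moving would not improve the balance
    src.segments.remove(seg)
    dst.segments.append(seg)
    return True


def simulate(servers, max_runs):
    """Repeat balance_once up to max_runs times and record a metric per run."""
    history = []
    for run in range(max_runs):
        moved = balance_once(servers)
        history.append({
            "run": run,
            "moved": moved,
            "stddev_used": pstdev([s.used for s in servers]),
        })
        if not moved:
            break  # nothing left to improve under this toy strategy
    return history
```

Calling `simulate` on a list of `Server` objects returns one record per emulated run, so the per-run metric can be inspected afterwards to see where the distribution stops improving. A real simulator would plug in the actual balancer strategy in place of `balance_once`.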

Rationale

Problems in segment balancing usually appear after a production cluster has been running for a while. They are not easy to replicate locally or in a separate test cluster. Soak testing is not easy either, because replicating the problem sometimes requires running the cluster for a fairly long time.

Operational impact

There is no operational impact.

Future work

samarthjain commented 4 years ago

@jihoonson - DiskNormalizedCostBalancerStrategy has been working out quite well for us in general, whereas CostBalancerStrategy generally leads to a non-uniform distribution, at least when a new cluster is spun up. Possibly the default load strategy should be switched to DiskNormalizedCostBalancerStrategy? Unless, of course, other folks have run into other issues with it.

jihoonson commented 4 years ago

That’s interesting. I think I’ve seen a similar skewed distribution with CostBalancerStrategy, which went away when we switched to DiskNormalizedCostBalancerStrategy. I think it’s a good idea to make it the default, but do you know what is causing the skewed distribution?

kfaraz commented 9 months ago

Fixed by https://github.com/apache/druid/pull/13074