Open ajbeamon opened 4 years ago
The team collection, which manages the team removal procedure, is designed to be unaware of the load on teams. This design prevents the team distribution from being affected by users' skewed traffic, which would hurt fault tolerance.
The solution should make data distribution (DD) smarter about moving data around, rather than making the team collection aware of this problem.
The data distribution team removal procedure runs when the set of machines in a cluster changes (for example, through exclusion/inclusion or adding new machines). When this happens, it often seems to result in a significant imbalance in the number of bytes stored on different processes.
This is a problem because some of the processes end up storing significantly more data than they did previously (in one example I saw, the worst process stored 25% more), which may not be easily accommodated in fuller clusters.
The imbalance eventually heals once team removal is complete and rebalancing movement can correct it.
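
To make the capacity concern concrete, here is a rough back-of-the-envelope sketch in Python. It is not FoundationDB code: the function names and the starting utilization values are hypothetical, and only the 25% growth figure comes from the observation above. It assumes the extra data scales a process's stored bytes by a fixed fraction.

```python
def utilization_after_growth(current: float, growth_fraction: float) -> float:
    """Disk utilization of a process after its stored bytes grow by growth_fraction.

    current is a fraction of capacity (0.8 means 80% full).
    """
    return current * (1.0 + growth_fraction)


def max_safe_utilization(growth_fraction: float, capacity_limit: float = 1.0) -> float:
    """Highest starting utilization that still fits under capacity_limit after growth."""
    return capacity_limit / (1.0 + growth_fraction)


if __name__ == "__main__":
    growth = 0.25  # worst-case growth reported in this issue
    # Hypothetical starting utilizations, chosen only for illustration.
    for current in (0.60, 0.80, 0.90):
        after = utilization_after_growth(current, growth)
        fits = "fits" if after <= 1.0 else "does NOT fit"
        print(f"start {current:.0%} -> after {after:.1%} ({fits})")
    print(f"max safe starting utilization: {max_safe_utilization(growth):.0%}")
```

Under these assumptions, any process that is more than 80% full before the team removal would run out of space during the temporary imbalance, even though the cluster as a whole has room.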