Controller can violate fault zone placement constraints during intermediate state

Describe the bug

For a resource with a replication factor of 3, the controller allows an intermediate state where there are 3 replicas, but 2 of them are in the same fault zone. This can occur through the following sequence of states:

Replica 1 (FZ A), Replica 2 (FZ B), Replica 3 (FZ C)
Replica 1 (FZ A), Replica 2 (FZ B), Replica 3 (FZ C), Replica 4 (FZ B) <- n+1 movement
Replica 1 (FZ A), Replica 2 (FZ B), Replica 4 (FZ B)

This is still an intermediate state, and the controller will immediately try to bootstrap a 4th replica so that it can drop one of the replicas in FZ B. However, if the cluster cannot complete the n+1 movement on the targeted nodes due to capacity constraints, the partition will remain in this state indefinitely. No WAGED calculation errors will be thrown. Even with no capacity constraints preventing movement, this state persists for as long as it takes to bootstrap the other replica from OFFLINE→SLAVE.
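
To make the problematic state concrete, here is a minimal sketch of a check that flags a partition whose replicas share a fault zone. The class and method names are hypothetical and not part of Helix; the input is assumed to be the fault zone of each instance currently hosting a replica.

```java
import java.util.Collection;
import java.util.HashSet;

// Hypothetical helper, not a Helix API.
final class FaultZoneSpread {
    /**
     * Returns true when at least two replicas of one partition share a fault zone,
     * i.e. the number of distinct zones is smaller than the number of replicas.
     */
    public static boolean hasColocatedReplicas(Collection<String> replicaZones) {
        return new HashSet<>(replicaZones).size() < replicaZones.size();
    }
}
```

For the states above, zones (A, B, C) pass the check, while (A, B, B) is flagged.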

There is also a follow-up issue: the stoppable check will not prevent the fault zone that holds 2 of the 3 replicas from being taken down. This is because the stoppable check parallelizes same-MZ checks and works on the assumption that there will never be more than 1 replica in a single fault zone.

To Reproduce

Have not determined how to reproduce this specific case, but it occurs during replica movement within the cluster. It seems to occur when a node that is in the current state but not in the preference list occupies the same fault zone as a node that is in the preference list but not in the current state. The controller will create an n+1 replica on the node in the preference list, but may then drop a different replica rather than the one already in the same fault zone as the newly bootstrapped replica.

Expected behavior

The controller should take topology into account when determining which replica to drop first. If the controller drops a replica from a fault zone that already has another replica in it, then we will not decrease the number of fault zones in which replicas for the partition exist.
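
A rough sketch of this selection rule, under the assumption that the controller knows the fault zone of every instance hosting a replica and the set of instances eligible to be dropped (those in the current state but not in the preference list). All names here are hypothetical and do not reflect the actual Helix controller code.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not a Helix API.
final class FaultZoneAwareDrop {
    /**
     * zoneOf: instance -> fault zone, for every instance currently hosting a replica.
     * dropCandidates: instances hosting a replica that are not in the preference list.
     * Picks the candidate whose fault zone currently holds the most replicas.
     */
    public static Optional<String> pickReplicaToDrop(
            Map<String, String> zoneOf, List<String> dropCandidates) {
        // Count how many replicas each fault zone currently holds.
        Map<String, Integer> replicasPerZone = new HashMap<>();
        for (String zone : zoneOf.values()) {
            replicasPerZone.merge(zone, 1, Integer::sum);
        }
        return dropCandidates.stream()
            .filter(zoneOf::containsKey)
            .max(Comparator.comparingInt(
                (String instance) -> replicasPerZone.get(zoneOf.get(instance))));
    }
}
```

With replicas on n1 (A), n2 (B), n3 (C), n4 (B) and drop candidates n3 and n2, this picks n2, so the partition still spans three fault zones after the drop.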

Additional context

There are two fixes needed:

Change the controller logic for deciding replica drop priority so that replicas in the MZs holding the most replicas are dropped first.
Address the stoppable check not accounting for multiple replicas in the same MZ during the min_active_replica check.
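
For the second fix, the idea can be sketched as counting the active replicas that would remain outside the zone being stopped, instead of assuming at most one replica per zone. This is only an illustration; the class, method, and parameter names are hypothetical and are not the existing stoppable check implementation.

```java
import java.util.Map;

// Hypothetical helper, not a Helix API.
final class ZoneStoppableCheck {
    /**
     * activeReplicaZones: instance -> fault zone, for every active replica of one partition.
     * Returns true if stopping the whole zone still leaves at least minActiveReplicas active.
     */
    public static boolean canStopZone(
            Map<String, String> activeReplicaZones, String zoneToStop, int minActiveReplicas) {
        // Count replicas that stay active because they live outside the stopped zone.
        long remaining = activeReplicaZones.values().stream()
            .filter(zone -> !zone.equals(zoneToStop))
            .count();
        return remaining >= minActiveReplicas;
    }
}
```

For replicas on n1 (A), n2 (B), n4 (B) with a minimum of 2 active replicas, stopping zone B would be rejected because only one replica remains, while stopping zone A would be allowed.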
