The cause of the test case being flaky is due to switching the resources from SEMI_AUTO to FULL_AUTO while the cluster is in MaintenanceMode.
When a resource is SEMI_AUTO, the MM rebalancer is not used because that would cause the preferenceList to potentially change and never recover to what it previously was. In the test case, we were switching the resources from SEMI_AUTO to FULL_AUTO causing the MM rebalancer to be used. There is then a RACE condition between the controller computing a new IdealState which drops the offline instances from the preferenceList, making the IdealState invalid for SEMI_AUTO, and us setting the resources back to SEMI_AUTO. If the controller wins, persisting the IdealState again with SEMI_AUTO will throw an exception.
Removing this logic to just test that isEvacuateFinished is true since all resources are SEMI_AUTO. We test isEvacuateFinished on FULL_AUTO resources in other places like TestZkHelixAdmin and TestInstanceOperation.
Changes that Break Backward Compatibility (Optional)
NA
Commits
My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
Subject is separated from body by a blank line
Subject is limited to 50 characters (not including Jira issue reference)
Subject does not end with a period
Subject uses the imperative mood ("add", not "adding")
Body wraps at 72 characters
Body explains "what" and "why", not "how"
Code Quality
My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)
The cause of the test case being flaky is due to switching the resources from SEMI_AUTO to FULL_AUTO while the cluster is ### Issues
Description
The cause of the test case being flaky is due to switching the resources from SEMI_AUTO to FULL_AUTO while the cluster is in MaintenanceMode.
When a resource is SEMI_AUTO, the MM rebalancer is not used because that would cause the preferenceList to potentially change and never recover to what it previously was. In the test case, we were switching the resources from SEMI_AUTO to FULL_AUTO causing the MM rebalancer to be used. There is then a RACE condition between the controller computing a new IdealState which drops the offline instances from the preferenceList, making the IdealState invalid for SEMI_AUTO, and us setting the resources back to SEMI_AUTO. If the controller wins, persisting the IdealState again with SEMI_AUTO will throw an exception.
Removing this logic to just test that isEvacuateFinished is true since all resources are SEMI_AUTO. We test isEvacuateFinished on FULL_AUTO resources in other places like TestZkHelixAdmin and TestInstanceOperation.
Tests
Changes that Break Backward Compatibility (Optional)
NA
Commits
Code Quality