apache / helix

Mirror of Apache Helix
Apache License 2.0

[apache/helix] -- Issue during onboarding resources without instances #2782

Closed himanshukandwal closed 4 months ago

himanshukandwal commented 4 months ago

Description

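When adding a new WAGED resource with a tag and without any instances against that tag, we observe an NPE coming from the system. To solve this issue, we are adding a check in the ResourceComputationStage so that such resources are excluded from the pipeline computation and are only considered when there are actual resource partitions (>0) to be assigned to the instances.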

org.apache.helix.HelixRebalanceException: Failed to compute for delayed rebalance overwrites in cluster ZnRecord=CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances, {DELAY_REBALANCE_ENABLED=true, DELAY_REBALANCE_TIME=3000000, FAULT_ZONE_TYPE=zone, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, TOPOLOGY=/zone/instance, TOPOLOGY_AWARE_ENABLED=true}{REBALANCE_PREFERENCE={EVENNESS=0, LESS_MOVEMENT=10}}{}, Stat=Stat {_version=4, _creationTime=1710804015571, _modifiedTime=1710804047240, _ephemeralOwner=0} Failure Type: INVALID_CLUSTER_STATUS
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:428) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.emergencyRebalance(WagedRebalancer.java:501) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleAssignment(WagedRebalancer.java:339) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleStates(WagedRebalancer.java:316) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:248) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:406) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:258) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:91) [classes/:?]
    at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) [classes/:?]
    at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [classes/:?]
    at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [classes/:?]
Caused by: java.lang.NullPointerException
    at org.apache.helix.controller.rebalancer.util.DelayedRebalanceUtil.findToBeAssignedReplicasForMinActiveReplica(DelayedRebalanceUtil.java:335) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModel(ClusterModelProvider.java:257) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModelForDelayedRebalanceOverwrites(ClusterModelProvider.java:82) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:415) ~[classes/:?]
    ... 10 more
94184 [main] ERROR org.apache.helix.controller.rebalancer.waged.WagedRebalancer [] - Failed to compute for delayed rebalance overwrites in cluster CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances
94188 [main] ERROR org.apache.helix.controller.rebalancer.waged.WagedRebalancer [] - Failed to calculate the new assignments.
org.apache.helix.HelixRebalanceException: Failed to compute for delayed rebalance overwrites in cluster ZnRecord=CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances, {DELAY_REBALANCE_ENABLED=true, DELAY_REBALANCE_TIME=3000000, FAULT_ZONE_TYPE=zone, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, TOPOLOGY=/zone/instance, TOPOLOGY_AWARE_ENABLED=true}{REBALANCE_PREFERENCE={EVENNESS=0, LESS_MOVEMENT=10}}{}, Stat=Stat {_version=4, _creationTime=1710804015571, _modifiedTime=1710804047240, _ephemeralOwner=0} Failure Type: INVALID_CLUSTER_STATUS
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:428) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.emergencyRebalance(WagedRebalancer.java:514) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleAssignment(WagedRebalancer.java:339) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleStates(WagedRebalancer.java:316) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:248) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:406) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:258) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:91) [classes/:?]
    at org.apache.helix.util.RebalanceUtil.runStage(RebalanceUtil.java:235) [classes/:?]
    at org.apache.helix.tools.ClusterVerifiers.BestPossibleExternalViewVerifier.calcBestPossState(BestPossibleExternalViewVerifier.java:444) [classes/:?]
    at org.apache.helix.tools.ClusterVerifiers.BestPossibleExternalViewVerifier.verifyState(BestPossibleExternalViewVerifier.java:293) [classes/:?]
    at org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:209) [classes/:?]
    at org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:229) [classes/:?]
    at org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances.testExpandClusterWithResourceWithoutInstances(TestWagedClusterExpansionWithAddingResourcesBeforeInstances.java:151) [test-classes/:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
Caused by: java.lang.NullPointerException
    at org.apache.helix.controller.rebalancer.util.DelayedRebalanceUtil.findToBeAssignedReplicasForMinActiveReplica(DelayedRebalanceUtil.java:335) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModel(ClusterModelProvider.java:257) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModelForDelayedRebalanceOverwrites(ClusterModelProvider.java:82) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:415) ~[classes/:?]
    ... 37 more
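
The guard described in the description can be illustrated with a minimal, self-contained sketch. This is not the actual Helix change: `ResourceFilterSketch`, `filterAssignable`, and the map-of-partition-lists model are hypothetical names chosen for illustration, assuming only that a resource with no partitions to assign should be skipped before later stages dereference its data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; these names are illustrative and are not Helix classes. */
public class ResourceFilterSketch {

  /**
   * Returns only the resources that have at least one partition to assign.
   * A null or empty partition list means there is nothing to place yet,
   * so the resource is skipped rather than passed on to later stages.
   */
  public static Map<String, List<String>> filterAssignable(
      Map<String, List<String>> partitionsByResource) {
    Map<String, List<String>> assignable = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : partitionsByResource.entrySet()) {
      List<String> partitions = entry.getValue();
      if (partitions != null && !partitions.isEmpty()) {
        assignable.put(entry.getKey(), partitions);
      }
    }
    return assignable;
  }

  public static void main(String[] args) {
    Map<String, List<String>> input = new LinkedHashMap<>();
    input.put("dbWithInstances", List.of("p0", "p1"));
    input.put("dbWithoutInstances", List.of());  // tagged resource, no matching instances
    input.put("dbNotYetComputed", null);         // the kind of entry that could trigger an NPE
    System.out.println(filterAssignable(input).keySet()); // prints [dbWithInstances]
  }
}
```

Filtering early, instead of null-checking at each later use, keeps the downstream rebalancing code unchanged; a skipped resource is simply picked up on a later pipeline run once it has partitions to assign.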

Tests

➜  helix_os_hk git:(hkandwal/waged-adding-resources-without-capacity) mvn clean install -Dmaven.test.skip.exec=true && mvn test -o -Dtest=TestWagedClusterExpansionWithAddingResourcesBeforeInstances -pl=helix-core

[INFO] --- surefire:3.0.0-M3:test (default-test) @ helix-core ---
[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances
Start zookeeper at localhost:2183 in thread main
AfterClass: TestWagedClusterExpansionWithAddingResourcesBeforeInstances called.
Shut down zookeeper at port 2183 in thread main
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.354 s - in org.apache.helix.integration.rebalancer.WagedRebalancer.TestWagedClusterExpansionWithAddingResourcesBeforeInstances
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
[INFO] Loading execution data file /Users/hkandwal/Documents/workspaces/projects/helix_os_hk/helix-core/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Core' with 950 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  54.225 s
[INFO] Finished at: 2024-03-27T16:29:31-07:00
[INFO] ------------------------------------------------------------------------

junkaixue commented 4 months ago

Is this just for adding a test?

himanshukandwal commented 4 months ago

> Is this just for adding a test?

Hey @junkaixue, yes. I first reproduced the issue with the test and have now added the fix. Would you please review it when you get a chance?

himanshukandwal commented 4 months ago

> Please remove WIP from title

Done

himanshukandwal commented 4 months ago

This PR has been reviewed and approved by @zpinto.

Final Commit Message: When adding a new WAGED resource with a tag and without any instances against that tag, we are observing NPE coming from the system. To solve this issue we are adding a check in the ResourceComputationStage to have such resources excluded from the pipeline computation and only be considered when there are actual resource partitions (>0) to be assigned to the instances.

junkaixue commented 4 months ago

lgtm. Let's wait for the tests.

himanshukandwal commented 4 months ago

Hey @junkaixue, the PR CI was successful.

I also ran the full CI suite in my repo, and that's successful as well: https://github.com/himanshukandwal/helix/actions/runs/8471701884/job/23212239715