apache / helix

NPE Issues when we add a WAGED resource without instances against the resource tag #2781

Closed by himanshukandwal 4 months ago

himanshukandwal commented 4 months ago

Describe the bug

We observe an NPE when adding a WAGED resource while no instances carry the resource's tag.

org.apache.helix.HelixRebalanceException: Failed to compute for delayed rebalance overwrites in cluster ZnRecord=CLUSTER_TestWagedClusterExpansionWithAddingResourcesBeforeInstances, {DELAY_REBALANCE_ENABLED=true, DELAY_REBALANCE_TIME=3000000, FAULT_ZONE_TYPE=zone, PERSIST_BEST_POSSIBLE_ASSIGNMENT=true, TOPOLOGY=/zone/instance, TOPOLOGY_AWARE_ENABLED=true}{REBALANCE_PREFERENCE={EVENNESS=0, LESS_MOVEMENT=10}}{}, Stat=Stat {_version=4, _creationTime=1710804015571, _modifiedTime=1710804047240, _ephemeralOwner=0} Failure Type: INVALID_CLUSTER_STATUS
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:428) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.emergencyRebalance(WagedRebalancer.java:501) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleAssignment(WagedRebalancer.java:339) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeBestPossibleStates(WagedRebalancer.java:316) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:248) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:406) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:258) [classes/:?]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:91) [classes/:?]
    at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) [classes/:?]
    at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [classes/:?]
    at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [classes/:?]
Caused by: java.lang.NullPointerException
    at org.apache.helix.controller.rebalancer.util.DelayedRebalanceUtil.findToBeAssignedReplicasForMinActiveReplica(DelayedRebalanceUtil.java:335) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModel(ClusterModelProvider.java:257) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.model.ClusterModelProvider.generateClusterModelForDelayedRebalanceOverwrites(ClusterModelProvider.java:82) ~[classes/:?]
    at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.handleDelayedRebalanceMinActiveReplica(WagedRebalancer.java:415) ~[classes/:?]
    ... 10 more

To Reproduce

  1. Create a WAGED resource whose resource tag is not assigned to any instance
  2. Set the number of partitions to 0
  3. Trigger a cluster rebalance

Expected behavior

The controller should handle this scenario gracefully instead of throwing an NPE.
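
To illustrate what graceful handling could look like, here is a minimal, stdlib-only sketch of a null guard. It is not the Helix code and not necessarily how the project fixed the issue: the class `MinActiveReplicaGuard`, the method `findReplicasToAssign`, and the map layout are all hypothetical stand-ins, assuming the NPE comes from looking up per-resource data that does not exist for a resource with no matching instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; not part of Apache Helix.
final class MinActiveReplicaGuard {

    /**
     * Returns one "resource/partition" entry for every replica that is missing
     * to reach minActiveReplicas. Resources without any recorded assignment
     * (for example, a resource with 0 partitions or no tagged instances)
     * are skipped instead of being dereferenced.
     */
    static List<String> findReplicasToAssign(
            List<String> resources,
            Map<String, Map<String, Integer>> activeReplicasByResource,
            int minActiveReplicas) {
        List<String> toAssign = new ArrayList<>();
        for (String resource : resources) {
            Map<String, Integer> partitions = activeReplicasByResource.get(resource);
            if (partitions == null || partitions.isEmpty()) {
                // Nothing is assigned for this resource yet: skip it rather than throw.
                continue;
            }
            for (Map.Entry<String, Integer> entry : partitions.entrySet()) {
                int missing = minActiveReplicas - entry.getValue();
                for (int i = 0; i < missing; i++) {
                    toAssign.add(resource + "/" + entry.getKey());
                }
            }
        }
        return toAssign;
    }

    public static void main(String[] args) {
        // "emptyResource" has no entry in the map, mimicking the reported scenario.
        List<String> result = findReplicasToAssign(
                List.of("emptyResource", "db"),
                Map.of("db", Map.of("db_0", 1)),
                2);
        System.out.println(result); // prints [db/db_0]
    }
}
```

The point of the sketch is only the early `continue`: a resource with no assignment data contributes nothing to the result, and the remaining resources are still processed.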