apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Helix exceptions when running integration tests #4355

Open kishoreg opened 5 years ago

kishoreg commented 5 years ago

Looks like this started happening after we added leadControllerResource. Let's clean up the other Helix exceptions as well when we look into this.

2019/06/22 09:05:16.929 WARN [BestPossibleStateCalcStage] [HelixController-async_tasks-OfflineClusterIntegrationTest] Event aaadf5d1_DEFAULT : Failed to calculate best possible states for 1 resources.
2019/06/22 09:05:21.954 WARN [AutoRebalancer] [HelixController-pipeline-default-OfflineClusterIntegrationTest] Resource leadControllerResource has tag controller but no configured participants have this tag
2019/06/22 09:05:21.954 ERROR [CRUSHPlacementAlgorithm] [HelixController-pipeline-default-OfflineClusterIntegrationTest] 1 nodes of type INSTANCE were requested but the tree has only 0 nodes!
2019/06/22 09:05:21.955 ERROR [BestPossibleStateCalcStage] [HelixController-pipeline-default-OfflineClusterIntegrationTest] Event 512cdf47_DEFAULT : Error computing assignment for resource leadControllerResource. Skipping.
java.lang.IllegalStateException: null
    at org.apache.helix.controller.rebalancer.strategy.crushMapping.CRUSHPlacementAlgorithm$Selector.select(CRUSHPlacementAlgorithm.java:308) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.strategy.crushMapping.CRUSHPlacementAlgorithm.select(CRUSHPlacementAlgorithm.java:119) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.doSelect(CrushRebalanceStrategy.java:174) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.select(CrushRebalanceStrategy.java:140) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.computePartitionAssignment(CrushRebalanceStrategy.java:92) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.computePartitionAssignment(CrushRebalanceStrategy.java:48) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.strategy.AbstractEvenDistributionRebalanceStrategy.computePartitionAssignment(AbstractEvenDistributionRebalanceStrategy.java:89) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.strategy.AbstractEvenDistributionRebalanceStrategy.computePartitionAssignment(AbstractEvenDistributionRebalanceStrategy.java:49) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.AutoRebalancer.computeNewIdealState(AutoRebalancer.java:129) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.rebalancer.AutoRebalancer.computeNewIdealState(AutoRebalancer.java:51) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleState(BestPossibleStateCalcStage.java:245) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:121) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:77) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:68) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:640) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.GenericHelixController.access$400(GenericHelixController.java:117) ~[helix-core-0.9.0.jar:0.9.0]
    at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1168) ~[helix-core-0.9.0.jar:0.9.0]

jackjlli commented 5 years ago

This exception is thrown from Helix code because the lead controller resource is disabled for now. Once the logic for the lead controller resource is committed and the resource is enabled, the exception will go away.
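
For readers unfamiliar with this failure mode, the log lines above (a resource tagged `controller`, no participant carrying that tag, one node requested from an empty tree) can be illustrated with a minimal sketch. This is a toy model and not Helix's actual implementation: the class `TagPlacementSketch`, its methods, and the instance name `instance_1` are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of tag-filtered placement. Hypothetical names; not Helix code.
final class TagPlacementSketch {

  // Returns the instances whose tag set contains the given tag.
  static List<String> candidatesWithTag(Map<String, Set<String>> instanceTags, String tag) {
    List<String> candidates = new ArrayList<>();
    for (Map.Entry<String, Set<String>> entry : instanceTags.entrySet()) {
      if (entry.getValue().contains(tag)) {
        candidates.add(entry.getKey());
      }
    }
    return candidates;
  }

  // Picks the first `replicas` candidates, failing when too few are available.
  static List<String> select(List<String> candidates, int replicas) {
    if (candidates.size() < replicas) {
      throw new IllegalStateException(
          replicas + " nodes requested but only " + candidates.size() + " available");
    }
    return new ArrayList<>(candidates.subList(0, replicas));
  }

  public static void main(String[] args) {
    Map<String, Set<String>> instanceTags = new TreeMap<>();

    // No participant carries the "controller" tag yet, so selection fails.
    instanceTags.put("instance_1", Collections.<String>emptySet());
    try {
      select(candidatesWithTag(instanceTags, "controller"), 1);
    } catch (IllegalStateException e) {
      System.out.println("placement failed: " + e.getMessage());
    }

    // Once a participant carries the tag, the same request succeeds.
    instanceTags.put("instance_1", Collections.singleton("controller"));
    System.out.println("placed on: " + select(candidatesWithTag(instanceTags, "controller"), 1));
  }
}
```

In this model the failure disappears as soon as one participant carries the tag, which is consistent with the expectation that the exception stops once the resource is enabled.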

mcvsubbu commented 5 years ago

This seems like a Helix issue to me. A resource disabled by an operator should not throw exceptions, right? @jackjlli, can you file an issue with Helix, including the exception stack? Until Helix fixes the exception, we should ignore it, I suppose.

kishoreg commented 5 years ago

I see. I also noticed that QuickStart throws an exception at the end, and most integration tests take longer to set up the cluster. Is this also related to the Helix change?

jackjlli commented 5 years ago

The issue is filed: https://github.com/apache/helix/issues/322. The slowness may also come from the new Helix release, because the logic by which the Helix controller acquires and prepares leadership changed in version 0.9. But, as they said, the new release takes the right approach.