Closed kaisun2000 closed 3 years ago
This may be a really just slow for controller to rebalance. Try increasing the timeout value?
2020-08-14T19:41:23.6851005Z ---- before set error states! 2020-08-14T19:41:23.6860024Z ---- Enter loop for resource TestResource 2020-08-14T19:41:24.7386035Z ---- enter verifyState! 2020-08-14T19:41:24.7571720Z 4624154 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Start loading the abnormal state resolvers with configuration {} 2020-08-14T19:41:24.7573277Z 4624154 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Finish loading the abnormal state resolvers {} 2020-08-14T19:41:27.9875103Z 4627384 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event ClusterStateVerifier : Reloaded InstanceConfig for cluster UNKNOWN_CLUSTER, DEFAULT pipeline. Keys: [localhost_12919, localhost_12926, localhost_12927, localhost_12918, localhost_12922, localhost_12923, localhost_12924, localhost_12925, localhost_12920, localhost_12921] 2020-08-14T19:41:29.0332442Z 4628430 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event ClusterStateVerifier : Reloaded ResourceConfig for cluster UNKNOWN_CLUSTER, DEFAULT pipeline. Cnt: 0
2020-08-14T19:41:30.0553061Z 4629446 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(952009c2_TASK)] INFO org.apache.helix.controller.dataproviders.WorkflowControllerDataProvider - Event 952009c2_TASK : END: WorkflowControllerDataProvider.refresh() for cluster TestEnableCompression_testEnableCompressionResource, started at 1597434041207 took 48841 for TASK pipeline 2020-08-14T19:41:31.2014525Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No ClusterConfig change for cluster TestEnableCompression_testEnableCompressionResource, pipeline TASK 2020-08-14T19:41:31.2015842Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No ideal state change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T19:41:31.2016840Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No live instance change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T19:41:31.2017691Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No instance config change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T19:41:31.2022446Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No resource config change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T19:41:51.6018228Z 4650999 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Start loading the abnormal state resolvers with configuration {} 2020-08-14T19:41:51.6029574Z 4651000 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Finish loading the abnormal state resolvers {} 2020-08-14T19:41:53.7880201Z 4653185 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 2a930739_DEFAULT : Reloaded InstanceConfig for cluster TestEnableCompression_testEnableCompressionResource, DEFAULT pipeline. Keys: [localhost_12919, localhost_12926, localhost_12927, localhost_12918, localhost_12922, localhost_12923, localhost_12924, localhost_12925, localhost_12920, localhost_12921]** 2020-08-14T19:41:53.7990366Z 4653196 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 2a930739_DEFAULT : Reloaded ResourceConfig for cluster TestEnableCompression_testEnableCompressionResource, DEFAULT pipeline. Cnt: 0 2020-08-14T19:42:05.7283887Z 4665125 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Event 2a930739_DEFAULT : Reloaded CustomizedStateConfig for cluster TestEnableCompression_testEnableCompressionResource, DEFAULT pipeline. 2020-08-14T19:42:05.7290542Z 4665126 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Remove customizedView for state: [] 2020-08-14T19:42:06.7469715Z 4666134 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Event 2a930739_DEFAULT : END: ResourceControllerDataProvider.refresh() for cluster TestEnableCompression_testEnableCompressionResource, started at 1597434111569 took 15167 for DEFAULT pipeline
2020-08-14T19:43:39.7597235Z 4759157 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Event ClusterStateVerifier : Reloaded CustomizedStateConfig for cluster TestEnableCompression_testEnableCompressionResource, DEFAULT pipeline. 2020-08-14T19:43:39.7602837Z 4759157 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Remove customizedView for state: [] 2020-08-14T19:43:39.7637117Z 4759161 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Event ClusterStateVerifier : END: ResourceControllerDataProvider.refresh() for cluster TestEnableCompression_testEnableCompressionResource, started at 1597434084738 took 135025 for DEFAULT pipeline 2020-08-14T19:43:39.7656386Z ---- before verify instances !
Reload CustomerizedStateConfig in GitHub environment took 130 secs. This maybe the bottleneck
After enable the timeout long enough, we have:
2020-08-14T23:50:42.7834929Z [ERROR] testEnableCompressionResource(org.apache.helix.integration.TestEnableCompression) Time elapsed: 361.59 s <<< FAILURE! 2020-08-14T23:50:42.7835061Z java.lang.AssertionError: expected:
but was: 2020-08-14T23:50:42.7835175Z at org.apache.helix.integration.TestEnableCompression.testEnableCompressionResource(TestEnableCompression.java:120) Put it another way, the assertion failed due to 300s default timeout.
BestPossibleExternalViewVerifier verifier =
new BestPossibleExternalViewVerifier.Builder(clusterName).setZkClient(_gZkClient)
.setExpectLiveInstances(expectedLiveInstances).setResources(expectedResources).build();
System.out.println("before verify by polling");
boolean reuslt = verifier.verifyByPolling(true); ---> this line failed.
Assert.assertTrue((reuslt));
This is due to verifyByPolling timeout
2020-08-14T23:37:40.2214816Z ---- before verify instances ! 2020-08-14T23:37:40.2215242Z ---- before get extViews! 2020-08-14T23:37:42.4176199Z ---- before add empty idealstate if not there! 2020-08-14T23:37:42.4177192Z ---- before calcualte bestPossOutput! 2020-08-14T23:37:43.4766380Z 5052117 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No ClusterConfig change for cluster TestEnableCompression_testEnableCompressionResource, pipeline TASK 2020-08-14T23:37:43.4774575Z 5052118 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No ideal state change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T23:37:43.4783040Z 5052119 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No live instance change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T23:37:43.4789140Z 5052120 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No instance config change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T23:37:43.4795757Z 5052121 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No resource config change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T23:37:49.9322043Z ---- before set error states! 2020-08-14T23:37:49.9324273Z ---- Enter loop for resource TestResource 2020-08-14T23:37:50.9876666Z END testEnableCompressionResource at Fri Aug 14 23:37:50 UTC 2020, took: 361589ms.
But giving enough time to both clusterVerifier and test method, the one can pass which previous fails consistently.
The main time it spends are CustomizedStateConfig
in the verifier's first stage of _dataProvider.refresh(). It seem the VM in github CPU is slow or CPU cycle is limited.
See the log here: (file logs 259) close to 9 minutes.
2020-08-15T01:54:42.2471239Z START testEnableCompressionResource at Sat Aug 15 01:54:42 UTC 2020 2020-08-15T01:54:42.2475315Z START TestEnableCompression_testEnableCompressionResource at Sat Aug 15 01:54:42 UTC 2020 ...... 020-08-15T02:02:19.7213784Z ---- Enter loop for resource TestResource 2020-08-15T02:02:20.5237517Z compressed paths:[/TestEnableCompression_testEnableCompressionResource/EXTERNALVIEW/TestResource, /TestEnableCompression_testEnableCompressionResource/IDEALSTATES/TestResource] 2020-08-15T02:02:20.6172380Z 4858699 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(e3145653_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: controller_0 is not leader for TestEnableCompression_testEnableCompressionResource. Pipeline will not be invoked 2020-08-15T02:02:20.6173128Z 4858700 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(e3145653_TASK)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: controller_0 is not leader for TestEnableCompression_testEnableCompressionResource. Pipeline will not be invoked 2020-08-15T02:02:20.6853124Z END TestEnableCompression_testEnableCompressionResource at Sat Aug 15 02:02:20 UTC 2020 2020-08-15T02:02:20.6856135Z END testEnableCompressionResource at Sat Aug 15 02:02:20 UTC 2020, took: 458439ms. 2020-08-15T02:02:20.6856320Z AfterClass:TestEnableCompression afterclass of ZkTestBase called! 2020-08-15T02:02:21.6858577Z TestEnableCompression has active threads cnt:739
Maybe need to review is CustomerizedStateConfig is essential
Fix in #1227; in sum, three fixes:
testng.xml have default timeout 15 minutes,
verifier also 15 minutes
verifier need to log out ERROR if failed due to time-out.
There is still some chance that this test may not complete in 15 minutes.
2020-08-21T06:13:22.1505513Z 7283789 [TestNGInvoker-testEnableCompressionResource()] ERROR org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier - verifier timeout out with timeout 900000,. stack trace [java.lang.Thread.getStackTrace(Thread.java:1559), org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:208), org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:229), org.apache.helix.integration.TestEnableCompression.testEnableCompressionResource(TestEnableCompression.java:119), sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method), sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62), sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), java.lang.reflect.Method.invoke(Method.java:498), org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:76), org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:46), org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:37), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511), java.util.concurrent.FutureTask.run(FutureTask.java:266), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:748)]
Close test unstable tickets since we have an automatic tracking mechanism https://github.com/apache/helix/pull/1757 now for tracking the most recent test issues.
Log:
Test in intellij normally finished in 30 sec.
hangs
boolean result = verifier.verify(120000L);
// ------> hanges here.