apache / helix

Mirror of Apache Helix
Apache License 2.0
461 stars 224 forks source link

Fix flaky TestEnableCompression hangs(timeout) often in github #1268

Closed kaisun2000 closed 3 years ago

kaisun2000 commented 4 years ago

Log:

2020-08-12T00:08:59.7823040Z [ERROR] testEnableCompressionResource(org.apache.helix.integration.TestEnableCompression) Time elapsed: 301.309 s <<< FAILURE! 2020-08-12T00:08:59.7832929Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testEnableCompressionResource() didn't finish within the time-out 300000 2020-08-12T00:08:59.7840902Z at org.apache.helix.integration.TestEnableCompression.testEnableCompressionResource(TestEnableCompression.java:116) 2020-08-12T00:08:59.7841208Z

2020-08-12T12:18:59.8288893Z ---------- Test Class TestPauseSignal thread leakage detected! --------------- 2020-08-12T12:19:04.7889521Z START testEnableCompressionResource at Wed Aug 12 12:19:04 UTC 2020 2020-08-12T12:19:04.7896157Z START TestEnableCompression_testEnableCompressionResource at Wed Aug 12 12:19:04 UTC 2020 2020-08-12T12:24:05.2712624Z END testEnableCompressionResource at Wed Aug 12 12:24:05 UTC 2020, took: 300482ms. 2020-08-12T12:24:05.2713094Z AfterClass:TestEnableCompression afterclass of ZkTestBase called! 2020-08-12T12:24:06.3960081Z TestEnableCompression has active threads cnt:838 2020-08-12T12:24:06.4334615Z zookeeper server threads has 4 thread

Test in intellij normally finished in 30 sec.

hangs boolean result = verifier.verify(120000L); // ------> hanges here.

public void testEnableCompressionResource() throws Exception {
    String className = TestHelper.getTestClassName();
    String methodName = TestHelper.getTestMethodName();
    String clusterName = className + "_" + methodName;

    System.out.println("START " + clusterName + " at " + new Date(System.currentTimeMillis()));

    MockParticipantManager[] participants = new MockParticipantManager[5];
    // ClusterSetup setupTool = new ClusterSetup(ZK_ADDR);
    int numNodes = 10;
    TestHelper.setupCluster(clusterName, ZK_ADDR, 12918, // participant port
        "localhost", // participant name prefix
        "TestDB", // resource name prefix
        0, // no resources, will be added later
        0, // partitions per resource
        numNodes, // number of nodes
        0, // replicas
        "OnlineOffline", false); // dont rebalance
    List<String> instancesInCluster =
        _gSetupTool.getClusterManagementTool().getInstancesInCluster(clusterName);
    String resourceName = "TestResource";
    Set<String> expectedResources = new HashSet<>();
    expectedResources.add(resourceName);
    CustomModeISBuilder customModeISBuilder = new CustomModeISBuilder(resourceName);

    int numPartitions = 10000;
    int numReplica = 3;
    customModeISBuilder.setNumPartitions(numPartitions);
    customModeISBuilder.setNumReplica(numReplica);
    customModeISBuilder.setStateModel("OnlineOffline");
    for (int p = 0; p < numPartitions; p++) {
      String partitionName = resourceName + "_" + p;
      customModeISBuilder.add(partitionName);
      for (int r = 0; r < numReplica; r++) {
        String instanceName = instancesInCluster.get((p % numNodes + r) % numNodes);
        customModeISBuilder.assignInstanceAndState(partitionName, instanceName, "ONLINE");
      }
    }

    IdealState idealstate = customModeISBuilder.build();
    idealstate.getRecord().setBooleanField("enableCompression", true);
    _gSetupTool.getClusterManagementTool().addResource(clusterName, resourceName, idealstate);

    HelixZkClient.ZkClientConfig clientConfig = new HelixZkClient.ZkClientConfig();
    clientConfig.setZkSerializer(new BytesPushThroughSerializer())
        .setOperationRetryTimeout((long) (60 * 1000)).setConnectInitTimeout(60 * 1000);
    HelixZkClient zkClient = SharedZkClientFactory.getInstance()
        .buildZkClient(new HelixZkClient.ZkConnectionConfig(ZK_ADDR), clientConfig);

    ClusterControllerManager controller =
        new ClusterControllerManager(ZK_ADDR, clusterName, "controller_0");
    controller.syncStart();

    Set<String> expectedLiveInstances = new HashSet<>();
    // start participants
    for (int i = 0; i < 5; i++) {
      String instanceName = "localhost_" + (12918 + i);
      participants[i] = new MockParticipantManager(ZK_ADDR, clusterName, instanceName);
      participants[i].syncStart();
      expectedLiveInstances.add(instanceName);
    }

    BestPossibleExternalViewVerifier verifier =
        new BestPossibleExternalViewVerifier.Builder(clusterName).setZkAddr(ZK_ADDR)
            .setExpectLiveInstances(expectedLiveInstances).setResources(expectedResources).build();
    boolean result = verifier.verify(120000L);      // ------> hanges here.
    Assert.assertTrue(result);

    List<String> compressedPaths = new ArrayList<>();
    findCompressedZNodes(zkClient, "/" + clusterName, compressedPaths);

    System.out.println("compressed paths:" + compressedPaths);
    // ONLY IDEALSTATE and EXTERNAL VIEW must be compressed
    Assert.assertEquals(compressedPaths.size(), 2);
    String idealstatePath = PropertyPathBuilder.idealState(clusterName, resourceName);
    String externalViewPath = PropertyPathBuilder.externalView(clusterName, resourceName);
    Assert.assertTrue(compressedPaths.contains(idealstatePath));
    Assert.assertTrue(compressedPaths.contains(externalViewPath));

    // clean up
    controller.syncStop();
    for (int i = 0; i < 5; i++) {
      participants[i].syncStop();
    }
kaisun2000 commented 4 years ago

This may be a really just slow for controller to rebalance. Try increasing the timeout value?

kaisun2000 commented 4 years ago

2020-08-14T19:41:23.6851005Z ---- before set error states! 2020-08-14T19:41:23.6860024Z ---- Enter loop for resource TestResource 2020-08-14T19:41:24.7386035Z ---- enter verifyState! 2020-08-14T19:41:24.7571720Z 4624154 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Start loading the abnormal state resolvers with configuration {} 2020-08-14T19:41:24.7573277Z 4624154 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Finish loading the abnormal state resolvers {} 2020-08-14T19:41:27.9875103Z 4627384 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event ClusterStateVerifier : Reloaded InstanceConfig for cluster UNKNOWN_CLUSTER, DEFAULT pipeline. Keys: [localhost_12919, localhost_12926, localhost_12927, localhost_12918, localhost_12922, localhost_12923, localhost_12924, localhost_12925, localhost_12920, localhost_12921] 2020-08-14T19:41:29.0332442Z 4628430 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event ClusterStateVerifier : Reloaded ResourceConfig for cluster UNKNOWN_CLUSTER, DEFAULT pipeline. Cnt: 0

2020-08-14T19:41:30.0553061Z 4629446 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(952009c2_TASK)] INFO org.apache.helix.controller.dataproviders.WorkflowControllerDataProvider - Event 952009c2_TASK : END: WorkflowControllerDataProvider.refresh() for cluster TestEnableCompression_testEnableCompressionResource, started at 1597434041207 took 48841 for TASK pipeline 2020-08-14T19:41:31.2014525Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No ClusterConfig change for cluster TestEnableCompression_testEnableCompressionResource, pipeline TASK 2020-08-14T19:41:31.2015842Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No ideal state change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T19:41:31.2016840Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No live instance change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T19:41:31.2017691Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No instance config change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T19:41:31.2022446Z 4630597 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(445a28b7_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 445a28b7_TASK : No resource config change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T19:41:51.6018228Z 4650999 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Start loading the abnormal state resolvers with configuration {} 2020-08-14T19:41:51.6029574Z 4651000 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Finish loading the abnormal state resolvers {} 2020-08-14T19:41:53.7880201Z 4653185 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 2a930739_DEFAULT : Reloaded InstanceConfig for cluster TestEnableCompression_testEnableCompressionResource, DEFAULT pipeline. Keys: [localhost_12919, localhost_12926, localhost_12927, localhost_12918, localhost_12922, localhost_12923, localhost_12924, localhost_12925, localhost_12920, localhost_12921]** 2020-08-14T19:41:53.7990366Z 4653196 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 2a930739_DEFAULT : Reloaded ResourceConfig for cluster TestEnableCompression_testEnableCompressionResource, DEFAULT pipeline. Cnt: 0 2020-08-14T19:42:05.7283887Z 4665125 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Event 2a930739_DEFAULT : Reloaded CustomizedStateConfig for cluster TestEnableCompression_testEnableCompressionResource, DEFAULT pipeline. 2020-08-14T19:42:05.7290542Z 4665126 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Remove customizedView for state: [] 2020-08-14T19:42:06.7469715Z 4666134 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(2a930739_DEFAULT)] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Event 2a930739_DEFAULT : END: ResourceControllerDataProvider.refresh() for cluster TestEnableCompression_testEnableCompressionResource, started at 1597434111569 took 15167 for DEFAULT pipeline

2020-08-14T19:43:39.7597235Z 4759157 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Event ClusterStateVerifier : Reloaded CustomizedStateConfig for cluster TestEnableCompression_testEnableCompressionResource, DEFAULT pipeline. 2020-08-14T19:43:39.7602837Z 4759157 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Remove customizedView for state: [] 2020-08-14T19:43:39.7637117Z 4759161 [TestNGInvoker-testEnableCompressionResource()] INFO org.apache.helix.controller.dataproviders.ResourceControllerDataProvider - Event ClusterStateVerifier : END: ResourceControllerDataProvider.refresh() for cluster TestEnableCompression_testEnableCompressionResource, started at 1597434084738 took 135025 for DEFAULT pipeline 2020-08-14T19:43:39.7656386Z ---- before verify instances !

kaisun2000 commented 4 years ago

Reload CustomerizedStateConfig in GitHub environment took 130 secs. This maybe the bottleneck

kaisun2000 commented 4 years ago

After enable the timeout long enough, we have:

2020-08-14T23:50:42.7834929Z [ERROR] testEnableCompressionResource(org.apache.helix.integration.TestEnableCompression) Time elapsed: 361.59 s <<< FAILURE! 2020-08-14T23:50:42.7835061Z java.lang.AssertionError: expected: but was: 2020-08-14T23:50:42.7835175Z at org.apache.helix.integration.TestEnableCompression.testEnableCompressionResource(TestEnableCompression.java:120) Put it another way, the assertion failed due to 300s default timeout.

    BestPossibleExternalViewVerifier verifier =
        new BestPossibleExternalViewVerifier.Builder(clusterName).setZkClient(_gZkClient)
            .setExpectLiveInstances(expectedLiveInstances).setResources(expectedResources).build();

    System.out.println("before verify by polling");
    boolean reuslt = verifier.verifyByPolling(true);  ---> this line failed.
    Assert.assertTrue((reuslt));

This is due to verifyByPolling timeout

2020-08-14T23:37:40.2214816Z ---- before verify instances ! 2020-08-14T23:37:40.2215242Z ---- before get extViews! 2020-08-14T23:37:42.4176199Z ---- before add empty idealstate if not there! 2020-08-14T23:37:42.4177192Z ---- before calcualte bestPossOutput! 2020-08-14T23:37:43.4766380Z 5052117 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No ClusterConfig change for cluster TestEnableCompression_testEnableCompressionResource, pipeline TASK 2020-08-14T23:37:43.4774575Z 5052118 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No ideal state change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T23:37:43.4783040Z 5052119 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No live instance change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T23:37:43.4789140Z 5052120 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No instance config change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T23:37:43.4795757Z 5052121 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(62dc8393_TASK)] INFO org.apache.helix.controller.dataproviders.BaseControllerDataProvider - Event 62dc8393_TASK : No resource config change for TestEnableCompression_testEnableCompressionResource cluster, TASK pipeline 2020-08-14T23:37:49.9322043Z ---- before set error states! 2020-08-14T23:37:49.9324273Z ---- Enter loop for resource TestResource 2020-08-14T23:37:50.9876666Z END testEnableCompressionResource at Fri Aug 14 23:37:50 UTC 2020, took: 361589ms.

kaisun2000 commented 4 years ago

But giving enough time to both clusterVerifier and test method, the one can pass which previous fails consistently.

The main time it spends are CustomizedStateConfig in the verifier's first stage of _dataProvider.refresh(). It seem the VM in github CPU is slow or CPU cycle is limited.

See the log here: (file logs 259) close to 9 minutes.

2020-08-15T01:54:42.2471239Z START testEnableCompressionResource at Sat Aug 15 01:54:42 UTC 2020 2020-08-15T01:54:42.2475315Z START TestEnableCompression_testEnableCompressionResource at Sat Aug 15 01:54:42 UTC 2020 ...... 020-08-15T02:02:19.7213784Z ---- Enter loop for resource TestResource 2020-08-15T02:02:20.5237517Z compressed paths:[/TestEnableCompression_testEnableCompressionResource/EXTERNALVIEW/TestResource, /TestEnableCompression_testEnableCompressionResource/IDEALSTATES/TestResource] 2020-08-15T02:02:20.6172380Z 4858699 [HelixController-pipeline-default-TestEnableCompression_testEnableCompressionResource-(e3145653_DEFAULT)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: controller_0 is not leader for TestEnableCompression_testEnableCompressionResource. Pipeline will not be invoked 2020-08-15T02:02:20.6173128Z 4858700 [HelixController-pipeline-task-TestEnableCompression_testEnableCompressionResource-(e3145653_TASK)] ERROR org.apache.helix.controller.GenericHelixController - Cluster manager: controller_0 is not leader for TestEnableCompression_testEnableCompressionResource. Pipeline will not be invoked 2020-08-15T02:02:20.6853124Z END TestEnableCompression_testEnableCompressionResource at Sat Aug 15 02:02:20 UTC 2020 2020-08-15T02:02:20.6856135Z END testEnableCompressionResource at Sat Aug 15 02:02:20 UTC 2020, took: 458439ms. 2020-08-15T02:02:20.6856320Z AfterClass:TestEnableCompression afterclass of ZkTestBase called! 2020-08-15T02:02:21.6858577Z TestEnableCompression has active threads cnt:739

kaisun2000 commented 4 years ago

Maybe need to review is CustomerizedStateConfig is essential

kaisun2000 commented 4 years ago

Fix in #1227; in sum, three fixes:

kaisun2000 commented 4 years ago

There is still some chance that this test may not complete in 15 minutes.

2020-08-21T06:13:22.1505513Z 7283789 [TestNGInvoker-testEnableCompressionResource()] ERROR org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier - verifier timeout out with timeout 900000,. stack trace [java.lang.Thread.getStackTrace(Thread.java:1559), org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:208), org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByPolling(ZkHelixClusterVerifier.java:229), org.apache.helix.integration.TestEnableCompression.testEnableCompressionResource(TestEnableCompression.java:119), sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method), sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62), sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), java.lang.reflect.Method.invoke(Method.java:498), org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:76), org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:46), org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:37), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511), java.util.concurrent.FutureTask.run(FutureTask.java:266), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:748)]

jiajunwang commented 3 years ago

Close test unstable tickets since we have an automatic tracking mechanism https://github.com/apache/helix/pull/1757 now for tracking the most recent test issues.