apache / helix

Mirror of Apache Helix
Apache License 2.0
465 stars 226 forks source link

Fix flaky test TestResourceChangeDetector #1281

Closed kaisun2000 closed 3 years ago

kaisun2000 commented 4 years ago

LOG:

2020-08-14T10:56:36.2278432Z [ERROR] TestResourceChangeDetector.testResetSnapshots:449 expected:<0> but was:<1>

2020-08-14T10:56:35.8572321Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 0.107 s <<< FAILURE! 2020-08-14T10:56:35.8573105Z java.lang.AssertionError: expected:<0> but was:<1> 2020-08-14T10:56:35.8580562Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:449) 2020-08-14T10:56:35.8580884Z

kaisun2000 commented 4 years ago

Another problem is not close StrictMatchExternalViewVerifier, leak resource.

THis one needs further examination.

kaisun2000 commented 4 years ago

Failed at least 3 times. Need further examination.

kaisun2000 commented 4 years ago

Added more log in StrictMatchExternalViewVerifier, trigger callback. Change waiting time to 15 mins , same as timeout for test.

kaisun2000 commented 4 years ago

1227

kaisun2000 commented 4 years ago

2020-08-21T06:26:35.4853497Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 900.002 s <<< FAILURE! 2020-08-21T06:26:35.4866531Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testResetSnapshots() didn't finish within the time-out 900000 2020-08-21T06:26:35.4878804Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:434)

Error:

2020-08-21T04:49:04.9996955Z AfterClass: TestResourceChangeDetector of TestResourceChangeDetector called. 2020-08-21T04:49:05.0036922Z 2226637 [TestNGInvoker-testResetSnapshots()] ERROR org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier - Exception java.lang.InterruptedException in verifier [java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1041), java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1332), java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277), org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier.verifyByCallback(ZkHelixClusterVerifier.java:249), org.apache.helix.tools.ClusterVerifiers.StrictMatchExternalViewVerifier.verifyByZkCallback(StrictMatchExternalViewVerifier.java:200), org.apache.helix.tools.ClusterVerifiers.StrictMatchExternalViewVerifier.verify(StrictMatchExternalViewVerifier.java:179), org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:434), sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method), sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62), sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), java.lang.reflect.Method.invoke(Method.java:498), org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:76), org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:46), org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:37), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511), java.util.concurrent.FutureTask.run(FutureTask.java:266), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:748)] 2020-08-21T04:49:05.0420974Z 2226679 [HelixController-pipeline-default-TestResourceChangeDetector-(9d444036_DEFAULT)] ERROR org.apache.helix.zookeeper.zkclient.callback.ZkAsyncCallbacks - Interrupted waiting for success 2020-08-21T04:49:05.0421621Z java.lang.InterruptedException 2020-08-21T04:49:05.0422019Z at java.lang.Object.wait(Native Method) 2020-08-21T04:49:05.0422239Z at java.lang.Object.wait(Object.java:502) 2020-08-21T04:49:05.0422373Z at org.apache.helix.zookeeper.zkclient.callback.ZkAsyncCallbacks$DefaultCallback.waitForSuccess(ZkAsyncCallbacks.java:220) 2020-08-21T04:49:05.0422511Z at org.apache.helix.manager.zk.ZkBaseDataAccessor.getStats(ZkBaseDataAccessor.java:1197) 2020-08-21T04:49:05.0422644Z at org.apache.helix.manager.zk.ZKHelixDataAccessor.getPropertyStats(ZKHelixDataAccessor.java:365) 2020-08-21T04:49:05.0422782Z at org.apache.helix.common.caches.AbstractDataCache.refreshProperties(AbstractDataCache.java:66) 2020-08-21T04:49:05.0422913Z at org.apache.helix.common.caches.PropertyCache.doRefreshWithSelectiveUpdate(PropertyCache.java:175) 2020-08-21T04:49:05.0423042Z at org.apache.helix.common.caches.PropertyCache.refresh(PropertyCache.java:158) 2020-08-21T04:49:05.0423181Z at org.apache.helix.controller.dataproviders.BaseControllerDataProvider.refreshIdealState(BaseControllerDataProvider.java:244) 2020-08-21T04:49:05.0423328Z at org.apache.helix.controller.dataproviders.BaseControllerDataProvider.doRefresh(BaseControllerDataProvider.java:329) 2020-08-21T04:49:05.0423684Z at org.apache.helix.controller.dataproviders.ResourceControllerDataProvider.refresh(ResourceControllerDataProvider.java:143) 2020-08-21T04:49:05.0423822Z at org.apache.helix.controller.stages.ReadClusterDataStage.process(ReadClusterDataStage.java:63) 2020-08-21T04:49:05.0424047Z at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:68) 2020-08-21T04:49:05.0424159Z at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:777) 2020-08-21T04:49:05.0424292Z at org.apache.helix.controller.GenericHelixController.access$500(GenericHelixController.java:128) 2020-08-21T04:49:05.0424428Z at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1407) 2020-08-21T04:49:26.1087469Z 2247744 [main] ERROR org.apache.helix.TestHelper - verifier time out, consider try longer timeout, stack trace[java.lang.Thread.getStackTrace(Thread.java:1559), org.apache.helix.TestHelper.verify(TestHelper.java:804), org.apache.helix.controller.changedetector.TestResourceChangeDetector.afterClass(TestResourceChangeDetector.java:124), sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method), sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62), sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), java.lang.reflect.Method.invoke(Method.java:498), org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:76), org.testng.internal.Invoker.invokeConfigurationMethod(Invoker.java:525), org.testng.internal.Invoker.invokeConfigurations(Invoker.java:202), org.testng.internal.Invoker.invokeConfigurations(Invoker.java:130), org.testng.internal.TestMethodWorker.invokeAfterClassMethods(TestMethodWorker.java:222), org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112), org.testng.TestRunner.runWorkers(TestRunner.java:1147), org.testng.TestRunner.privateRun(TestRunner.java:749), org.testng.TestRunner.run(TestRunner.java:600), org.testng.SuiteRunner.runTest(SuiteRunner.java:317), org.testng.SuiteRunner.runSequentially(SuiteRunner.java:312), org.testng.SuiteRunner.privateRun(SuiteRunner.java:274), org.testng.SuiteRunner.run(SuiteRunner.java:223), org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52), org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86), org.testng.TestNG.runSuitesSequentially(TestNG.java:1039), org.testng.TestNG.runSuitesLocally(TestNG.java:964), org.testng.TestNG.run(TestNG.java:900), org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:283), org.apache.maven.surefire.testng.TestNGXmlTestSuite.execute(TestNGXmlTestSuite.java:75), org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:120), org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384), org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345), org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126), org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)] 2020-08-21T04:49:26.1090128Z AfterClass:TestResourceChangeDetector afterclass of ZkTestBase called! 2020-08-21T04:49:27.1047418Z TestResourceChangeDetector has active threads cnt:207 2020-08-21T04:49:27.1146302Z zookeeper server threads has 4 thread

Solution: 1/ enlarge timeout or 2/ not use callback verifier.

kaisun2000 commented 4 years ago

LOG: (390)

2020-08-22T01:34:28.9553118Z START TestResourceChangeDetector testResetSnapshots at Sat Aug 22 01:34:28 UTC 2020 2020-08-22T01:54:28.9750235Z 2646766 [TestNGInvoker-testResetSnapshots()] ERROR org.apache.helix.tools.ClusterVerifiers.ZkHelixClusterVerifier - trigger verify timeout failing, with stack trace java.lang.Thread.getStackTrace(Thread.java:1559) 2020-08-22T01:54:28.9813525Z END TestResourceChangeDetector testResetSnapshots at Sat Aug 22 01:54:28 UTC 2020, took: 1200025ms. 2020-08-22T01:54:28.9814055Z AfterClass: TestResourceChangeDetector of TestResourceChangeDetector called.

kaisun2000 commented 4 years ago

trigger based callback handler may lose callback. Waiting 20 minutes not working.

Change to use poll based verifier.

kaisun2000 commented 4 years ago

one more time: log 907

2020-09-15T23:51:12.8993411Z [ERROR] Tests run: 1197, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 4,741.919 s <<< FAILURE! - in TestSuite 2020-09-15T23:51:12.8995084Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 300.038 s <<< FAILURE! 2020-09-15T23:51:12.8996657Z java.lang.AssertionError: expected:< true > but was:< false> 2020-09-15T23:51:12.8998840Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:438) 2020-09-15T23:51:12.9000831Z 2020-09-15T23:51:13.2757259Z [ERROR] Failures: 2020-09-15T23:51:13.2760957Z [ERROR] TestResourceChangeDetector.testResetSnapshots:438 expected: but was: 2020-09-15T23:51:13.2764093Z [ERROR] Tests run: 1197, Failures: 1, Errors: 0, Skipped: 1

kaisun2000 commented 4 years ago

another time: log 910

2020-09-16T00:02:24.5744481Z [ERROR] Tests run: 1197, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 5,053.729 s <<< FAILURE! - in TestSuite 2020-09-16T00:02:24.5756583Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 300.064 s <<< FAILURE! 2020-09-16T00:02:24.5762923Z java.lang.AssertionError: expected:< true > but was:< false > 2020-09-16T00:02:24.5774608Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:438) 2020-09-16T00:02:24.5778019Z 2020-09-16T00:02:24.9729877Z [ERROR] Failures: 2020-09-16T00:02:24.9732038Z [ERROR] TestResourceChangeDetector.testResetSnapshots:438 expected:< true > but was:< false > 2020-09-16T00:02:24.9735460Z [ERROR] Tests run: 1197, Failures: 1, Errors: 0, Skipped: 1 2020-09-16T00:02:24.9803925Z [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M3:test (default-test) on project helix-core: There are test failures. 2020-09-16T00:02:24.9816989Z [ERROR]

kaisun2000 commented 4 years ago

increase time out as now verify would timeout in 5 mintues by default.

kaisun2000 commented 4 years ago

log 930 first batch of waitTillVerify

2020-09-17T01:20:49.7767264Z [ERROR] TestResourceChangeDetector.testResetSnapshots:437 » ThreadTimeout Method org.t...

code

  public void testResetSnapshots() {
    // ensure the cluster converged before the test to ensure IS is not modified unexpectedly
    ZkHelixClusterVerifier _clusterVerifier =
        new StrictMatchExternalViewVerifier.Builder(CLUSTER_NAME).setZkAddress(ZK_ADDR)
            .setDeactivatedNodeAwareness(true)
            .setResources(new HashSet<>(_dataAccessor.getChildNames(_keyBuilder.idealStates())))
            .setWaitTillVerify(TestHelper.DEFAULT_REBALANCE_PROCESSING_WAIT_TIME)
            .build();
    try {
      boolean result = _clusterVerifier.verify(WAIT_TIME);   ----- this line.
      Assert.assertTrue(result);
    } finally {
      _clusterVerifier.close();
    }
kaisun2000 commented 4 years ago

2020-09-17T01:20:49.3991835Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 1,800.007 s <<< FAILURE! 2020-09-17T01:20:49.3995272Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testResetSnapshots() didn't finish within the time-out 1800000 2020-09-17T01:20:49.3999251Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:437) 2020-09-17T01:20:49.4001707Z 2020-09-17T01:20:49.4003183Z [ERROR]

kaisun2000 commented 4 years ago

LOG 969, touch 3

2020-09-17T09:51:57.3363039Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 1,800.015 s <<< FAILURE! 2020-09-17T09:51:57.3413951Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testResetSnapshots() didn't finish within the time-out 1800000 2020-09-17T09:51:57.3418048Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:437) 2020-09-17T09:51:57.3420607Z 2020-09-17T09:51:57.7154562Z [ERROR] Failures: 2020-09-17T09:51:57.7156522Z [ERROR] TestResourceChangeDetector.testResetSnapshots:437 » ThreadTimeout Method org.t...

kaisun2000 commented 4 years ago

LOG 973, touch 5

2020-09-17T09:51:57.3363039Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 1,800.015 s <<< FAILURE! 2020-09-17T09:51:57.3413951Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testResetSnapshots() didn't finish within the time-out 1800000 2020-09-17T09:51:57.3418048Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:437) 2020-09-17T09:51:57.3420607Z 2020-09-17T09:51:57.7154562Z [ERROR] Failures: 2020-09-17T09:51:57.7156522Z [ERROR] TestResourceChangeDetector.testResetSnapshots:437 » ThreadTimeout Method org.t...

kaisun2000 commented 4 years ago

LOG 1051, 09/18 touch 3

2020-09-18T11:14:08.7558906Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 1,800.014 s <<< FAILURE! 2020-09-18T11:14:08.7685856Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testResetSnapshots() didn't finish within the time-out 1800000 2020-09-18T11:14:08.7698996Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:437) 2020-09-18T11:14:08.7702496Z 2020-09-18T11:14:09.1585373Z [ERROR] Failures: 2020-09-18T11:14:09.1588655Z [ERROR] TestResourceChangeDetector.testResetSnapshots:437 » ThreadTimeout Method org.t... 2020-09-18T11:14:09.1590014Z [ERROR] Tests run: 1196, Failures: 1, Errors: 0, Skipped: 1

kaisun2000 commented 4 years ago

LOG 1061, 09/18 touch 9

2020-09-18T11:18:01.3541498Z [ERROR] TestResourceChangeDetector.testResetSnapshots:437 » ThreadTimeout Method org.t... 2020-09-18T11:18:01.3547252Z [ERROR] Tests run: 1196, Failures: 1, Errors: 0, Skipped: 1

kaisun2000 commented 4 years ago

LOG 1118. fix #1367

2020-09-21T04:56:36.3191026Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 1,800.026 s <<< FAILURE! 2020-09-21T04:56:36.3203620Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testResetSnapshots() didn't finish within the time-out 1800000 2020-09-21T04:56:36.3215445Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:437) 2020-09-21T04:56:36.3219010Z 2020-09-21T04:56:36.6976619Z [ERROR] Failures: 2020-09-21T04:56:36.6982617Z [ERROR] TestResourceChangeDetector.testResetSnapshots:437 » ThreadTimeout Method org.t... 2020-09-21T04:56:36.7005986Z [ERROR] Tests run: 1196, Failures: 1, Errors: 0, Skipped: 1 2020-09-21T04:56:36.7054647Z [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M3:test (default-test) on project helix-core: There are test failures.

kaisun2000 commented 4 years ago

LOG 1113 09/20 touch 8

2020-09-20T22:23:33.9435500Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 1,800.016 s <<< FAILURE! 2020-09-20T22:23:33.9447133Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testResetSnapshots() didn't finish within the time-out 1800000 2020-09-20T22:23:33.9466054Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:437) 2020-09-20T22:23:33.9468916Z 2020-09-20T22:23:34.3349605Z [ERROR] Failures: 2020-09-20T22:23:34.3355817Z [ERROR] TestResourceChangeDetector.testResetSnapshots:437 » ThreadTimeout Method org.t... 2020-09-20T22:23:34.3357545Z [ERROR] Tests run: 1196, Failures: 1, Errors: 0, Skipped: 1 2020-09-20T22:23:34.3438853Z [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M3:test (default-test) on project helix-core: There are test failures. 2020-09-20T22:23:34.3464788Z [ERROR]

kaisun2000 commented 4 years ago

LOG 1109, 09/20 touch 5

2020-09-20T22:19:50.9503891Z [ERROR] Tests run: 1196, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 6,338.181 s <<< FAILURE! - in TestSuite 2020-09-20T22:19:50.9506411Z [ERROR] testResetSnapshots(org.apache.helix.controller.changedetector.TestResourceChangeDetector) Time elapsed: 1,800.013 s <<< FAILURE! 2020-09-20T22:19:50.9511011Z org.testng.internal.thread.ThreadTimeoutException: Method org.testng.internal.TestNGMethod.testResetSnapshots() didn't finish within the time-out 1800000 2020-09-20T22:19:50.9514789Z at org.apache.helix.controller.changedetector.TestResourceChangeDetector.testResetSnapshots(TestResourceChangeDetector.java:440) 2020-09-20T22:19:50.9517051Z 2020-09-20T22:19:51.3291033Z [ERROR] Failures: 2020-09-20T22:19:51.3298448Z [ERROR] TestResourceChangeDetector.testResetSnapshots:440->Object.wait:-2 » ThreadTimeout 2020-09-20T22:19:51.3299552Z [ERROR] Tests run: 1196, Failures: 1, Errors: 0, Skipped: 1 2020-09-20T22:19:51.3372077Z [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M3:test (default-test) on project helix-core: There are test failures.

kaisun2000 commented 4 years ago

public void testIgnoreNonTopologyChanges()

delete full auto resource added.

jiajunwang commented 3 years ago

Close test unstable tickets since we have an automatic tracking mechanism https://github.com/apache/helix/pull/1757 now for tracking the most recent test issues.