apache / helix

Mirror of Apache Helix
Apache License 2.0
461 stars 224 forks source link

Flaky test: TestCleanupExternalView.test #1661

Closed huizhilu closed 3 years ago

huizhilu commented 3 years ago

Problem

The test is flaky. It may be due to thread leakage.

[ERROR] test(org.apache.helix.integration.TestCleanupExternalView)  Time elapsed: 60.628 s  <<< FAILURE!
java.lang.AssertionError: external-view for TestDB0 should be removed, but was: ZnRecord=TestDB0, {BUCKET_SIZE=0, IDEAL_STATE_MODE=AUTO, NUM_PARTITIONS=2, REBALANCE_MODE=SEMI_AUTO, REBALANCE_STRATEGY=DEFAULT, REPLICAS=2, STATE_MODEL_DEF_REF=MasterSlave, STATE_MODEL_FACTORY_NAME=DEFAULT}{TestDB0_0={localhost_12918=SLAVE, localhost_12919=MASTER}, TestDB0_1={localhost_12918=MASTER, localhost_12919=SLAVE}}{}, Stat=Stat {_version=5, _creationTime=1614716622978, _modifiedTime=1614716623193, _ephemeralOwner=0} expected:<true> but was:<false>
        at org.apache.helix.integration.TestCleanupExternalView.test(TestCleanupExternalView.java:118)

[INFO]
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR]   TestCleanupExternalView.test:118 external-view for TestDB0 should be removed, but was: ZnRecord=TestDB0, {BUCKET_SIZE=0, IDEAL_STATE_MODE=AUTO, NUM_PARTITIONS=2, REBALANCE_MODE=SEMI_AUTO, REBALANCE_STRATEGY=DEFAULT, REPLICAS=2, STATE_MODEL_DEF_REF=MasterSlave, STATE_MODEL_FACTORY_NAME=DEFAULT}{TestDB0_0={localhost_12918=SLAVE, localhost_12919=MASTER}, TestDB0_1={localhost_12918=MASTER, localhost_12919=SLAVE}}{}, Stat=Stat {_version=5, _creationTime=1614716622978, _modifiedTime=1614716623193, _ephemeralOwner=0} expected:<true> but was:<false>

BeforeClass: TestCleanupExternalView called.
START TestCleanupExternalView test at Tue Mar 02 12:23:42 PST 2021
START TestCleanupExternalView_test at Tue Mar 02 12:23:42 PST 2021
pending zk-events in queue: []
3452624 [TestNGInvoker-test()] ERROR org.apache.helix.TestHelper  - verifier time out, consider try longer timeout, stack trace[java.lang.Thread.getStackTrace(Thread.java:1556), org.apache.helix.TestHelper.verify(TestHelper.java:813), org.apache.helix.integration.TestCleanupExternalView.test(TestCleanupExternalView.java:113), sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method), sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62), sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43), java.lang.reflect.Method.invoke(Method.java:498), org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:76), org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:46), org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:37), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511), java.util.concurrent.FutureTask.run(FutureTask.java:266), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617), java.lang.Thread.run(Thread.java:745)]
END TestCleanupExternalView test at Tue Mar 02 12:24:43 PST 2021, took: 60627ms.
AfterClass: TestCleanupExternalView called.
************ SYSTEM Physical Memory:67386290176
************ total memory:3954 free memory:3148
---------- Test Class TestCleanupExternalView thread leakage detected! ---------------
jiajunwang commented 3 years ago

The test logic is not very up-to-date. IMO, the original test was trying to test if EV is removed when all CSs are gone. However, the current logic does not really care about CS existence. It only remove EV under two conditions,

  1. isExternalViewDisabled == true
  2. The resource has been dropped

So we should change the test logic accordingly. Though, according to the code logic, the test shall still pass since we drop the resource first. There might be a race condition that brings the EV back. Need more investigation on condition 2.

jiajunwang commented 3 years ago

Ok, the theory is that the dedup queue for the Async Stages skip some of the EV update. This happens after we greatly boost pipeline speed. So it is possible that during the 1st EV update event is processing, the 2nd EV update event is still in the dedup queue. Then we started to remove the resource. The corresponding EV update event (3rd) is deduped. Same situation might happen for the other EV update events.

Given the cached data in the ClusterEvent object won't be refreshed when the EV update event is picked up, I don't think we should use dedup queue for this specific Stage at least.

jiajunwang commented 3 years ago

As discussed, we will change or disable the test for now to workaround this issue.

jiajunwang commented 3 years ago

Ok, the theory is that the dedup queue for the Async Stages skip some of the EV update. This happens after we greatly boost pipeline speed. So it is possible that during the 1st EV update event is processing, the 2nd EV update event is still in the dedup queue. Then we started to remove the resource. The corresponding EV update event (3rd) is deduped. Same situation might happen for the other EV update events.

Given the cached data in the ClusterEvent object won't be refreshed when the EV update event is picked up, I don't think we should use dedup queue for this specific Stage at least.

Since the dedup queue actually dedups the prior events, there is no concern that the newer event is ignored. And since all Helix logic assume that a later operation can resolve and catch up data gaps between different notifications, so there is no real logic problem here.