Closed huizhilu closed 3 years ago
The test logic is out of date. IMO, the original test was checking whether the EV is removed when all CSs are gone. However, the current logic does not really care about CS existence. It only removes the EV under two conditions,
So we should change the test logic accordingly. That said, according to the code logic, the test should still pass, since we drop the resource first. There might be a race condition that brings the EV back. Condition 2 needs more investigation.
OK, the theory is that the dedup queue for the async stages skips some of the EV updates. This started happening after we greatly boosted the pipeline speed. So it is possible that while the 1st EV update event is being processed, the 2nd EV update event is still in the dedup queue. Then we start removing the resource, and the corresponding (3rd) EV update event is deduped. The same situation might happen for the other EV update events.
Given that the cached data in the ClusterEvent object won't be refreshed when the EV update event is picked up, I don't think we should use a dedup queue, at least for this specific stage.
As discussed, we will change or disable the test for now to work around this issue.
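To make the theory above concrete, here is a minimal sketch of the suspected failure mode. It is not Helix code: `DropNewerDedupQueue`, the `EV_UPDATE` key, and the string snapshots are hypothetical stand-ins, and it assumes a queue that discards a newly arriving event when one with the same key is already pending, with each event carrying data cached at enqueue time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Helix implementation.
class DropNewerDedupQueue {
    // At most one pending snapshot per event key, kept in arrival order.
    private final Map<String, String> pending = new LinkedHashMap<>();

    // Returns false when an event with the same key is already pending;
    // in that case the new snapshot is discarded (the assumed behavior).
    boolean offer(String key, String snapshot) {
        return pending.putIfAbsent(key, snapshot) == null;
    }

    // Removes and returns the oldest pending snapshot, or null if empty.
    String poll() {
        Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator();
        if (!it.hasNext()) {
            return null;
        }
        String snapshot = it.next().getValue();
        it.remove();
        return snapshot;
    }

    // Replays the sequence described in the comment above.
    static List<String> simulate() {
        DropNewerDedupQueue queue = new DropNewerDedupQueue();
        List<String> processed = new ArrayList<>();
        queue.offer("EV_UPDATE", "resource present"); // 1st event
        processed.add(queue.poll());                  // 1st event is being processed
        queue.offer("EV_UPDATE", "resource present"); // 2nd event waits in the queue
        queue.offer("EV_UPDATE", "resource dropped"); // 3rd event is deduped away
        for (String s = queue.poll(); s != null; s = queue.poll()) {
            processed.add(s);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

Under these assumptions the stage never sees the "resource dropped" snapshot, which would leave the EV in place.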
> OK, the theory is that the dedup queue for the async stages skips some of the EV updates. This started happening after we greatly boosted the pipeline speed. So it is possible that while the 1st EV update event is being processed, the 2nd EV update event is still in the dedup queue. Then we start removing the resource, and the corresponding (3rd) EV update event is deduped. The same situation might happen for the other EV update events.
> Given that the cached data in the ClusterEvent object won't be refreshed when the EV update event is picked up, I don't think we should use a dedup queue, at least for this specific stage.
Since the dedup queue actually dedups the prior events, there is no concern that the newer event is ignored. And since all Helix logic assumes that a later operation can resolve and catch up on data gaps between different notifications, there is no real logic problem here.
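The "dedups the prior events" behavior can be sketched as a latest-wins queue. Again, this is only an illustration under stated assumptions, not the Helix class: `LatestWinsDedupQueue`, the `EV_UPDATE` key, and the string snapshots are made up for the example. It relies on the fact that `LinkedHashMap.put` on an existing key replaces the value while keeping the entry's original position.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Helix implementation.
class LatestWinsDedupQueue {
    // At most one pending snapshot per event key, kept in arrival order.
    private final Map<String, String> pending = new LinkedHashMap<>();

    // A newer event overwrites the pending one with the same key,
    // so the older (prior) event is the one that gets deduped.
    void offer(String key, String snapshot) {
        pending.put(key, snapshot);
    }

    // Removes and returns the oldest pending entry's snapshot, or null if empty.
    String poll() {
        Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator();
        if (!it.hasNext()) {
            return null;
        }
        String snapshot = it.next().getValue();
        it.remove();
        return snapshot;
    }

    // Same event sequence as in the theory: three EV updates, the last after the drop.
    static List<String> simulate() {
        LatestWinsDedupQueue queue = new LatestWinsDedupQueue();
        List<String> processed = new ArrayList<>();
        queue.offer("EV_UPDATE", "resource present"); // 1st event
        processed.add(queue.poll());                  // 1st event is being processed
        queue.offer("EV_UPDATE", "resource present"); // 2nd event waits in the queue
        queue.offer("EV_UPDATE", "resource dropped"); // 3rd event replaces the 2nd
        for (String s = queue.poll(); s != null; s = queue.poll()) {
            processed.add(s);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

With this policy the last processed snapshot always reflects the most recent notification, which is why a later operation can catch up on anything an earlier one missed.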
Problem
The test is flaky. It may be due to thread leakage.