apache / helix

Mirror of Apache Helix
Apache License 2.0
465 stars 226 forks source link

Fix flaky test TestClusterInMaintenanceModeWhenReachingMaxPartition.testDisableCluster #1282

Closed kaisun2000 closed 3 years ago

kaisun2000 commented 4 years ago

LOG:

2020-08-14T23:50:43.0367342Z [ERROR] TestClusterInMaintenanceModeWhenReachingMaxPartition.testDisableCluster:119 expected: but was:

2020-08-14T23:50:42.7833847Z [ERROR] testDisableCluster(org.apache.helix.integration.rebalancer.TestClusterInMaintenanceModeWhenReachingMaxPartition) Time elapsed: 5.459 s <<< FAILURE! 2020-08-14T23:50:42.7833998Z java.lang.AssertionError: expected: but was: 2020-08-14T23:50:42.7834145Z at org.apache.helix.integration.rebalancer.TestClusterInMaintenanceModeWhenReachingMaxPartition.testDisableCluster(TestClusterInMaintenanceModeWhenReachingMaxPartition.java:119) 2020-08-14T23:50:42.7834251Z

Pure waiting, bad test? How to enhance?

  @Test
  public void testDisableCluster() throws Exception {
    ConfigAccessor configAccessor = new ConfigAccessor(_gZkClient);
    ClusterConfig clusterConfig = configAccessor.getClusterConfig(CLUSTER_NAME);
    clusterConfig.setMaxPartitionsPerInstance(10);
    configAccessor.setClusterConfig(CLUSTER_NAME, clusterConfig);

    int i = 0;
    for (String stateModel : TestStateModels) {
      String db = "Test-DB-" + i++;
      int replica = 3;
      createResourceWithDelayedRebalance(CLUSTER_NAME, db, stateModel, _PARTITIONS, replica,
          replica, -1);
      _testDBs.add(db);
    }
    Thread.sleep(100L);
    Assert.assertTrue(_clusterVerifier.verifyByPolling());

    MaintenanceSignal maintenanceSignal =
        _dataAccessor.getProperty(_dataAccessor.keyBuilder().maintenance());
    Assert.assertNull(maintenanceSignal);

    for (i = 2; i < NUM_NODE; i++) {
      _participants.get(i).syncStop();
    }

    Thread.sleep(1000L);
    maintenanceSignal = _dataAccessor.getProperty(_dataAccessor.keyBuilder().maintenance());
    Assert.assertNotNull(maintenanceSignal);
    Assert.assertNotNull(maintenanceSignal.getReason());
  }
kaisun2000 commented 4 years ago

Change to this should work.

    boolean result = TestHelper.verify(() -> {
      MaintenanceSignal ms = _dataAccessor.getProperty(_dataAccessor.keyBuilder().maintenance());
      return ms != null && ms.getReason() != null;
    }, TestHelper.WAIT_DURATION);

Also fixed similar patter in TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit

kaisun2000 commented 4 years ago

Still have

2020-08-16T06:11:19.8375558Z [ERROR] TestClusterInMaintenanceModeWhenReachingOfflineInstancesLimit.testWithOfflineInstancesLimit:164 expected: but was:

with previous test adding

    ZkHelixClusterVerifier clusterVerifier =
        new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient).build();
    Thread.sleep(100);
    Assert.assertTrue(clusterVerifier.verifyByPolling());

  @Test(dependsOnMethods = "testWithDisabledInstancesLimit")
  public void testWithOfflineInstancesLimit() throws Exception {
    MaintenanceSignal maintenanceSignal =
        _dataAccessor.getProperty(_dataAccessor.keyBuilder().maintenance());
    Assert.assertNull(maintenanceSignal);  ---> failure

Change to Thread.sleep from 200 to 2000. Consider add a default sleep period in ZkHelixClusterVerifier.verifyByPolling().

kaisun2000 commented 4 years ago

Note, BestPossibleExternalViewVerifier depends on enablePersistBestPossibleAssignment, right? Confirm.

jiajunwang commented 3 years ago

Close test unstable tickets since we have an automatic tracking mechanism https://github.com/apache/helix/pull/1757 now for tracking the most recent test issues.