apache / helix

Mirror of Apache Helix

fix flaky test TestGetLastScheduledTaskExecInfo.testGetLastScheduledTaskExecInfo #1279

Closed kaisun2000 closed 3 years ago

kaisun2000 commented 4 years ago

GitHub log:

2020-08-14T19:59:31.4007676Z [ERROR] TestGetLastScheduledTaskExecInfo.testGetLastScheduledTaskExecInfo:84->setupTasks:150 expected: but was:

2020-08-14T19:59:31.0089609Z [ERROR] testGetLastScheduledTaskExecInfo(org.apache.helix.task.TestGetLastScheduledTaskExecInfo) Time elapsed: 27.57 s <<< FAILURE!
2020-08-14T19:59:31.0093764Z java.lang.AssertionError: expected: but was:
2020-08-14T19:59:31.0100939Z   at org.apache.helix.task.TestGetLastScheduledTaskExecInfo.setupTasks(TestGetLastScheduledTaskExecInfo.java:150)
2020-08-14T19:59:31.0109245Z   at org.apache.helix.task.TestGetLastScheduledTaskExecInfo.testGetLastScheduledTaskExecInfo(TestGetLastScheduledTaskExecInfo.java:84)

kaisun2000 commented 4 years ago

    boolean haveExpectedNumberOfTasksScheduled = TestHelper.verify(() -> {
      int scheduleTask = 0;
      WorkflowConfig workflowConfig =
          TaskUtil.getWorkflowConfig(_manager.getHelixDataAccessor(), jobQueueName);
      for (String job : workflowConfig.getJobDag().getAllNodes()) {
        JobContext jobContext = _driver.getJobContext(job);
        Set<Integer> allPartitions = jobContext.getPartitionSet();
        for (Integer partition : allPartitions) {
          String timestamp = jobContext.getMapField(partition).get(TASK_START_TIME_KEY);
          if (timestamp != null) {
            scheduleTask++;
          }
        }
      }
      return (scheduleTask == expectedScheduledTasks);
    }, TestHelper.WAIT_DURATION);
    Assert.assertTrue(haveExpectedNumberOfTasksScheduled); // <-- the assertion fails here

In setupTasks, TestHelper.WAIT_DURATION is only 20 seconds, which is too short for this check; it needs to be increased.

Also, enhance the logging in verify: if the wait times out, log an error.

  public static boolean verify(Verifier verifier, long timeout) throws Exception {
    long start = System.currentTimeMillis();
    do {
      boolean result = verifier.verify();
      if (result || (System.currentTimeMillis() - start) > timeout) {
        return result;
      }
      Thread.sleep(50);
    } while (true);
  }
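
A minimal sketch of what the logging change could look like, not the actual Helix patch. It assumes java.util.logging so that it stays self-contained (Helix itself may use a different logging framework), and the class name VerifySketch, the nested Verifier interface, and the message text are hypothetical. The return behavior is the same as the verify shown above; the only addition is an error log when the wait times out.

```java
import java.util.logging.Logger;

// Hypothetical stand-in for TestHelper; only the polling logic is illustrated.
class VerifySketch {
  private static final Logger LOG = Logger.getLogger(VerifySketch.class.getName());

  // Assumed shape of the Verifier callback: a condition that may throw.
  @FunctionalInterface
  public interface Verifier {
    boolean verify() throws Exception;
  }

  // Polls the verifier every 50 ms until it returns true or timeoutMs elapses.
  // Returns true on success and false on timeout, logging an error on timeout.
  public static boolean verify(Verifier verifier, long timeoutMs) throws Exception {
    long start = System.currentTimeMillis();
    while (true) {
      if (verifier.verify()) {
        return true;
      }
      long elapsed = System.currentTimeMillis() - start;
      if (elapsed > timeoutMs) {
        // New compared to the original: make the timeout visible in the test log.
        LOG.severe("verify() timed out after " + elapsed + " ms (timeout was "
            + timeoutMs + " ms)");
        return false;
      }
      Thread.sleep(50);
    }
  }
}
```

With this in place, the caller in setupTasks could pass a longer timeout than TestHelper.WAIT_DURATION (the exact value is a judgment call), and a timeout would leave an explicit error line in the log instead of only the bare assertion failure.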

kaisun2000 commented 4 years ago

Same root cause as #1277.

jiajunwang commented 3 years ago

Closing the unstable-test tickets, since we now have an automatic mechanism (https://github.com/apache/helix/pull/1757) for tracking the most recent test issues.