Fix `annotate_test_failures` for flaky retries which ultimately succeed

AliSoftware commented 1 year ago

What?

Fixes the annotate_test_failures command so that it ignores flaky retries which ultimately succeeded.

This bug caused some CI builds to still get an annotation mentioning test failures… even when the corresponding CI step ended up green (see this example build).

(cc @mokagio & @crazytonyli as I talked about this bug very recently with both of you, in P2s and PR comments respectively)

Why?

When a flaky test fails and Xcode is configured to auto-retry tests multiple times on failure, all the intermediate failures are recorded in the report.junit alongside the final success (if any). For example, this is an extract of such a .junit report:

    <testcase classname='WordPressTest.TenorDataSourceTests' name='testDataSourceReceivesRequestedCount'>
      <failure message='Asynchronous wait failed: Exceeded timeout of 2 seconds, with unfulfilled expectations: &quot;Waiting&quot;.'>&lt;unknown&gt;:0</failure>
    </testcase>
    <testcase classname='WordPressTest.TenorDataSourceTests' name='testDataSourceReceivesRequestedCount'>
      <failure message='XCTAssertEqual failed: (&quot;Optional(0)&quot;) is not equal to (&quot;Optional(3)&quot;)'>WordPress/WordPressTest/MediaPicker/Tenor/TenorDataSouceTests.swift:37</failure>
    </testcase>
    <testcase classname='WordPressTest.TenorDataSourceTests' name='testDataSourceReceivesRequestedCount' time='1.042'/>

The previous version of our script was not smart enough to detect that: it simply extracted all testcase nodes that had a failure subnode and built the annotation from that list of failures, which is why it erroneously included the flaky failures.
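
To illustrate the problem, here is a minimal sketch of that naive extraction, assuming the report is parsed with Ruby's REXML. The `<testsuite>` wrapper, the shortened failure messages, and the `naive_failures` name are illustrative assumptions, not the actual script:

```ruby
require 'rexml/document'

# Trimmed-down copy of the extract above, wrapped in a hypothetical <testsuite> root.
FLAKY_REPORT = <<~XML
  <testsuite name='WordPressTest'>
    <testcase classname='WordPressTest.TenorDataSourceTests' name='testDataSourceReceivesRequestedCount'>
      <failure message='Asynchronous wait failed'>&lt;unknown&gt;:0</failure>
    </testcase>
    <testcase classname='WordPressTest.TenorDataSourceTests' name='testDataSourceReceivesRequestedCount'>
      <failure message='XCTAssertEqual failed'>TenorDataSouceTests.swift:37</failure>
    </testcase>
    <testcase classname='WordPressTest.TenorDataSourceTests' name='testDataSourceReceivesRequestedCount' time='1.042'/>
  </testsuite>
XML

doc = REXML::Document.new(FLAKY_REPORT)

# Naive approach: every <testcase> with a <failure> child counts as a failure,
# regardless of whether a later retry of the same test passed.
naive_failures = REXML::XPath.match(doc, '//testcase[failure]')

puts "Naive failure count: #{naive_failures.size}"
```

Both failed attempts are picked up here even though the third run of the same test passed.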

How?

To solve this, while iterating over the testcase[failure] node candidates, we now find all the sibling nodes of that testcase that have the same classname and name attributes, i.e. all the nodes reported for the same test. This list thus includes the current candidate being iterated on, but also potentially all other retries of the same test.

Once we have that list of nodes, we check whether the last one of those nodes is a failure or a success XML node:

- If it ended up in success, that means this was a flaky test that ultimately ended up being marked green by Xcode and was considered passing (even if it failed first and was retried)
- If instead that last instance ended up with a failure, that means the test never succeeded even during potential subsequent retries, so that's a true failure that needs reporting
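
The check described above could be sketched roughly as follows, again assuming REXML. The `classify` helper, the sample report, and the test names in it are hypothetical, made up for illustration, and not the code from this PR:

```ruby
require 'rexml/document'

# Hypothetical report: one flaky test, one test failing on every retry, one passing test.
SAMPLE_REPORT = <<~XML
  <testsuite name='Demo'>
    <testcase classname='DemoTests' name='testFlaky'><failure message='boom'>Demo.swift:10</failure></testcase>
    <testcase classname='DemoTests' name='testFlaky' time='0.1'/>
    <testcase classname='DemoTests' name='testAlwaysFails'><failure message='nope'>Demo.swift:20</failure></testcase>
    <testcase classname='DemoTests' name='testAlwaysFails'><failure message='nope'>Demo.swift:20</failure></testcase>
    <testcase classname='DemoTests' name='testPasses' time='0.2'/>
  </testsuite>
XML

# Splits tests that reported at least one failure into true failures and flaky tests.
def classify(doc)
  failed = []
  flaky = []
  REXML::XPath.match(doc, '//testcase[failure]').each do |node|
    key = [node.attributes['classname'], node.attributes['name']]
    # All runs of the same test: sibling <testcase> nodes sharing classname and name.
    runs = REXML::XPath.match(node.parent, 'testcase').select do |sibling|
      [sibling.attributes['classname'], sibling.attributes['name']] == key
    end
    # The last run decides: no <failure> child means a retry ultimately passed.
    bucket = runs.last.elements['failure'].nil? ? flaky : failed
    bucket << key unless bucket.include?(key)
  end
  { failed: failed, flaky: flaky }
end

result = classify(REXML::Document.new(SAMPLE_REPORT))
puts "Failed: #{result[:failed].inspect}"
puts "Flaky:  #{result[:flaky].inspect}"
```

This sketch relies on the report listing retries in execution order, so that the last matching node reflects the final outcome.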

Testing

WPiOS Demo build with ❌ failed + ⚠️ flaky tests

- Check this Buildkite build, which corresponds to the build of the test branch I created in WPiOS specifically to create flaky tests and point to this branch of the plugin
- Confirm that the build is red / marked as failed
- Confirm that both the error and warning annotations are present under the "All Jobs" tab, and that both look ok
  - Note: if Buildkite shows you the "Issues" tab instead of the "All Jobs" tab by default, that will only show you the error annotation; be sure to check the "All Jobs" tab to see the warning annotation too.

WPiOS Demo build with ⚠️ only flaky tests

- Check this Buildkite build, which disables all failing tests and keeps only the flaky ones
- Confirm that the build is green / passed
- Confirm that there is no error annotation but only a warning annotation, and that it looks ok

WPiOS Demo build with ❌ only failed tests

- Check this Buildkite build, which disables all flaky tests and keeps only the failing ones
- Confirm that the build is red / marked as failed
- Confirm that there is no warning annotation but only an error annotation, and that it looks ok

AliSoftware commented 1 year ago

@mokagio I spent some time (most of my afternoon, tbh 😅 ) testing things around and improved the changes significantly:

I created a test/failure-retry-junit-report branch on the WPiOS repo for testing purposes:
- Added a new "FlakyTestsDemo" Xcode Unit Test target
- Added a test plan configured with 3 retries
- Created unit tests specifically designed to cover the various cases (always fail, flaky but ultimately pass, skipped, …)
- Adjusted the fastlane lanes running Unit Tests to run that test plan instead
- Commented out most of the pipeline.yml and made it point to this fix/annotate_test_failures branch of the a8c-ci-toolkit-buildkite-plugin
- That allowed me to get a report.junit XML file that represented a typical report for the various cases you raised, which I then used as a base to test and improve my annotate_test_failures script here.

The improvements I made include:

- Created a dedicated TestFailure Ruby class to make failure nodes easier to handle
- Detect exact duplicate failures, differentiating between cases where a test reports multiple (distinct) assertion failures in a single run and cases where a test is retried and reports the same assertion failure (on the same file and line) multiple times (once for each failed retry). Duplicates are counted instead of being added again to the array of detected failures
- This allows us to build the list of distinct assertion failures and identify the test that each TestFailure is about, so we can distinguish the number of distinct failed tests, the number of distinct failed assertions (which can be different if a test had multiple assertion failures), and the number of times each failure was reported (for cases of retries)
- I'm now also tracking and reporting flaky tests in a dedicated warning annotation, in addition to reporting only true failures in the error annotation.
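
As a rough illustration of the duplicate detection and counting described above, here is a minimal sketch. This `TestFailure` is a hypothetical Struct whose attribute names (`classname`, `name`, `message`, `location`) are assumptions, not the actual class from this PR, and the sample failures are made up:

```ruby
# Hypothetical value object: two failures are equal when the test, message and location all match.
TestFailure = Struct.new(:classname, :name, :message, :location) do
  # Identifier of the test this failure belongs to.
  def test_id
    "#{classname}.#{name}"
  end
end

# Made-up failures, as they might be read from a report with retries.
reported = [
  TestFailure.new('DemoTests', 'testAlwaysFails', 'XCTAssertEqual failed', 'Demo.swift:20'),
  TestFailure.new('DemoTests', 'testAlwaysFails', 'XCTAssertEqual failed', 'Demo.swift:20'), # same failure on a retry
  TestFailure.new('DemoTests', 'testAlwaysFails', 'XCTAssertTrue failed', 'Demo.swift:25'),  # distinct assertion
  TestFailure.new('DemoTests', 'testOther', 'XCTFail', 'Demo.swift:40')
]

# Count exact duplicates instead of storing them again (Struct provides ==, eql? and hash).
counts = reported.each_with_object(Hash.new(0)) { |failure, acc| acc[failure] += 1 }

distinct_assertions = counts.size
distinct_tests = counts.keys.map(&:test_id).uniq.size

puts "#{distinct_tests} failed tests, #{distinct_assertions} distinct assertion failures"
counts.each { |failure, times| puts "#{failure.test_id}: #{failure.message} (reported #{times}x)" }
```

Using a value object as the hash key keeps the three numbers (failed tests, distinct assertions, occurrences per failure) derivable from a single structure.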

@mokagio Given all the changes I've made after all that intense testing and tweaking, I'm interested in your re-review! (Note: I've updated the testing instructions in the PR description)

mokagio commented 1 year ago

While commenting about reporting the return code from the test task rather than the annotation command, I realized that there is nothing platform-specific that forces us to run the annotation logic in the macOS agent.

We could move it to a cheaper agent with Ruby support, but:

1. The build time overhead of the annotation is negligible compared to the test run time
2. Having tests and annotation together is handy if we retry the tests. I'm not sure how it would work if we split them... Does Buildkite handle a retry step re-triggering steps that are downstream from it? 🤔 Not sure... and given point 1, I don't think it's worth finding out
