getsentry / sentry-cocoa

The official Sentry SDK for iOS, tvOS, macOS, watchOS.
https://sentry.io/for/cocoa/
MIT License

Intelligent flakiness mitigation #3897

Open armcknight opened 5 months ago

armcknight commented 5 months ago

Description

We've had a persistent problem with flaky tests for a long time.

One process we've tried is to disable a flaky test with a PR and open an issue to fix it. This has not eradicated the problem, and it doesn't scale well.

One automation we've attempted in the past is to allow up to a certain number of retries of any test invocation that is known to contain flakes. It was recently removed for our saucelabs runs in https://github.com/getsentry/sentry-cocoa/pull/3660 but is still in use for our UI test ASAN runs: https://github.com/getsentry/sentry-cocoa/blob/e0904ef6b8b510e9739820aa29379b074b26e237/.github/workflows/ui-tests.yml#L96
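For reference, that kind of bounded retry could be wrapped in a small script like the sketch below. The command, scheme, destination, and attempt count are placeholders, not the actual workflow configuration:

```python
import subprocess
import sys

# Hypothetical values; the real workflow configures these differently.
MAX_ATTEMPTS = 3
TEST_COMMAND = [
    "xcodebuild", "test",
    "-scheme", "Sentry",
    "-destination", "platform=iOS Simulator,name=iPhone 15",
]

def run_tests_with_retries() -> int:
    """Re-run the whole test invocation until it passes or attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        print(f"Test attempt {attempt}/{MAX_ATTEMPTS}")
        result = subprocess.run(TEST_COMMAND)
        if result.returncode == 0:
            return 0
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_tests_with_retries())
```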

However, our unit and UI test runs basically always contain at least one flake. This requires each PR author to check the test run, read what failed, decide whether it's a real failure related to their changes, initiate a retry of the failed tests from the web UI, and then wait for that round to complete. This process adds hours to each PR.

I propose automating that manual process the same way we've done it elsewhere, by using the retry strategy with https://github.com/getsentry/sentry-cocoa/pull/3881 and https://github.com/getsentry/sentry-cocoa/pull/3883.

Then, we should write some automation that extracts the name of any test that failed in at least one attempt but eventually passed (i.e., where more than one test run was actually required) and reports it in a PR comment. That way we can more easily perform our current flaky-test process of disabling the test and filing an issue to fix it, instead of having to dive into the GHA logs for each one.
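As a rough sketch of that extraction step, assuming each retry attempt writes a JUnit-style XML report (the file layout and comment format here are illustrative, not an existing script in this repo):

```python
import glob
import xml.etree.ElementTree as ET

def outcomes_per_attempt(report_paths):
    """Map test identifier -> list of fail/pass outcomes across attempts."""
    outcomes = {}
    for path in report_paths:
        for case in ET.parse(path).getroot().iter("testcase"):
            name = f'{case.get("classname")}.{case.get("name")}'
            failed = case.find("failure") is not None or case.find("error") is not None
            outcomes.setdefault(name, []).append(failed)
    return outcomes

def flaky_tests(outcomes):
    """Flaky = failed in at least one attempt but not in all of them."""
    return sorted(
        name for name, fails in outcomes.items()
        if any(fails) and not all(fails)
    )

if __name__ == "__main__":
    # Hypothetical layout: one JUnit report per retry attempt.
    reports = sorted(glob.glob("test-results/attempt-*/junit.xml"))
    flaky = flaky_tests(outcomes_per_attempt(reports))
    if flaky:
        # Markdown body for a PR comment listing the flaky tests.
        print("The following tests only passed after a retry:\n")
        for name in flaky:
            print(f"- `{name}`")
```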

Even that could potentially be automated, since we can open GitHub issues via their API, and skipping a test is just inserting a line of text into the .xcscheme file. We could also pipe the event to Slack so we are aware of it happening there, too.
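A sketch of what those two steps could look like, assuming the GitHub REST API for the issue and an ElementTree edit of the scheme's SkippedTests list; the scheme path and test identifier below are placeholders:

```python
import json
import os
import urllib.request
import xml.etree.ElementTree as ET

REPO = "getsentry/sentry-cocoa"

def open_flaky_test_issue(test_identifier: str) -> None:
    """File a tracking issue for a flaky test via the GitHub REST API."""
    request = urllib.request.Request(
        f"https://api.github.com/repos/{REPO}/issues",
        data=json.dumps({
            "title": f"Flaky test: {test_identifier}",
            "body": "This test only passed after a retry and has been skipped.",
        }).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(request)

def skip_test_in_scheme(scheme_path: str, test_identifier: str) -> None:
    """Add the test to the scheme's SkippedTests list so it no longer runs."""
    tree = ET.parse(scheme_path)
    testable = tree.getroot().find(".//TestableReference")
    skipped = testable.find("SkippedTests")
    if skipped is None:
        skipped = ET.SubElement(testable, "SkippedTests")
    ET.SubElement(skipped, "Test", {"Identifier": test_identifier})
    tree.write(scheme_path, xml_declaration=True, encoding="UTF-8")

if __name__ == "__main__":
    # Hypothetical identifier and scheme path; real values would come from the
    # flaky-test report produced by the retry automation above.
    flaky = "SomeFlakyTests/testSomething()"
    open_flaky_test_issue(flaky)
    skip_test_in_scheme("Sentry.xcodeproj/xcshareddata/xcschemes/Sentry.xcscheme", flaky)
```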

kahest commented 4 months ago

Thanks @armcknight, I agree with a lot here. A few thoughts/opinions:

armcknight commented 3 months ago

I like the idea of sending these to sentry.sentry.io so we can get Slack alerts; that would be nicer than rebuilding a Slack alerter from scratch in the test scripts.
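Something like this sketch, assuming we use the Python SDK in the test scripts (the DSN variable and tag names are made up):

```python
import os
import sentry_sdk

# Hypothetical env var holding the DSN of an internal project on sentry.sentry.io.
sentry_sdk.init(dsn=os.environ["FLAKY_TEST_MONITORING_DSN"])

def report_flaky_test(test_identifier: str, attempts: int) -> None:
    """Send one event per flaky test so an alert rule can route it to Slack."""
    sentry_sdk.set_tag("test.identifier", test_identifier)
    sentry_sdk.set_tag("test.attempts", str(attempts))
    sentry_sdk.capture_message(
        f"Flaky test detected: {test_identifier}", level="warning"
    )
```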

Let me know how I can help if you have more plans for CI/etc this quarter!