Would love to have the bar set higher, to a failure rate of no more than 10%. Could this be realistic given the improvements we are making? If not we should aim for this in the future.
How do we know that it's a test failure and not a product issue? First step should be to confirm it's an actual bug.
Suggest that broken tests are removed as blocking tests and moved to a quarantine suite (can we make the quarantine suite run but not block or fail? That would be better than a forced pass). We can also set a max size for the quarantine suite, so if it grows past 3 tests we need to pause and address them.
I think we should set an SLA for fixing both categories of tests, with P0 tests being addressed immediately.
Tests that are broken because of a code change (meaning the test needs to be updated) should also be addressed immediately, so we don't accrue test debt.
Can BuildKite store and show test history information? Can it automatically tag a test as flakey or broken? It should be easy enough to tell whether a test is flakey or broken by checking whether it's passed at least 4 of the last 5 runs (rough sketch below).
Also, I don't think the process needs an RFC, but whatever functionality is needed from CI to deal with this should be captured in the BuildKite RFC or a separate mini RFC.
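A minimal sketch of the "4 of the last 5 runs" heuristic mentioned above, assuming the recent pass/fail history for a test can be pulled from BuildKite (the function below is purely illustrative):

```python
def classify_failing_test(last_five_passed: list[bool]) -> str:
    """Heuristic from the comment above: a failing test that has passed at
    least 4 of its last 5 runs is likely flakey; otherwise treat it as broken."""
    return "flakey" if sum(last_five_passed) >= 4 else "broken"
```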
How do we know that it's a test failure and not a product issue? First step should be to confirm it's an actual bug.
I think this will be caught by the blocking test runs on pull requests, with the exception being if some commit was force-landed, which I've added a note about above.
Tests that are broken because of a code change (meaning the test needs to be updated) should also be addressed immediately, so we don't accrue test debt.
Can you clarify what sorts of errors you're referring to here? Aren't all tests broken because of a code change?
One thing I'm confused about with a 10% failure rate is: what counts as a completed job? Is it just commits to develop? Does it count jobs run specifically in scope for a PR? If it's just commits to develop, do we have a high enough commit rate to observe the failures reliably enough?
Should we ratchet down the failure rate over time?
One thing I'm confused about with a 10% failure rate is: what counts as a completed job? Is it just commits to develop? Does it count jobs run specifically in scope for a PR? If it's just commits to develop, do we have a high enough commit rate to observe the failures reliably enough?
Hmm, this is a good point. I think we must only count develop runs because (after the new CI anyway) a job will only run if there is a change to some file that affects it. Thus, I don't think we have a high enough commit rate to observe the failures reliably.
One alternative approach is to stage develop failures in a basket of "questionable jobs" and run those nightly 30x to determine if they are flaky or broken. What do people think of this?
Should we ratchet down the failure rate over time?
I think given the nature of the project and the sorts of tests we want to run (those with randomness), this will be a hard metric to optimize on.
Hmmm...nice proposal and interesting thoughts all around.
So automated retries, quarantines, and alerts for maintaining potentially flaky tests/jobs sound good and, in my experience, have proven to be solid strategies for identifying "flakes". I just started looking into the logging/monitoring infrastructure while ramping up, and this discussion on flaky tests got me wondering what the current thinking is around debugging/troubleshooting tooling (e.g. log & metric visualizers for dev/test environments, IDE debuggers).
Considering the somewhat mystical nature of "flakes" and the comments found in #4715, I'm curious how much of the attention should be on diagnosis (or rather, the lack of information/evidence/tooling to properly diagnose the issue) vs. detection. I think I get the point/goal here for the most part, though it seems worth going into more detail, or at least touching on the troubleshooting process a bit more and, ultimately, how these "flakes" become "flakes" in this proposal.
I mean, based on the friction point re: "...not having much information as to what a test is doing when the test fails..." + "...not having a good way to surface logs+metrics and collect them + logs are squished in a way that's hard to deal with", I wonder how much the thinking towards flakes would be impacted by a solution to this observability problem.
Better-organized logs and more evidence generally lead to quicker test failure analysis. Without stable, consistent, and hopefully comprehensive monitoring, it feels somewhat tough to make a call on what's a good model for "flakey", or to actually debug and expect resolution of flakes in a predetermined manner.
Actually, somewhat thrown off by the test vs. job distinction -- feel free to disregard ^^^ atm :grimacing:
I like the idea of repeatedly testing "questionable jobs"!
This was synthesized from a conversation with @Schmavery and @mrmr1993 (though I embellished, so feel free to comment if y'all disagree)
Problem
There are several jobs that fail intermittently right now and it's hard to tell if they are real failures or flakes. As such, many people have started to ignore some of the job failures merely assuming they are flakes.
Proposal
Two parts:
Definition of Flaky
Flaky is defined as jobs that fail no more than 10% of the time. The expectation is that a 3x retry effectively guarantees success, with at worst a 1 in 1000 chance of a flaky failure slipping through. This will be supported in #4763.
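For reference, the arithmetic behind the 3x retry claim, assuming each attempt fails independently with probability at most 0.1:

$$
P(\text{all three attempts fail}) \le 0.1^3 = \frac{1}{1000}
$$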
Definition of Broken
Jobs are broken if they fail more than 10% of the time. No exceptions.
Process
When there is a job failure on trunk, the first thing to do is check whether the last commit that landed was force-landed without running jobs. If so, try reverting that commit and waiting for jobs to run before proceeding with the steps below:
Broken jobs
If a job is broken on trunk, it is immediately removed as a blocking test. It is also removed from the suite of CI jobs that run on PRs and added to a special quarantine suite.
The quarantine suite will always pass with a ✅ on the PRs (via forcing an exit code 0 on a job). The quarantine suite will have a max size of 3 jobs. Any more, and we must pause feature development to address the debt.
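As an illustration of "forcing an exit code 0", here is a minimal sketch of a wrapper that could run a quarantined job, log the real outcome, and still report success to CI. The script is hypothetical; the actual mechanism would live in the BuildKite pipeline config.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper for quarantined jobs: run the real command,
log its true outcome, but always exit 0 so the CI step shows a pass."""
import subprocess
import sys

def run_quarantined(command: list[str]) -> None:
    result = subprocess.run(command)
    if result.returncode != 0:
        # Surface the real failure in the job log so it is not silently lost.
        print(f"[quarantine] {command} failed with code {result.returncode}",
              file=sys.stderr)
    # Always report success to CI; the quarantine suite never blocks a PR.
    sys.exit(0)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: quarantine_wrapper.py <command> [args...]", file=sys.stderr)
        sys.exit(0)  # even misuse should not block CI
    run_quarantined(sys.argv[1:])
```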
In addition, broken jobs get triaged as issues and prioritized based on their severity
SLA for fixing:
Flaky jobs
If a job is flaky, it is immediately marked as flaky in our CI system (as shown in #4763).
Flaky tests also get triaged as issues and prioritized based on their severity.
Detecting Flaky jobs
BuildKite has a nice GraphQL API that will make it possible to query for the failure rate of a given job or pipeline. We can create a Discord chat bot that, on some interval (daily?), checks failure rates over the last 30 completed runs of each job and alerts us in a Discord channel about jobs we need to mark as flaky that aren't, and jobs that are marked flaky but have passed the (1 in 10) threshold.
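A rough sketch of the bot's core check, under one plausible reading of the two alert conditions above. The recent run results are assumed to come from a helper wrapping the BuildKite GraphQL API (mentioned only in the comment below, not implemented here); the thresholds mirror the definitions above.

```python
# Sketch of the daily flakiness check. A hypothetical `fetch_last_results`
# helper would query the BuildKite GraphQL API for the outcomes
# (True = passed, False = failed) of a job's last completed runs on develop.

FLAKY_THRESHOLD = 0.10  # the 1-in-10 rate from the definitions above
WINDOW = 30             # last 30 completed runs

def failure_rate(results: list[bool]) -> float:
    return results.count(False) / len(results) if results else 0.0

def alerts_for(job_name: str, results: list[bool], marked_flaky: bool) -> list[str]:
    """Return Discord messages for a job, based on its recent failure rate."""
    rate = failure_rate(results[-WINDOW:])
    alerts = []
    if not marked_flaky and 0 < rate <= FLAKY_THRESHOLD:
        alerts.append(f"{job_name} fails {rate:.0%} of recent runs but is not marked flaky")
    if marked_flaky and rate > FLAKY_THRESHOLD:
        alerts.append(f"{job_name} is marked flaky but its {rate:.0%} failure rate passes the 1-in-10 threshold (broken)")
    return alerts
```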
Note the above is not a blocker for taskforce disbandment but is a good infrastructure task for the future.
Epic: #4735