avocado-framework / avocado

Avocado is a set of tools and libraries to help with automated testing. One can call it a test framework with benefits. Native tests are written in Python and they follow the unittest pattern, but any executable can serve as a test.
https://avocado-framework.github.io/

Investigate how to identify/show flaky tests inside Avocado #4396

Open beraldoleal opened 3 years ago

beraldoleal commented 3 years ago

Some CI environments have the ability to identify flaky tests so developers could maybe remove those tests from the suite or make them "safe-to-ignore" (or something similar).

One way to identify these is to analyze all runs of a specific suite and compare the results; this could be a feature of the avocado-server interface. Another way is to re-run failed tests.
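A minimal sketch of the first approach, assuming the layout of the `results.json` file that Avocado's JSON result plugin writes into each job results directory (a `tests` list whose entries carry `id` and `status`); the exact field names should be double-checked against the actual plugin output:

```python
import json
from collections import defaultdict
from pathlib import Path


def find_flaky(result_files):
    """Flag tests whose status differs across runs of the same suite."""
    statuses = defaultdict(set)
    for path in result_files:
        data = json.loads(Path(path).read_text())
        for test in data["tests"]:
            statuses[test["id"]].add(test["status"])
    # A test that both passed and failed across runs is a flaky candidate.
    return sorted(tid for tid, seen in statuses.items()
                  if {"PASS", "FAIL"} <= seen)


if __name__ == "__main__":
    # Default Avocado results location; adjust for custom job result dirs.
    runs = sorted(Path.home().glob("avocado/job-results/job-*/results.json"))
    for test_id in find_flaky(runs):
        print("flaky candidate:", test_id)
```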

This issue is to brainstorm and evaluate whether we would like to deliver this as a feature.

beraldoleal commented 3 years ago

@wainersm thanks for the suggestion. Please feel free to add your comments here.

beraldoleal commented 3 years ago

One possible idea: have a --detect-flaky-tests and a --move-flaky-to-quarantine (the names are just examples, we could improve them later):

1. Run a suite;
2. Detect failed tests;
3. Re-run those; if they pass, they are strong candidates to be flaky tests;
4. Tag those tests as flaky (the quarantine is just the pool of tests with the 'flaky' tag).

This will let developers know all the flaky tests while letting the job pass (not hurting trust in the healthy tests). The idea is not to postpone the investigation: someone needs to look into each flaky test soon. To force this, we could set a limit on how many 'flaky' tags a project may have. Not sure, just ideas.
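For step 4, Avocado's existing docstring directives and tag filtering already cover the mechanics; the option names above are hypothetical, but a quarantine built on tags could look roughly like this today:

```python
from avocado import Test


class IntermittentNetworkTest(Test):
    """A test that sometimes fails for external reasons (hypothetical).

    :avocado: tags=flaky
    """

    def test(self):
        pass  # the real test body would go here


# The quarantine is then just the pool of tests carrying the tag.
# Run only the healthy tests (the leading dash negates the tag):
#
#   $ avocado run tests/ --filter-by-tags=-flaky
#
# Or list the quarantined tests themselves:
#
#   $ avocado list tests/ --filter-by-tags=flaky
```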

wainersm commented 3 years ago

Let me begin by explaining how flaky tests are treated by the OpenShift CI, as it seems to me a simple and good-enough example for most cases. Keep in mind that I am not an expert on OpenShift tests/CI, but recently I have been involved with adding Kata Containers to that CI, so some information here might not be accurate.

OpenShift provides a suite with thousands of end-to-end (e2e) tests, mostly inherited from the Kubernetes project. They implement operations which an admin and/or developer would carry out on the platform, for instance, creating a pod with two or more containers sharing a volume.

e2e tests can be fragile for many reasons; for example, external entities (network, registry, leader election, etc.) can interfere with the result of a given operation, so tests may misbehave once in a while. That kind of test is known as "flaky". Their CI employs a very simple heuristic to catch and mark flaky tests:

beraldoleal commented 3 years ago

@wainersm this is much closer to what I imagine as a possible implementation, and to what I described. I would just add the ability to control how many flaky tests can be in the queue, or for how long. But that is another piece of the puzzle.
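A rough sketch of such a limit, assuming the quarantine is expressed via the flaky docstring tag as above; the budget value and the source-scanning heuristic are made up for illustration:

```python
import re
import sys
from pathlib import Path

FLAKY_BUDGET = 5  # arbitrary example threshold
TAG_DIRECTIVE = re.compile(r":avocado:\s*tags=\S*\bflaky\b")


def count_flaky(test_dir):
    """Count flaky tag directives in the test sources."""
    return sum(len(TAG_DIRECTIVE.findall(p.read_text()))
               for p in Path(test_dir).rglob("*.py"))


if __name__ == "__main__":
    n = count_flaky("tests/")
    if n > FLAKY_BUDGET:
        sys.exit(f"{n} tests in quarantine, budget is {FLAKY_BUDGET}: "
                 "investigate some before adding more")
    print(f"quarantine size: {n}/{FLAKY_BUDGET}, OK")
```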

wainersm commented 3 years ago

The QEMU project provides some Avocado-based tests (called "acceptance tests"). Very often it needs to deal with flaky tests, and usually one of two actions is taken:

1) Disable the test; or
2) Increase the test timeout, because usually the flaky test hits the time limit for its execution.

Detecting that a test is flaky is currently manual: someone in the community notices the test failing from time to time and then decides how to deal with it. For example, in https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg07334.html a flaky test is re-enabled but with its scope reduced in an attempt to avoid intermittent failures. Another example is https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg06852.html, where the test timeout was increased.
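For reference, both actions map onto existing Avocado primitives: the skip decorators for (1) and the per-test timeout attribute for (2). A minimal illustration (class names and values are invented):

```python
from avocado import Test, skip


class DisabledFlakyTest(Test):
    @skip("intermittent failures under investigation")
    def test(self):
        pass  # action 1: take the test out of the run entirely


class SlowBootTest(Test):
    # Action 2: Avocado interrupts the test if it runs longer than this
    # many seconds, so raising the value gives a slow machine more headroom.
    timeout = 240

    def test(self):
        pass  # e.g. a test that boots a kernel several times
```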

Such an "auto-detect flaky tests" feature in Avocado could help them spot the problematic cases.

beraldoleal commented 3 years ago

@wainersm, I'm curious about item 2.

Is the part of this test that hits the time limit the preparation (for instance, downloading an image), or is it the real test?

wainersm commented 3 years ago

> @wainersm, I'm curious about item 2.
>
> Is the part of this test that hits the time limit the preparation (for instance, downloading an image), or is it the real test?

Hi @beraldoleal, I didn't have a look at the logs, but I know that the test boots a kernel several times. Maybe it takes longer when the machine running it is overloaded (or something else slows the process down), and as a result it hits the time limit.