MarathonLabs / marathon

Cross-platform test runner
https://docs.marathonlabs.io
GNU General Public License v2.0

How can I re-run a test until failure? #360

ZakTaccardi closed this issue 3 years ago

ZakTaccardi commented 4 years ago

I would like to execute a test up to X times

Count sharding almost works for this, except that if a test passes 7 of 8 runs, the Gradle task that executes it still succeeds. I would like the Gradle task to fail immediately once a test fails.

https://malinskiy.github.io/marathon/doc/options.html#count-sharding

Ultimately, I'm trying to create an API like:

./gradlew :lib:marathonDebugAndroidTest -PmarathonClass=MainFragmentTest -PmarathonMethod=test -PmarathonTimesToRetryUntilFailure=30

The idea is that I could run MainFragmentTest#test up to 30 times to check for flakiness. Once a failure occurs, display the report.
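
A minimal sketch of the Gradle wiring I have in mind, assuming the plugin's marathon { } block with the shardingStrategy/strictMode DSL quoted later in this thread; the -P property name is my own invention, and the class/method filtering wiring is omitted:

// build.gradle.kts (sketch)
marathon {
    // hypothetical property: -PmarathonTimesToRetryUntilFailure=30
    val times = (project.findProperty("marathonTimesToRetryUntilFailure") as String?)
        ?.toIntOrNull() ?: 1
    shardingStrategy {
        countSharding {
            count = times // run each selected test `times` times
        }
    }
    strictMode = true // a single failing run fails the Gradle task
}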

Malinskiy commented 4 years ago

@ZakTaccardi you can use strictMode. By default it's false, but if set to true it will retry the test until either a failure occurs or all the retries are finished. Please see #196 and use search next time :) This is also covered in the docs.

ZakTaccardi commented 4 years ago

@Malinskiy so this actually doesn't quite work for my use case.

I want to fail the build immediately once a test failure occurs.

shardingStrategy {
  countSharding {
    count = 5 // execute the test 5 times
  }
}
strictMode = true // any failing run fails the task, but all 5 runs still execute

Expected:

run 1: PASS
run 2: PASS
run 3: FAIL

The build ends immediately with failure, and the dev can view the marathon report.

Actual:

run 1: PASS
run 2: PASS
run 3: FAIL
run 4: PASS
run 5: PASS

The build ends with failure only after all five runs complete.

ZakTaccardi commented 4 years ago

To address this, maybe strictMode should have 3 total variations:

- off (the current default)
- on, "run all tests" (the current strictMode = true behavior)
- on, fail fast (stop at the first failure)

I probably could name the second one better than "run all tests" though. Maybe if it were represented like this:

sealed class StrictMode {
  // failFast = true would stop the run at the first failure
  data class On(val failFast: Boolean = false) : StrictMode()
  object Off : StrictMode()
}
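
For illustration, a minimal sketch of the semantics this proposal implies; the runWithRetries function and its shape are hypothetical, not marathon internals:

fun runWithRetries(times: Int, mode: StrictMode, runOnce: () -> Boolean): Boolean {
    var anyPassed = false
    var anyFailed = false
    repeat(times) {
        if (runOnce()) {
            anyPassed = true
        } else {
            anyFailed = true
            // Proposed On(failFast = true): bail out at the first failure,
            // so the report ends with the failing run and its logs
            if (mode is StrictMode.On && mode.failFast) return false
        }
    }
    return when (mode) {
        is StrictMode.On -> !anyFailed // strict: every run must pass
        StrictMode.Off -> anyPassed    // default: one passing run is enough
    }
}
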
Malinskiy commented 4 years ago

If the test fails, then by default all of its retries are not going to be attempted. If I understand your use-case correctly, you want marathon to execute tests A, B, C and, if even one of them fails, cancel everything and return.

This will produce skewed statistical probabilities in the metrics (same as in #329): the precision for tests will be different, because some of these tests were executed more times and some only until the first failure. This will result in a higher-than-actual probability of passing, and will hence produce flakier builds, because marathon will not be able to correctly predict the number of preventative retries.

This contradicts our vision for performance and stability of test execution.

ZakTaccardi commented 4 years ago

If I understand your use-case correctly, you want marathon to execute tests A,B,C and if even one of them will fail cancel everything and return.

I actually want marathon to execute Test A 5 times, but fail test execution immediately once the first failure occurs.

This contradicts our vision for performance and stability of test execution.

This wouldn't be the type of thing you run in CI; it's something you would execute locally. Statistics like a test being flaky 3% of the time aren't relevant for this use case. This is about the ability to investigate and fix flakiness once it has been identified. I find it very useful to have (and be able to reproduce locally) a marathon report for a test where flakiness has occurred, because it lets me look at the logs of the failure to troubleshoot the issue.

Below is the custom CLI API I'm trying to build to make troubleshooting a specific flaky test possible.

./gradlew :lib:marathonDebugAndroidTest -PmarathonClass=MainFragmentTest -PmarathonMethod=testA -PmarathonTimesToRetryUntilFailure=30

So MainFragmentTest#testA would be executed up to 30 times, but execution would stop immediately once a failure occurs. So if it succeeds the first 9 times but fails on the 10th, I want the Gradle task to fail, and test executions 11 through 30 would not run.

Not executing the remaining 20 test runs has two advantages: the report ends with the failing run, so its logs are preserved, and the feedback loop is much faster.

Malinskiy commented 4 years ago

I believe you're approaching the problem in the wrong way. A 3% difference in passing probability is a huge amount of flakiness for a test executed at enterprise scale. We should be talking about a 0.1% probability improvement.

Each day your test might be executed 100 times (say, 100 commits per day). If we're talking about a 3% failure rate (suppose a full fix brings the passing probability to 1), then each day the probability of this particular test never failing is 0.97^100 ≈ 0.05 = 5%, meaning a 95% chance of at least one failure. So you will observe flakiness problems almost every day, which in my opinion is unacceptable. When we're talking about 0.1% precision, the probability of such an event becomes 0.999^100 ≈ 0.904, which is kind of acceptable. The problem is that to measure a 0.1% improvement with enough accuracy, you would need to execute the test at least 1000 times. I don't think it's acceptable to do this on a local machine: for a test that is 1 minute long, your dev loop will be ~16 hours.
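
Spelled out as a quick Kotlin sketch (just a sanity check of the arithmetic above, nothing marathon-specific):

import kotlin.math.pow

fun main() {
    // Chance the test never fails across 100 daily runs at a 3% failure rate
    val flaky = 0.97.pow(100)  // ~0.048, i.e. ~95% chance of at least one red run per day
    // Same calculation after improving the test to a 0.1% failure rate
    val fixed = 0.999.pow(100) // ~0.905, roughly one red day in ten
    // Measuring a 0.1% improvement takes ~1000 runs; at 1 minute per run:
    val devLoopHours = 1000 / 60.0 // ~16.7 hours of serial execution
    println("3%: $flaky, 0.1%: $fixed, dev loop: $devLoopHours h")
}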

That's why we have the sharding mode. The idea is that you ask marathon to execute the test X number of times (depending on your required precision), and then verify the results on a large device farm, parallelizing the execution. Then your dev loop for fixing flakiness can still be sane, provided you have enough devices.

ZakTaccardi commented 4 years ago

I believe you're approaching the problem in the wrong way.

I'm simply trying to reproduce the flaky problem locally and get a marathon report with the failed test information as soon as possible (if the test is flaky).

I'm not saying your recommendation isn't good too, but my use case isn't about fixing flaky tests with a .001% flakiness problem.

I just used my approach to fix a flaky test locally that was failing about 20% of the time. I did the following:

./gradlew :lib:marathonDebugAndroidTest -PmarathonClass=MainFragmentTest -PmarathonMethod=testA -PmarathonTimesToRetryUntilFailure=30

The test failed on run 3, but it still executed the remaining 27 times.

I implemented the fix and ran the test 30 times again:

> Task :lib:marathonDebugAndroidTest
Allure environment data saved.
Marathon run finished:
Device pool omni:
        1 passed, 0 failed, 0 ignored tests
        Flakiness overhead: 286568ms
        Raw: 30 passed, 0 failed, 0 ignored, 0 incomplete tests
Total time: 0H 6m 3s

It's now safe to say that my fix did work.

Specifically, I am asking for a way to improve this step:

The test failed on run 3, but it still executed the remaining 27 times.

The extra 27 runs aren't helpful here, and are a waste of local development time.

ZakTaccardi commented 4 years ago

Ultimately I think your recommendation on how to use count sharding is aimed at a different goal than what I'm trying to achieve. I think the use case you're referring to is valid, and so is mine.

If countSharding isn't the correct approach for my use case, I would be happy with a separate option. I just want to re-run an identified flaky test up to X times, stopping at the first failure.

Malinskiy commented 4 years ago

When you say that you ran the test 30 times and fixed the problem, keep in mind that if you continued to execute it, there is a chance it would fail the next 70 times. We're talking about the same thing, btw.

ZakTaccardi commented 4 years ago

I understand that to fix flakiness at a scale of .01%, you need huge test runs on large device farms, so local testing isn't practical here.

But in either scenario, it’s useful to have a marathon report that contains the full logs of the failure. The below scenario demonstrates the problem.

Running a test with count sharding of 3:

run 1: PASS
run 2: FAIL
run 3: PASS

The failure logs from run 2 are lost, which makes it hard to figure out what went wrong when the logs in the marathon report are from run 3.

The ability to bail out immediately when a test fails solves this problem. It also creates a faster feedback loop.

ZakTaccardi commented 4 years ago

If my request to terminate the run early to improve the developer feedback loop causes an issue for analytics, then maybe it would make sense to allow disabling the push of analytics data for local runs?

https://github.com/Malinskiy/marathon/issues/365

ZakTaccardi commented 4 years ago

@Malinskiy would you be willing to accept a PR that introduces this feature? If yes, I would be happy to discuss an API for this that doesn't introduce analytics issues.

Malinskiy commented 3 years ago

Yeah, let's talk about the changes needed for this feature. We can enhance the strict mode for this use-case. Feel free to ping me in our Slack.

Malinskiy commented 3 years ago

I'm going to close this issue since it is a duplicate of #329. Please continue the discussion there.