Replay failed tests - Githubissues

theredfish commented 2 years ago

I was wondering if it's something that we can add to the framework. We already have a function to repeat failed tests output but we don't have a function to run failed tests again. It's common to have flaky tests that sometimes can be solved by running them again.

For example :

#[tokio::main]
async fn main() {
    AnimalWorld::cucumber()
        .replay_failed(2)
        .run_and_exit("tests/features/book/output/terminal_repeat_failed.feature")
        .await;
}

Where 2 is the maximum number of times the tests should be run again in case of test failures. Let's say we have the tests A, B, C and D :

During the first run only A passes
Then B, C and D are run again (1 replay left if we have new test failures)
Only B passes, we run again C and D (0 replay left)
D fails again, this one is certainly more than just unstable

Regarding the output we have two choices :

Print all test executions : it's transparent but can be repetitive when the tests fail multiple times (like D in this example)
Or just print the result once the last replay is done (which can be the maximum, here 2, or a previous run if all tests are passing)

tyranron commented 2 years ago

@theredfish I'm not sure whether it has sense. Flaky tests should not be a common thing. In our codebases we actively reworking them to remove any flakiness, because it's very hard to reason about tests when you have the one: either a test fails due to flakinees, or the test may pass due to flakiness too. So I'm somewhat against about making life easier with flaky tests instead of giving people more reasons to remove it.

Let's hear what other users think about it.

ilslv commented 2 years ago

@theredfish I do think, that adding retries for flaky tests is a great feature to have, but I have couple concerns about proposed implementation.

.replay_failed(2)

In addition to specifying number of times test should be retried, I think that we should retry only tests tagged as @flaky or something like that, as being explicit is better that implicit here. Maybe even allow to override this value with something like @flaky(retries = 3). I want this library to be a tool, that is hard to misuse and with defaults that follow best practices. So adding @flaky should be good point of friction for user to think twice, why this Scenario is flaky.

During the first run only A passes

Then B, C and D are run again (1 replay left if we have new test failures)

Only B passes, we run again C and D (0 replay left)

D fails again, this one is certainly more than just unstable

I think, that this may lead to unexpected problems: on panic changes from step B may be partially applied and fully retiring it may cause some unexpected changes in World state. Consider the following Step implementation:

#[given(...)]
fn apply_then_assert(w: &mut World) {
    w.counter += 1;
    assert_eq!(w.counter, 3);
}

Retrying this Step after an assert_eq! will always lead to incrementation of the w.counter. So, as we don't impose Clone bound on our World (otherwise we would be able to .clone() the World before every Step and rollback to it, if needed), the only option left is to retry entire Scenario.

Or just print the result once the last replay is done (which can be the maximum, here 2, or a previous run if all tests are passing)

I'm not sure this can be achieved with streaming output like ours. And even if it would, I think, that we should be transparent about failed flaky tests and also include stats about them in Summarized Writer.

@tyranron

Flaky tests should not be a common thing.

I saw couple of conference talks, where people from huge FAANG-like companies argued that at this scale flaky tests are inevitable. I'm not sure I agree with them, but there is at least opinions floating around. Also other test runners provide this feature out of the box.

tyranron commented 2 years ago

@ilslv okay, I see that there are enough reasons for having this feature.

Speaking of the feature design...

In addition to specifying number of times test should be retried, I think that we should retry only tests tagged as @flaky or something like that, as being explicit is better that implicit here.

I think, both variants should be provided, so we have the natural tags inheritance:

tag on a scenario enables retrying for that scenario and overwrites any other settings;
tag on a feature enables retrying for all the scenarios of that feature;
specifying retries on the whole runner or via CLI enables retrying for the whole set of scenarios.

I propose @retry(n) tag, --retry=n CLI option and .retry(n) runner method would be quite a consistent and ergonomic naming here.

And even if it would, I think, that we should be transparent about failed flaky tests and also include stats about them in Summarized Writer.

The detailed summarization is "must have" 👍

Questions:

Should we retry the scenario right after its failure, or wait for the whole set to finish and only then retry failures? Should this be controlled by option (like @retry(3).postponed/@retry(3).immediately)?

ilslv commented 2 years ago

@tyranron I think that reasonable default should be rerunning Scenarios immediately. But it would be great to support postponed reruns or even combination of postponed and serial for really flaky ones.

tyranron commented 2 years ago

@ilslv yup.

It also worths to mark the retried scenarios explicitly in the output, like the following:

Feature: Animal feature
  Scenario: If we feed a hungry cat it will no longer be hungry
    ✔  Given a hungry cat
    ✔  When I feed the cat
    ✔  Then the cat is not hungry
Retry #1 | Feature: Animal feature
  Retry #1 | Scenario: If we feed a hungry cat it will no longer be hungry
    ✔  Given a hungry cat
    ✔  When I feed the cat
    ✔  Then the cat is not hungry

theredfish commented 2 years ago

Thank you for the feedback! Indeed the idea isn't to encourage to ignore flaky tests but have a way to handle them waiting for a fix.

The tag is a good idea so we offer a different granularity and an explicit way.

I think, that this may lead to unexpected problems: on panic changes from step B may be partially applied

My bad it wasn't clear enough but my example was about scenarios not steps.

I can try to implement this feature if you want ? If you're available to guide me during the development ?

ilslv commented 2 years ago

@theredfish

My bad it wasn't clear enough but my example was about scenarios not steps.

Actually Scenarios are generally run in parallel, so there is no need for additional complexities you've described. We can just rerun failed Scenarios on their own.

I can try to implement this feature if you want ? If you're available to guide me during the development ?

I'll be happy to help you with development of this feature!

theredfish commented 2 years ago

I was a bit busy with other projects, I'll start to work on it :)

ilslv commented 2 years ago

While implementing retries for failed Scenarios I've stumbled upon an inconvenience in the discussed design. In case Writer is wrapped into a Normalize I expect it to wait on current Feature/Rule until all retried Scenarios are passed or drain out their retry attempts. While In case retry.immediately this works fine, for retry.postponed this will wait until all other Scenarios start their execution and only then retry it's attempt as it places retried Scenarios at the end of the queue. This will obviously hang up output for a while.

We can use retry.postponed(3s) to specify the delay for a retry. This can be almost achieved while still being runtime-agnostic with the following algorithm: use now() + delay as a deadline and place Scenario in front of the queue. On selection we will drain out only Scenarios which deadline isn't exceeded yet. The only problem with this solution is that we can end up only with Scenarios with active deadlines. Here we can be stuck inside a busy-loop or spawn a thread, that will sleep and notify us, when shortest deadline exceeds.

Another way to deal with this problem is to output retried Scenarios as a separate brach, re-outputting Feature and Rule. Here we need to understand what will actually be outputted: only retry itself or with additional info about previous/all failures.

ack @tyranron @theredfish

theredfish commented 2 years ago

That's great if you can do some progress on this feature, I haven't been able to work on it since my last commit. But I can continue later if you want. Just switching between priorities for the moment.

With your current example how do you specify the number of attempts for a postponed retry? Do we remove the possibility to specify de number of attempts per retry tag?

Since postponed is a strategy selected by users, they should expect more delays, is it really an issue to have to wait more? It might even be intended to decouple the execution from the rest. Flaky tests can be due to side effects created by other tests for example.

Regarding the delay if I understand it will block the execution until the delay is reached, but will execute the test asynchronously from the rest.

Another way to deal with this problem is to output retried Scenarios as a separate brach

What do you mean by separate branch? From the execution flow? Then it will be a bit confusing with the existing feature replay_failed_tests no? What if we have both?

Maybe we can keep a "naive" implementation first and then iterate? We could provide the immediate retry as a first solution, and later work around a postponed version where we can discuss more in depth of the different solutions. It will avoid to worry about the output and existing features. A retry with a delay is still a good idea I think, most of the time we need to wait a bit before retrying. We could then have only one default mode, if there is no delay we retry without one, if there is a delay we just wait during this time. As a first implementation it might be enough.

Or we can reduce this first feature to only an immediate retry. Do we keep the number of attempts config?

tyranron commented 2 years ago

@theredfish

What do you mean by separate branch? From the execution flow?

It means this.

Then it will be a bit confusing with the existing feature replay_failed_tests no? What if we have both?

No, the replay_failed_tests just re-prints the failed scenarios at the end. So, it won't print successfully retried scenarios at all. Just those, which failed even after all the retry attempts.

@ilslv

We can use retry.postponed(3s) to specify the delay for a retry.

Well, the delay for retry is a good feature to have. And having so much options to configure retrying, I think we need to revisit the @retry tag syntax: it may be immediate, postponed with delay, postponed to the end, at the same time we do want to specify the attempts number.

Maybe, something like this?

@retry(3, after = 3s): retries 3 times with 3 seconds delay each time.
@retry(2): retries immediately 2 times.
@retry: retries immediately one time by default (effectively the same as @retry(1)).
@retry(3, after = all): retries 3 times, each time after all tests have been run.

So, the "postponed" case is handled by the optional after = parameter.

Regarding the CLI option, we may:

either allow this syntax in --retry (like `--retry='2, after = 3s');
or keep them as separate CLI options (like --retry=2 --retry-after=3s).

The same goes for methods too.

I'd like keeping them separate more.

Another way to deal with this problem is to output retried Scenarios as a separate brach, re-outputting Feature and Rule.

After giving it quite a though, I see this as the only reasonable way. Retrying without giving explicit feedback about it is quite a subtle thing. We better notify explicitly about what's going on. Also, mentioning retries in the overall execution stats would be a very pleasant feature too. So I vote for re-outputting only.

ilslv commented 2 years ago

@tyranron

Retrying without giving explicit feedback about it is quite a subtle thing. We better notify explicitly about what's going on.

Clear understanding of what's going on is a very important thing for me too, but I think the clearest way of notifying about retries is doing it the closest to the error itself. I image it something like that:

Maybe, something like this?

Actually @retry(3, after = 3s) isn't possible, as tags don't allow spaces, so maybe we return to @retry(3).after(3s)

tyranron commented 2 years ago

@ilslv

Actually @retry(3, after = 3s) isn't possible, as tags don't allow spaces, so maybe we return to @retry(3).after(3s)

I don't mind.

but I think the clearest way of notifying about retries is doing it the closest to the error itself.

But in case of .after(all) this will obviously stall the output. Hmmm... what if we output them right next to the failed step if there is no .after specified (immediate retries), otherwise we output them separately right when they're really executing?

ilslv commented 2 years ago

@tyranron I think, that having 2 different behaviours for immediate and postponed retries may be confusing for the users, mostly because reasons behind it aren't obvious. Even we didn't see them through until the implementation came.

I still think, that same Feature/Rule branch is a way to go because:

It's easier to interpret the output.
Retry of the whole Scenario should be used as a last resort. I think that we should push our users to implementing in-Step retry logic first. For example in case of a web-app retrying requests with error response n times before failing the whole Step, because thats how real frontends are interacting with the backends.

I expressed this concern before and I still think it's important, that this library should push users into using the best practices. So having a little bit of friction in a place, where user's tests are flaky and should be retried after a certain timeout is actually a good thing.

Also I still have some questions about how separate Feature/Rule branch should be outputted. Mainly how do we output errors of previous retries?

tyranron commented 2 years ago

@ilslv

Mainly how do we output errors of previous retries?

Why not just output them as a regular error?

I expressed this concern before and I still think it's important, that this library should push users into using the best practices. So having a little bit of friction in a place, where user's tests are flaky and should be retried after a certain timeout is actually a good thing.

Agree.

I still think, that same Feature/Rule branch is a way to go

So, could you sum up the whole feature design then? With the answers addressing the questions raised above in previous comments.

tyranron commented 2 years ago

@ilslv also, it seems that we've missed how upstream implementation handles this. And it definitely should, according to quick googling. Worth investigation, because complying with upstream is something that our users often want and we've hit it multiple times already.

ilslv commented 2 years ago

@tyranron out of all upstream implementations only ruby and js have support for retrying scenarios:

And even those who implement are using only CLI flags (--retry)/env vars to specify retry value for all scenarios. Moreover I couldn't find any implementation with streaming live output like ours. Other implementations just provide a bunch of final report formats, so you have to wait until the end or use --fail-fast CLI option. And none of them allow specifying the retry strategy.

Why not just output them as a regular error?

Should we output only previous retry error or all of them? And I'm not sure how to communicate clearly, that it isn't a new error, just a re-outputted one.

So, could you sum up the whole feature design then? With the answers addressing the questions raised above in previous comments.

I think we have a consensus on tags looking like @retry(3).after(3s) and supporting:

Number of retries (1 if not specified)
Retrying immediately (default), after timeout or at the end (.after(all))
Propagating from them Feature/Rule to all underlying scenarios

Alternatively you can overwrite number of retries for all scenarios via --retry CLI option and same with strategy and --retry-after.

Regarding the output I think that clearest way to understand what's going on with the retries is shown here. This means not moving on to the next Scenario until current one succeeds or drains out all of its retry attempts. This poses a problem with hanging output in case .after(...) is specified. We have to wait for exact time provided in the tag or until all other tests finish.

But I think that current implementation in #223 strikes a good balance between hanging output and actually seeing progress. This implementation doesn't move onto a next Feature (instead of Scenario), until all Scenarios inside pass or drain out their retry attempts. In practice that means you can see another Scenarios interleaved with retried ones.

tyranron commented 2 years ago

@ilslv

Should we output only previous retry error or all of them?

I think we should clearly output all the retry attempts and their results.

And I'm not sure how to communicate clearly, that it isn't a new error, just a re-outputted one.

Haven't got it. What do you mean by "re-outputted"?

I think we have a consensus on tags looking like @retry(3).after(3s) and supporting:
* Number of retries (1 if not specified)

* Retrying immediately (default), after timeout or at the end (`.after(all)`)

* Propagating from them `Feature`/`Rule` to all underlying scenarios
Alternatively you can overwrite number of retries for all scenarios via --retry CLI option and same with strategy and --retry-after.

Yes, that's what I've thought of. But how do we resolve the issue with the stuck output in .after(all) situation? Ignore it? Re-bikeshed the output? Disallow .after(all) at all?

This implementation doesn't move onto a next Feature (instead of Scenario), until all Scenarios inside pass or drain out their retry attempts. In practice that means you can see another Scenarios interleaved with retried ones.

This implies that .after(all) means "after all scenarios in the feature" rather than "after all test suite", right?

And even those who implement are using only CLI flags (--retry)/env vars to specify retry value for all scenarios. Moreover I couldn't find any implementation with streaming live output like ours. Other implementations just provide a bunch of final report formats, so you have to wait until the end or use --fail-fast CLI option. And none of them allow specifying the retry strategy.

Actually, I liked the --retry-tag-filter feature of the JS implementation. May be worth supporting in addition to what've discussed.

ilslv commented 2 years ago

@tyranron

Haven't got it. What do you mean by "re-outputted"?

Whole discussion about re-outputting errors is linked to output retries as a separate branch. By that I mean, that if we choose to go this route, retries can be outputted pretty far from each other, so maybe we should give users a clear idea, what went wrong on a previous run(s?).

But how do we resolve the issue with the stuck output in .after(all) situation? Ignore it? Re-bikeshed the output? Disallow .after(all) at all?

Thinking more and more about this, I think that we can ditch .after(all) at all. At least in a sense "run after all other scenarios". Users have no idea, when this will actually happen, I can't imagine a use-case, when this will be actually useful. Especially when we have @serial, which can also be combined with @retry. I think it originated just from nextest planning to have a similar functionality, not based on an actual need.

This implies that .after(all) means "after all scenarios in the feature" rather than "after all test suite", right?

Hm, I haven't thought about it, but this seems to be more useful, then previous meaning of .after(all). Especially when you consider, that Features usually combines Scenarios linked to the same functionality, that can be a main cause of flakiness. I need to see though, how hard it would be to implement this and how this would interact with retries after timeouts.

Actually, I liked the --retry-tag-filter feature of the JS implementation. May be worth supporting in addition to what've discussed.

Actually I've already implemented more powerful feature, basically like Cucumber::which_scenario(): closure decides how many and when retries should happen.

tyranron commented 2 years ago

@ilslv

By that I mean, that if we choose to go this route, retries can be outputted pretty far from each other, so maybe we should give users a clear idea, what went wrong on a previous run(s?).

I think that just print the error "as is" and later having the | Retry #<num> label is more than enough. Like here, but with an error.

I think that we can ditch .after(all) at all.

I'm not against this. I've thought about it too.

Actually I've already implemented more powerful feature, basically like Cucumber::which_scenario(): closure decides how many and when retries should happen.

That's OK, but the concern I've raised is not about power, but rather ergonomic and CLI. I could easily imagine the situation when someone wants to retry the test suite without populating @retry tags here and there. Like --retry=3 --retry-tag-filter='@webrtc or @http' and then --retry=2 --retry-tag-filter='@webrtc or @http or @animal. It's OK if it will be built on top of Cucumber::which_scenario, but I'd vote to have this in CLI as the use cases and ergonomics benefits are quite clear.

ilslv commented 2 years ago

Discussed with @tyranron:

Remove .after(all) entirely, only .after(3s) is allowed.
Output reties inside single Feature/Rule branch.
Add --retry, --retry-after and --retry-tag-filter CLI options

Detailed explanation of interactions between CLI options and tags

Let's explore how different sets of CLI options would interact with the following Feature:

Feature: Retries
    @retry
    Scenario: First
      Given a moral dilemma

    @flacky @retry(3)
    Scenario: Second
      Given a moral dilemma

    @retry(4).after(5s)
    Scenario: Third
      Given a moral dilemma

  Scenario: Fourth
    Given a moral dilemma

No CLI options at all

First Scenario retried once without a delay
Second Scenario retried 3 times without a delay
Third Scenario retried 4 times with 5 seconds delay in between
Fourth Scenario isn't retried

`--retry=5`

First Scenario retried 5 times without a delay
Second Scenario retried 3 times without a delay
Third Scenario retried 4 times with 5 second delay in between
Fourth Scenario retried 5 times

`--retry-tag-filter='@flacky'`

First Scenario retried once without a delay
Second Scenario retried 3 times without a delay
Third Scenario retried 4 times with 5 second delay in between
Fourth Scenario isn't retried

`--retry-after=10s`

First Scenario retried once with 10 seconds delay in between
Second Scenario retried 3 times with 10 seconds delay in between
Third Scenario retried 4 times with 5 seconds delay in between
Fourth Scenario isn't retried

`--retry=5 --retry-after=10s`

First Scenario retried 5 times with 10 seconds delay in between
Second Scenario retried 3 times with 10 seconds delay in between
Third Scenario retried 4 times with 5 seconds delay in between
Fourth Scenario retried 5 times with 10 seconds delay in between

`--retry=5 --retry-tag-filter='@flacky'`

First Scenario retried once without a delay
Second Scenario retried 3 times without a delay
Third Scenario retried 4 times with 5 second delay in between
Fourth Scenario isn't retried

`--retry=5 --retry-after=10s --retry-tag-filter='@flacky'`

First Scenario retried 5 times with 10 seconds delay in between
Second Scenario retried 3 times with 10 seconds delay in between
Third Scenario retried 4 times with 5 second delay in between
Fourth Scenario isn't retried

tyranron commented 2 years ago

@theredfish released in 0.14.0.

theredfish commented 2 years ago

Thank you very much for the hard work @ilslv @tyranron . That's a fantastic implementation!

cucumber-rs / cucumber

Replay failed tests #212

Detailed explanation of interactions between CLI options and tags

No CLI options at all

`--retry=5`

`--retry-tag-filter='@flacky'`

`--retry-after=10s`

`--retry=5 --retry-after=10s`

`--retry=5 --retry-tag-filter='@flacky'`

`--retry=5 --retry-after=10s --retry-tag-filter='@flacky'`