@theredfish I'm not sure whether it makes sense. Flaky tests should not be a common thing. In our codebases we actively rework them to remove any flakiness, because it's very hard to reason about tests when you have one: a test may fail due to flakiness, but it may also pass due to flakiness. So I'm somewhat against making life easier with flaky tests instead of giving people more reasons to remove them.
Let's hear what other users think about it.
@theredfish I do think that adding retries for flaky tests is a great feature to have, but I have a couple of concerns about the proposed implementation.

`.replay_failed(2)`

In addition to specifying the number of times a test should be retried, I think that we should retry only tests tagged as `@flaky` or something like that, as being explicit is better than implicit here. Maybe even allow overriding this value with something like `@flaky(retries = 3)`.

I want this library to be a tool that is hard to misuse and with defaults that follow best practices. So adding `@flaky` should be a good point of friction for the user to think twice about why this `Scenario` is flaky.
- During the first run only A passes
- Then B, C and D are run again (1 replay left if we have new test failures)
- Only B passes, we run again C and D (0 replay left)
- D fails again, this one is certainly more than just unstable
I think that this may lead to unexpected problems: on panic, changes from step `B` may be partially applied, and fully retrying it may cause some unexpected changes in the `World` state. Consider the following `Step` implementation:
```rust
#[given(...)]
fn apply_then_assert(w: &mut World) {
    w.counter += 1;
    assert_eq!(w.counter, 3);
}
```
Retrying this `Step` after the `assert_eq!` will always increment `w.counter` again. So, as we don't impose a `Clone` bound on our `World` (otherwise we would be able to `.clone()` the `World` before every `Step` and roll back to it, if needed), the only option left is to retry the entire `Scenario`.
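For illustration, this is roughly the kind of snapshot-and-rollback that a `Clone` bound would enable (purely hypothetical: cucumber imposes no such bound, and real steps fail by panicking rather than returning a `Result`):

```rust
// Hypothetical sketch: snapshot the `World` before an attempt and roll back on
// failure. Simplified to `Result`-based steps instead of panicking assertions.
fn run_step_with_rollback<W: Clone>(
    world: &mut W,
    step: impl Fn(&mut W) -> Result<(), String>,
    retries: usize,
) -> Result<(), String> {
    let mut last_err = None;
    for _ in 0..=retries {
        let snapshot = world.clone(); // remember the state before the attempt
        match step(world) {
            Ok(()) => return Ok(()),
            Err(e) => {
                *world = snapshot; // discard partially applied changes
                last_err = Some(e);
            }
        }
    }
    Err(last_err.unwrap_or_else(|| "step failed".to_owned()))
}
```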
Or just print the result once the last replay is done (which can be the maximum, here 2, or a previous run if all tests are passing)
I'm not sure this can be achieved with streaming output like ours. And even if it could, I think that we should be transparent about failed flaky tests and also include stats about them in the `Summarized` `Writer`.
@tyranron
Flaky tests should not be a common thing.
I saw a couple of conference talks where people from huge FAANG-like companies argued that at this scale flaky tests are inevitable. I'm not sure I agree with them, but such opinions are at least floating around. Also, other test runners provide this feature out of the box.
@ilslv okay, I see that there are enough reasons for having this feature.
Speaking of the feature design...
In addition to specifying the number of times a test should be retried, I think that we should retry only tests tagged as `@flaky` or something like that, as being explicit is better than implicit here.
I think both variants should be provided, so we have the natural tags inheritance. I propose the `@retry(n)` tag, the `--retry=n` CLI option and the `.retry(n)` runner method: that would be quite a consistent and ergonomic naming here.
And even if it could, I think that we should be transparent about failed flaky tests and also include stats about them in the `Summarized` `Writer`.
The detailed summarization is a "must have" 👍
Questions: should the retried `Scenario`s be rerun immediately or postponed to the end (sort of `@retry(3).postponed`/`@retry(3).immediately`)?

@tyranron I think that the reasonable default should be rerunning `Scenario`s immediately. But it would be great to support postponed reruns or even a combination of postponed and serial for really flaky ones.
@ilslv yup.
It's also worth marking the retried scenarios explicitly in the output, like the following:
```
Feature: Animal feature
  Scenario: If we feed a hungry cat it will no longer be hungry
    ✔ Given a hungry cat
    ✔ When I feed the cat
    ✔ Then the cat is not hungry
Retry #1 | Feature: Animal feature
Retry #1 |   Scenario: If we feed a hungry cat it will no longer be hungry
    ✔ Given a hungry cat
    ✔ When I feed the cat
    ✔ Then the cat is not hungry
```
Thank you for the feedback! Indeed, the idea isn't to encourage ignoring flaky tests, but to have a way to handle them while waiting for a fix.
The tag is a good idea: it offers a different granularity and an explicit way to opt in.
I think, that this may lead to unexpected problems: on panic changes from step B may be partially applied
My bad, it wasn't clear enough, but my example was about scenarios, not steps.
I can try to implement this feature if you want, if you're available to guide me during the development?
@theredfish
My bad it wasn't clear enough but my example was about scenarios not steps.
Actually, `Scenario`s are generally run in parallel, so there is no need for the additional complexities you've described. We can just rerun failed `Scenario`s on their own.
I can try to implement this feature if you want ? If you're available to guide me during the development ?
I'll be happy to help you with development of this feature!
I was a bit busy with other projects, I'll start to work on it :)
While implementing retries for failed `Scenario`s I've stumbled upon an inconvenience in the discussed design. In case the `Writer` is wrapped into a `Normalize`, I expect it to wait on the current `Feature`/`Rule` until all retried `Scenario`s have passed or drained out their retry attempts. While in case of `retry.immediately` this works fine, for `retry.postponed` this will wait until all other `Scenario`s start their execution and only then retry its attempt, as it places retried `Scenario`s at the end of the queue. This will obviously hang up the output for a while.
We can use `retry.postponed(3s)` to specify the delay for a retry. This can be almost achieved while still being runtime-agnostic with the following algorithm: use `now() + delay` as a deadline and place the `Scenario` in front of the queue. On selection we drain out only the `Scenario`s whose deadline has already passed. The only problem with this solution is that we can end up only with `Scenario`s with active deadlines. Here we can be stuck inside a busy-loop, or spawn a thread that will sleep and notify us when the shortest deadline is exceeded.
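A rough sketch of that selection logic (illustrative only: the `RetriedScenario` type and the function names are made up, this isn't the actual runner code):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Hypothetical queue entry for a scenario awaiting a delayed retry.
struct RetriedScenario {
    name: String,
    deadline: Instant, // earliest moment this scenario may be retried
}

// `now() + delay` becomes the deadline; the scenario goes to the front of the queue.
fn schedule_retry(queue: &mut VecDeque<RetriedScenario>, name: String, delay: Duration) {
    queue.push_front(RetriedScenario { name, deadline: Instant::now() + delay });
}

// Drain only the scenarios whose deadline has already passed,
// keeping the rest (still "cooling down") in the queue.
fn ready_for_retry(queue: &mut VecDeque<RetriedScenario>) -> Vec<RetriedScenario> {
    let now = Instant::now();
    let mut ready = Vec::new();
    let mut waiting = VecDeque::new();
    while let Some(sc) = queue.pop_front() {
        if sc.deadline <= now {
            ready.push(sc);
        } else {
            waiting.push_back(sc);
        }
    }
    *queue = waiting;
    ready
}
```

When `ready_for_retry` comes back empty while the queue is non-empty, we hit exactly the busy-loop/sleep problem described above.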
Another way to deal with this problem is to output retried `Scenario`s as a separate branch, re-outputting the `Feature` and `Rule`. Here we need to understand what will actually be outputted: only the retry itself, or also additional info about the previous/all failures.
ack @tyranron @theredfish
That's great if you can make some progress on this feature; I haven't been able to work on it since my last commit. But I can continue later if you want, I'm just switching between priorities for the moment.
With your current example, how do you specify the number of attempts for a postponed retry? Do we remove the possibility to specify the number of attempts per `retry` tag?
Since postponed is a strategy selected by users, they should expect more delays; is it really an issue to have to wait more? It might even be intended, to decouple the execution from the rest. Flaky tests can be due to side effects created by other tests, for example.
Regarding the delay, if I understand correctly, it will block the execution until the delay is reached, but will execute the test asynchronously from the rest.
Another way to deal with this problem is to output retried `Scenario`s as a separate branch
What do you mean by a separate branch? From the execution flow? Then it will be a bit confusing with the existing `replay_failed_tests` feature, no? What if we have both?
Maybe we can keep a "naive" implementation first and then iterate? We could provide the immediate retry as a first solution, and later work out a postponed version where we can discuss the different solutions more in depth. It would avoid worrying about the output and existing features. A retry with a delay is still a good idea I think; most of the time we need to wait a bit before retrying. We could then have only one default mode: if there is no delay we retry without one, if there is a delay we just wait for that time. As a first implementation it might be enough.
Or we can reduce this first feature to only an immediate retry. Do we keep the number of attempts config?
@theredfish
What do you mean by separate branch? From the execution flow?
It means this.
Then it will be a bit confusing with the existing feature replay_failed_tests no? What if we have both?
No, the `replay_failed_tests` just re-prints the failed scenarios at the end. So, it won't print successfully retried scenarios at all, just those which failed even after all the retry attempts.
@ilslv
We can use retry.postponed(3s) to specify the delay for a retry.
Well, the delay for a retry is a good feature to have. And having so many options to configure retrying, I think we need to revisit the `@retry` tag syntax: it may be immediate, postponed with a delay, or postponed to the end, while at the same time we do want to specify the number of attempts.
Maybe, something like this?

- `@retry(3, after = 3s)`: retries 3 times with 3 seconds delay each time.
- `@retry(2)`: retries immediately 2 times.
- `@retry`: retries immediately one time by default (effectively the same as `@retry(1)`).
- `@retry(3, after = all)`: retries 3 times, each time after all tests have been run.

So, the "postponed" case is handled by the optional `after =` parameter.
Regarding the CLI option, we may:

- parse everything in a single `--retry` (like `--retry='2, after = 3s'`);
- or use separate options (`--retry=2 --retry-after=3s`).

The same goes for methods too.
I'd prefer keeping them separate.
Another way to deal with this problem is to output retried `Scenario`s as a separate branch, re-outputting the `Feature` and `Rule`.
After giving it quite a thought, I see this as the only reasonable way. Retrying without giving explicit feedback about it is quite a subtle thing; we'd better notify explicitly about what's going on. Also, mentioning retries in the overall execution stats would be a very pleasant feature too. So I vote for re-outputting only.
@tyranron
Retrying without giving explicit feedback about it is quite a subtle thing. We better notify explicitly about what's going on.
Clear understanding of what's going on is a very important thing for me too, but I think the clearest way of notifying about retries is doing it as close to the error itself as possible. I imagine it something like this:
Maybe, something like this?
Actually, `@retry(3, after = 3s)` isn't possible, as tags don't allow spaces, so maybe we return to `@retry(3).after(3s)`.
@ilslv
Actually, `@retry(3, after = 3s)` isn't possible, as tags don't allow spaces, so maybe we return to `@retry(3).after(3s)`.
I don't mind.
but I think the clearest way of notifying about retries is doing it the closest to the error itself.
But in case of `.after(all)` this will obviously stall the output. Hmmm... what if we output them right next to the failed step if there is no `.after` specified (immediate retries), otherwise we output them separately right when they're really executing?
@tyranron I think that having 2 different behaviours for immediate and postponed retries may be confusing for the users, mostly because the reasons behind it aren't obvious. Even we didn't fully think them through until the implementation came along.
I still think that the same `Feature`/`Rule` branch is the way to go, because:

- Retrying a whole `Scenario` should be used as a last resort. I think that we should push our users to implementing in-`Step` retry logic first: for example, in case of a web app, retrying requests with an error response `n` times before failing the whole `Step`, because that's how real frontends interact with backends (see the sketch below). I expressed this concern before and I still think it's important that this library should push users into using best practices. So having a little bit of friction in a place where users' tests are flaky and should be retried after a certain timeout is actually a good thing.
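For example, such in-`Step` retry logic could look roughly like this (a sketch only: the helper and the commented-out step definition use made-up names and are not part of cucumber's API):

```rust
use std::future::Future;

// Hypothetical helper: retry a fallible async call a few times inside a single
// step, instead of marking the whole scenario as retryable.
async fn with_retries<T, E, Fut>(
    attempts: usize,
    mut call: impl FnMut() -> Fut,
) -> Result<T, E>
where
    Fut: Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..attempts {
        match call().await {
            Ok(v) => return Ok(v),
            // A delay between attempts could be added here with the runtime's sleep.
            Err(e) => last_err = Some(e),
        }
    }
    Err(last_err.expect("`attempts` should be > 0"))
}

// Usage inside a step definition (names are placeholders):
//
// #[when("the frontend fetches the user profile")]
// async fn fetch_profile(w: &mut MyWorld) {
//     let profile = with_retries(3, || fetch_profile_request(&w.base_url)).await
//         .expect("profile request kept failing");
//     w.profile = Some(profile);
// }
```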
Also, I still have some questions about how the separate `Feature`/`Rule` branch should be outputted. Mainly, how do we output errors of previous retries?
@ilslv
Mainly how do we output errors of previous retries?
Why not just output them as a regular error?
I expressed this concern before and I still think it's important, that this library should push users into using the best practices. So having a little bit of friction in a place, where user's tests are flaky and should be retried after a certain timeout is actually a good thing.
Agree.
I still think that the same `Feature`/`Rule` branch is the way to go
So, could you sum up the whole feature design then? With the answers addressing the questions raised above in previous comments.
@tyranron out of all upstream implementations, only Ruby and JS have support for retrying scenarios. And even those who implement it are using only CLI flags (`--retry`)/env vars to specify the retry value for all scenarios. Moreover, I couldn't find any implementation with streaming live output like ours. Other implementations just provide a bunch of final report formats, so you have to wait until the end or use the `--fail-fast` CLI option. And none of them allow specifying the retry strategy.
Why not just output them as a regular error?
Should we output only the previous retry's error or all of them? And I'm not sure how to communicate clearly that it isn't a new error, just a re-outputted one.
So, could you sum up the whole feature design then? With the answers addressing the questions raised above in previous comments.
I think we have a consensus on tags looking like `@retry(3).after(3s)` and supporting:

- Number of retries (1 if not specified)
- Retrying immediately (default), after a timeout, or at the end (`.after(all)`)
- Propagating from `Feature`/`Rule` to all underlying scenarios

Alternatively you can overwrite the number of retries for all scenarios via the `--retry` CLI option, and the same with the strategy and `--retry-after`.
Regarding the output, I think that the clearest way to understand what's going on with the retries is shown here. This means not moving on to the next `Scenario` until the current one succeeds or drains out all of its retry attempts. This poses a problem with hanging output in case `.after(...)` is specified: we have to wait for the exact time provided in the tag, or until all other tests finish.
But I think that the current implementation in #223 strikes a good balance between hanging output and actually seeing progress. This implementation doesn't move onto the next `Feature` (instead of `Scenario`) until all `Scenario`s inside pass or drain out their retry attempts. In practice that means you can see other `Scenario`s interleaved with retried ones.
@ilslv
Should we output only previous retry error or all of them?
I think we should clearly output all the retry attempts and their results.
And I'm not sure how to communicate clearly, that it isn't a new error, just a re-outputted one.
Haven't got it. What do you mean by "re-outputted"?
I think we have a consensus on tags looking like `@retry(3).after(3s)` and supporting:

- Number of retries (1 if not specified)
- Retrying immediately (default), after a timeout, or at the end (`.after(all)`)
- Propagating from `Feature`/`Rule` to all underlying scenarios

Alternatively you can overwrite the number of retries for all scenarios via the `--retry` CLI option, and the same with the strategy and `--retry-after`.
Yes, that's what I've thought of. But how do we resolve the issue with the stuck output in the `.after(all)` situation? Ignore it? Re-bikeshed the output? Disallow `.after(all)` at all?
This implementation doesn't move onto the next `Feature` (instead of `Scenario`) until all `Scenario`s inside pass or drain out their retry attempts. In practice that means you can see other `Scenario`s interleaved with retried ones.
This implies that `.after(all)` means "after all scenarios in the feature" rather than "after the whole test suite", right?
And even those who implement it are using only CLI flags (`--retry`)/env vars to specify the retry value for all scenarios. Moreover, I couldn't find any implementation with streaming live output like ours. Other implementations just provide a bunch of final report formats, so you have to wait until the end or use the `--fail-fast` CLI option. And none of them allow specifying the retry strategy.
Actually, I liked the `--retry-tag-filter` feature of the JS implementation. May be worth supporting in addition to what we've discussed.
@tyranron
Haven't got it. What do you mean by "re-outputted"?
The whole discussion about re-outputting errors is linked to outputting retries as a separate branch. By that I mean that if we choose to go this route, retries can be outputted pretty far from each other, so maybe we should give users a clear idea of what went wrong on the previous run(s).
But how do we resolve the issue with the stuck output in the `.after(all)` situation? Ignore it? Re-bikeshed the output? Disallow `.after(all)` at all?
Thinking more and more about this, I think that we can ditch `.after(all)` entirely, at least in the sense of "run after all other scenarios". Users have no idea when this will actually happen, and I can't imagine a use-case where it would actually be useful, especially when we have `@serial`, which can also be combined with `@retry`. I think it originated just from `nextest` planning to have a similar functionality, not from an actual need.
This implies that `.after(all)` means "after all scenarios in the feature" rather than "after the whole test suite", right?
Hm, I haven't thought about it, but this seems to be more useful than the previous meaning of `.after(all)`, especially when you consider that `Feature`s usually combine `Scenario`s linked to the same functionality, which can be the main cause of flakiness. I need to see, though, how hard it would be to implement this and how it would interact with retries after timeouts.
Actually, I liked the `--retry-tag-filter` feature of the JS implementation. May be worth supporting in addition to what we've discussed.
Actually, I've already implemented a more powerful feature, basically like `Cucumber::which_scenario()`: a closure decides how many retries should happen and when.
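Roughly, the shape of such a hook could be the following (illustrative only: `RetryOptions` and this signature are assumptions, not the actual cucumber API; the `Feature`/`Rule`/`Scenario` types are the ones from the `gherkin` crate):

```rust
use std::time::Duration;

// Assumed illustration type; not the actual cucumber API.
pub struct RetryOptions {
    pub retries: usize,
    pub after: Option<Duration>, // delay between attempts, if any
}

// A policy in the spirit of `which_scenario()`: inspect the parsed scenario
// and decide whether (and how) it should be retried.
pub fn retry_policy(
    _feature: &gherkin::Feature,
    _rule: Option<&gherkin::Rule>,
    scenario: &gherkin::Scenario,
) -> Option<RetryOptions> {
    // Retry scenarios tagged `@flacky` three times, immediately.
    scenario
        .tags
        .iter()
        .any(|tag| tag == "flacky")
        .then(|| RetryOptions { retries: 3, after: None })
}
```

A CLI layer like `--retry-tag-filter` could then be implemented on top of such a closure.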
@ilslv
By that I mean, that if we choose to go this route, retries can be outputted pretty far from each other, so maybe we should give users a clear idea, what went wrong on a previous run(s?).
I think that just printing the error "as is" and later having the `| Retry #<num>` label is more than enough. Like here, but with an error.
I think that we can ditch `.after(all)` entirely.
I'm not against this. I've thought about it too.
Actually, I've already implemented a more powerful feature, basically like `Cucumber::which_scenario()`: a closure decides how many retries should happen and when.
That's OK, but the concern I've raised is not about power, but rather about ergonomics and the CLI. I could easily imagine the situation when someone wants to retry the test suite without populating `@retry` tags here and there, like `--retry=3 --retry-tag-filter='@webrtc or @http'` and then `--retry=2 --retry-tag-filter='@webrtc or @http or @animal'`. It's OK if it will be built on top of `Cucumber::which_scenario`, but I'd vote to have this in the CLI, as the use cases and ergonomics benefits are quite clear.
Discussed with @tyranron:

- Ditch `.after(all)` entirely, only `.after(3s)` is allowed.
- Retried `Scenario`s are outputted in the same `Feature`/`Rule` branch.
- Add `--retry`, `--retry-after` and `--retry-tag-filter` CLI options.

Let's explore how different sets of CLI options would interact with the following `Feature`:
```gherkin
Feature: Retries

  @retry
  Scenario: First
    Given a moral dilemma

  @flacky @retry(3)
  Scenario: Second
    Given a moral dilemma

  @retry(4).after(5s)
  Scenario: Third
    Given a moral dilemma

  Scenario: Fourth
    Given a moral dilemma
```
Without any CLI options:

1. `Scenario` retried once without a delay
2. `Scenario` retried 3 times without a delay
3. `Scenario` retried 4 times with 5 seconds delay in between
4. `Scenario` isn't retried

`--retry=5`:

1. `Scenario` retried 5 times without a delay
2. `Scenario` retried 3 times without a delay
3. `Scenario` retried 4 times with 5 seconds delay in between
4. `Scenario` retried 5 times

`--retry-tag-filter='@flacky'`:

1. `Scenario` retried once without a delay
2. `Scenario` retried 3 times without a delay
3. `Scenario` retried 4 times with 5 seconds delay in between
4. `Scenario` isn't retried

`--retry-after=10s`:

1. `Scenario` retried once with 10 seconds delay in between
2. `Scenario` retried 3 times with 10 seconds delay in between
3. `Scenario` retried 4 times with 5 seconds delay in between
4. `Scenario` isn't retried

`--retry=5 --retry-after=10s`:

1. `Scenario` retried 5 times with 10 seconds delay in between
2. `Scenario` retried 3 times with 10 seconds delay in between
3. `Scenario` retried 4 times with 5 seconds delay in between
4. `Scenario` retried 5 times with 10 seconds delay in between

`--retry=5 --retry-tag-filter='@flacky'`:

1. `Scenario` retried once without a delay
2. `Scenario` retried 3 times without a delay
3. `Scenario` retried 4 times with 5 seconds delay in between
4. `Scenario` isn't retried

`--retry=5 --retry-after=10s --retry-tag-filter='@flacky'`:

1. `Scenario` retried 5 times with 10 seconds delay in between
2. `Scenario` retried 3 times with 10 seconds delay in between
3. `Scenario` retried 4 times with 5 seconds delay in between

Thank you very much for the hard work @ilslv @tyranron. That's a fantastic implementation!
I was wondering if it's something that we can add to the framework. We already have a function to repeat the failed tests' output, but we don't have a function to run failed tests again. It's common to have flaky tests that can sometimes be solved by running them again.
For example: `.replay_failed(2)`, where `2` is the maximum number of times the tests should be run again in case of test failures. Let's say we have the tests `A`, `B`, `C` and `D`:

- During the first run only `A` passes
- Then `B`, `C` and `D` are run again (1 replay left if we have new test failures)
- Only `B` passes, we run again `C` and `D` (0 replay left)
- `D` fails again, this one is certainly more than just unstable

Regarding the output we have two choices:
- … (`D` in this example)
- Or just print the result once the last replay is done (which can be the maximum, here `2`, or a previous run if all tests are passing)