Automattic / wp-e2e-tests

Automated end-to-end tests for WordPress.com
https://github.com/Automattic/wp-calypso
GNU General Public License v2.0

Auto-retry failing tests #1211

Closed dmsnell closed 6 years ago

dmsnell commented 6 years ago

It's been a while since I had a PR which truly failed the e2e tests, but it's also been a while since I had a PR where the e2e tests didn't fail. In almost every case I'm able to resolve the error by restarting the tests.

This is frustrating because it eats away at my trust in the e2e tests and because it adds several long steps to my process of deploying PRs. First I have to prepare the PR, then wait for the e2e (or canary) tests to run, then come back to the PR (because they take at least several minutes to run) and see whether they failed or succeeded, then open the results, read that the failure was for an unrelated reason, click to restart the tests, and repeat until they pass.

I realize that there are bugs in our tests, but I think that we can open up a path to find and fix those bugs while minimizing the impact they have on developers iterating on Calypso. If we had an auto-restart feature which would retry failing tests up to three times before failing them for real, then most of these test failures could be gracefully handled.

On each test run, if it fails, trap the output and try again. If it succeeds during a retry attempt, bump a stat and record the output of the failed cases to a table for deeper investigation. That way I wouldn't have to investigate so many PRs that give the impression that they fail the tests when actually it's the tests themselves which are failing.
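
To make the idea concrete, here is a minimal sketch of the retry-and-record loop described above. It is not the wp-e2e-tests implementation: `run_with_retries`, `RetryResult`, and the `max_retries` parameter are hypothetical names, and it assumes "up to three times" means three retries after the first attempt. A real runner would write the captured failures to a stats table instead of keeping them in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RetryResult:
    passed: bool
    attempts: int
    # Output captured from each failed attempt, kept for later investigation.
    failure_outputs: List[str] = field(default_factory=list)

    @property
    def flaky(self) -> bool:
        # Passed, but only after at least one failed attempt.
        return self.passed and len(self.failure_outputs) > 0


def run_with_retries(run_test: Callable[[], None], max_retries: int = 3) -> RetryResult:
    """Run a test, retrying up to max_retries times after the first failure.

    Assumption: run_test signals failure by raising an exception.
    """
    failure_outputs: List[str] = []
    total_attempts = 1 + max_retries
    for attempt in range(1, total_attempts + 1):
        try:
            run_test()
        except Exception as exc:
            # Trap the output of the failed attempt and try again.
            failure_outputs.append(f"attempt {attempt}: {exc!r}")
        else:
            return RetryResult(True, attempt, failure_outputs)
    # Every attempt failed: this is a failure "for real".
    return RetryResult(False, total_attempts, failure_outputs)
```

A result with `flaky` set is the case worth bumping a stat for: the PR is fine, but the recorded `failure_outputs` point at a test that needs fixing.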

alisterscott commented 6 years ago

Thanks for raising this @dmsnell

We already had a retry in place which was set to 2 times - which on the calypso.live environment doesn't seem to be enough. I've upped this to 3 times in PR #1210, so 🤞 this increases the stability of results.

Let's wait and see.

blowery commented 6 years ago

> We already had a retry in place which was set to 2 times - which on the calypso.live environment doesn't seem to be enough.

Have y'all been seeing a consistent failure mode with dserve?

blowery commented 6 years ago

I've noticed that when the e2e tests fail, they tend to fail with timeouts, and the default timeout seems to be 20s. 20s feels like a long time to wait when we get into a failure condition.

It would be interesting to know how long success typically takes (mean and stddev) and then set the failure timeout to the mean + 2 standard deviations (assuming that's less than 20s). That might let us notice failure more quickly and speed up the test runs.
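
As a rough sketch of that calculation: given the durations of steps that succeeded, take the mean plus two sample standard deviations and never exceed the current 20s default. The function name `suggested_timeout` and the sample durations in the usage lines are made up for illustration, not measurements from these tests.

```python
from statistics import mean, stdev
from typing import Sequence


def suggested_timeout(
    success_durations_s: Sequence[float],
    cap_s: float = 20.0,
    k: float = 2.0,
) -> float:
    """Return mean + k * stddev of successful durations, capped at cap_s.

    Falls back to cap_s when there are too few samples to estimate spread.
    """
    if len(success_durations_s) < 2:
        return cap_s
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    candidate = mean(success_durations_s) + k * stdev(success_durations_s)
    return min(cap_s, candidate)


# Hypothetical durations, in seconds, of steps that passed.
durations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(f"suggested timeout: {suggested_timeout(durations):.2f}s")
```

If the wait times are heavily skewed, a high percentile of the observed durations may be a better cutoff than mean + 2 standard deviations, since that rule assumes a roughly symmetric distribution.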

alisterscott commented 6 years ago

Checking back in, I think we've resolved the stability issues by fixing the magellan permissions issue (explicitly setting the permissions) and through dserve enhancements for live branches.

Some links:

https://circleci.com/build-insights/gh/Automattic/wp-e2e-tests-canary/master

[Screenshot taken 2018-06-13 at 5:38:27 pm]

https://circleci.com/build-insights/gh/Automattic/wp-e2e-tests-for-branches/master

[Screenshot taken 2018-06-13 at 5:41:00 pm]

alisterscott commented 6 years ago

I'm going to close this one since we use retries and the canaries are stable:

[Screenshot taken 2018-07-13 at 3:37:31 pm]