The flaky tests are occasionally annoying, as sometimes they fail. They also get in the way of tracking down where something goes wrong in a full test run with `pytest -v --pdb`, because you'll be dropped into the debugger on the first failure of a flaky test even when a later retry would have passed.
Make all the flaky tests non-flaky by using predefined random seeds.
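A minimal sketch of what that could look like, assuming the tests draw randomness from Python's `random` module and NumPy (the fixture name and the seed value are placeholders):

```python
import random

import numpy as np
import pytest


@pytest.fixture(autouse=True)
def fixed_seed():
    # Reseed before every test so each run is reproducible.
    # 12345 is arbitrary; any fixed constant works.
    random.seed(12345)
    np.random.seed(12345)
```

Making the fixture `autouse` means no individual test has to be edited; if some tests use their own generator objects, those would need to be seeded explicitly instead.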
There's an argument to be made that we're testing the code statistically by running these tests with different random seeds. However, since what we always end up doing when a flaky test fails is just rerunning it (rather than investigating why it failed), this isn't really working. What's more, a retry count of 3 or 5 isn't a lot of statistical samples; the tests fail often enough that we would waste a lot of time investigating why the flaky tests sometimes failed if we actually did that investigation. As a result, they're an annoyance that isn't really serving the purpose it nominally serves.
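For context, the retry counts above are the kind of thing set by the `flaky` decorator; a hypothetical example of the pattern being criticized (the test body is illustrative, not one of ours):

```python
import random

from flaky import flaky


# max_runs=3: rerun up to 3 times; min_passes=1: one passing run counts as success.
@flaky(max_runs=3, min_passes=1)
def test_sample_mean_is_near_half():
    # A statistical assertion like this fails by chance now and then,
    # which is exactly the annoyance described above.
    samples = [random.random() for _ in range(100)]
    assert abs(sum(samples) / len(samples) - 0.5) < 0.05
```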
It would be worth putting in statistical tests that do real statistical testing (that is, run enough trials that the statistics are robust) to make sure variances are coming out more or less where expected, but then perhaps gate those behind an environment variable so that they aren't run as part of GitHub Actions.
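A sketch of what that gating could look like, using `pytest.mark.skipif` with a hypothetical `RUN_STATISTICAL_TESTS` variable (the test body is just an illustration of a high-trial-count variance check):

```python
import os
import random
import statistics

import pytest

# Skip unless explicitly requested, e.g. RUN_STATISTICAL_TESTS=1 pytest
# (the variable name is a placeholder).
requires_stats = pytest.mark.skipif(
    not os.environ.get("RUN_STATISTICAL_TESTS"),
    reason="set RUN_STATISTICAL_TESTS=1 to run the slow statistical tests",
)


@requires_stats
def test_uniform_variance():
    # Enough trials that the sample variance should land very close to
    # the true value of 1/12 for a uniform distribution on [0, 1).
    samples = [random.random() for _ in range(100_000)]
    assert abs(statistics.variance(samples) - 1 / 12) < 0.002
```

With 100,000 trials the tolerance above is many standard errors wide, so the test is effectively deterministic while still checking the statistics.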