wrongu opened this 1 year ago
We need to work on this issue to make CI less of a hassle, but I’m not convinced fixing a seed is the approach we should take.
Merging #31 (ccee362) into main (f845628) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##             main      #31   +/-  ##
=======================================
  Coverage   95.76%   95.76%
=======================================
  Files           9        9
  Lines         260      260
=======================================
  Hits          249      249
  Misses         11       11
This could alternatively be configured to run twice, once from a seed and once without. Or n times with n seeds. Or any further combination.
Ultimately up to you of course. Just a suggestion.
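For what it's worth, here is a minimal sketch of how the "once from a seed and once without" idea could look as a parametrized pytest fixture; the fixture name, seed value, and use of numpy are illustrative choices, not anything taken from this PR:

```python
# conftest.py -- illustrative sketch only; fixture name and seed are hypothetical.
import numpy as np
import pytest

# One fixed seed plus None (fresh entropy); add more seeds to run n times with n seeds.
SEEDS = [12345, None]

@pytest.fixture(params=SEEDS, ids=["seeded", "unseeded"])
def rng(request):
    """Any test that requests this fixture runs once per entry in SEEDS."""
    return np.random.default_rng(request.param)
```

A test that takes `rng` as an argument would then run deterministically once and with fresh entropy once.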
I didn't see a related issue this solves, but the PR indicates it's about failures for some RNG seeds. So the obvious question is why there are failures. Is it just that the test interval is too strict, e.g. testing that 95% intervals cover the true values, which then fails 5% of the time? In that case, we can either (a) widen the test interval, or (b) do repeated tests (if we do 20 tests, we expect binomial(.05, 20) failures and can test that expectation), or (c) fix seeds.
The downside of (a) is that it reduces test power, the downside of (b) is a lot more computation, and the downside of (c) is brittle tests. I don't think there's an optimal solution, so we just need to pick something.
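To make option (b) concrete, a rough sketch: the stochastic check below is a stand-in, and the 20-run / 2-failure threshold is only an example of tolerating roughly the binomial(.05, 20) expected number of failures instead of requiring a single run to pass.

```python
import numpy as np

def interval_covers_truth(rng, n=100):
    """Stand-in stochastic check: does a ~95% CI for the sample mean cover the true mean (0)?"""
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(n)
    return abs(x.mean()) < half_width

def test_coverage_repeated():
    # Option (b): repeat the stochastic check and allow a small number of failures,
    # rather than letting one unlucky draw fail the whole test.
    n_runs, max_failures = 20, 2  # illustrative thresholds
    failures = sum(
        not interval_covers_truth(np.random.default_rng(seed)) for seed in range(n_runs)
    )
    assert failures <= max_failures
```

Using fixed per-run seeds as above makes the repeated check reproducible; dropping them turns it back into a purely statistical bound.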
Yes, the tests for e.g. HMC have tolerances that mean they still fail on some runs.
Our current test suite runs in ~12 seconds, so we have a lot of headroom to run more iterations/tests, but we need to keep an eye on this as the number of algorithms increases. At the moment, for the purposes of reviewing, I've been doing a sort of hand-wavey equivalent of (b) by checking that only one or two test runs (out of the 12 we do in our matrix of Python versions and OSes) failed.
I recently stumbled on https://github.com/pytest-dev/pytest-rerunfailures, which could be another approach we use (maybe only as a stopgap).
I see the appeal of pytest-rerunfailures, especially as a stopgap.
Brian's "hand-wavey" strategy seems good to me, especially if we just couch it in stats-speak. Modify Bob's suggestion (b) to test for 90% coverage (or some such number not too much smaller than 95%), e.g. at least 18 of 20 repeated tests pass, instead of boolean testing the expectation. I think this would allow more wiggle room while still protecting test power.
Then pytest-rerunfailures maybe once or twice, in case we didn't build in enough wiggle room and we find ourselves frustrated by the tests.
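If we do reach for pytest-rerunfailures, the lightest-touch usage I know of is its per-test marker (or the suite-wide `--reruns` flag); the test name, rerun count, and tolerance below are placeholders rather than anything from our suite.

```python
import numpy as np
import pytest

# Requires the pytest-rerunfailures plugin.
@pytest.mark.flaky(reruns=2)
def test_unseeded_sample_mean():
    x = np.random.default_rng().normal(size=100)
    assert abs(x.mean()) < 0.3  # occasionally fails; reruns absorb the unlucky draws
```

The equivalent suite-wide behaviour is `pytest --reruns 2`, which avoids sprinkling markers over individual tests.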
Add conftest.py and use it to set a random seed for all pytest tests.
This PR makes the pytest runs deterministic and solves the problem of tests randomly failing due to unlucky RNG draws. There are pros and cons to this, of course. I find that it tends to help in CI settings.
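For reference, a minimal sketch of the kind of conftest.py described here; the seed value and the choice of which global RNGs to reseed are assumptions for illustration, not necessarily what the PR's actual file does.

```python
# conftest.py -- illustrative sketch; the seed and the RNGs reseeded are assumptions.
import random

import numpy as np
import pytest

@pytest.fixture(autouse=True)
def _fixed_seed():
    """Reseed the global RNGs before every test so runs are deterministic."""
    random.seed(0)
    np.random.seed(0)
```

Because the fixture is autouse, every test in the suite picks it up without any changes to the tests themselves.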