flatironinstitute / bayes-kit

Bayesian inference and posterior analysis for Python
MIT License

Make tests deterministic from seed #31

Open wrongu opened 1 year ago

wrongu commented 1 year ago

Add conftest.py and use it to set a random seed for all pytest tests.

This PR makes pytest deterministic and solves the problem of tests randomly failing due to unlucky RNG. There are pros and cons to this, of course. I find that it tends to help in CI settings.
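For anyone who hasn't opened the diff, the idea is roughly the following (the fixture name and which RNGs get seeded are illustrative here, not necessarily exactly what the diff does):

```python
# conftest.py -- illustrative sketch of seeding every test
import random

import numpy as np
import pytest


@pytest.fixture(autouse=True)
def fixed_seed():
    """Seed the global RNGs before every test so runs are reproducible."""
    random.seed(1234)
    np.random.seed(1234)
```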

WardBrian commented 1 year ago

We need to work on this issue to make CI less of a hassle, but I'm not convinced that fixing a seed is the approach we should take.

codecov-commenter commented 1 year ago

Codecov Report

Merging #31 (ccee362) into main (f845628) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##             main      #31   +/-   ##
=======================================
  Coverage   95.76%   95.76%           
=======================================
  Files           9        9           
  Lines         260      260           
=======================================
  Hits          249      249           
  Misses         11       11           


wrongu commented 1 year ago

This could alternatively be configured to run each test twice, once with a fixed seed and once without. Or n times with n different seeds. Or any combination of the above (rough sketch below).

Ultimately up to you of course. Just a suggestion.
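A rough sketch of the n-seeds variant, assuming the same autouse-fixture approach (the seed values are arbitrary):

```python
# conftest.py -- sketch: run every test once per seed in SEEDS
import random

import numpy as np
import pytest

SEEDS = [1234, 5678, 9012]  # arbitrary example seeds


@pytest.fixture(autouse=True, params=SEEDS)
def fixed_seed(request):
    """Parametrize every test over a small set of fixed seeds."""
    random.seed(request.param)
    np.random.seed(request.param)
```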

bob-carpenter commented 1 year ago

I didn't see a related issue that this solves, but the PR indicates the motivation is failures under some RNG seeds. So the obvious question is why there are failures. Is it just that a test interval is too strict, e.g. testing that 95% intervals cover the true values, which by construction fails 5% of the time? In that case, we can (a) widen the test interval, (b) do repeated tests (if we do 20 tests, we expect binomial(.05, 20) failures and can test that expectation), or (c) fix seeds.

The downside of (a) is that it reduces test power, the downside of (b) is a lot more computation, and the downside of (c) is brittle tests. I don't think there's an optimal solution, so we just need to pick something.
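To put rough numbers on (b): with 20 independent replications of a test calibrated to fail 5% of the time, we expect 1 failure, and 4 or more failures has probability around 1.6%, so that would make a reasonable rejection threshold. Quick back-of-the-envelope check (scipy here is just for the arithmetic, not a proposed dependency):

```python
from scipy.stats import binom

n, p = 20, 0.05             # 20 replications, each designed to fail ~5% of the time
print(binom.mean(n, p))     # expected number of failures: 1.0
print(binom.sf(3, n, p))    # P(4 or more failures) ~= 0.016
```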

WardBrian commented 1 year ago

Yes, the tests for e.g. HMC have tolerances tight enough that they still fail on some runs.

Our current test suite runs in ~12 seconds, so we have a lot of headroom to run more iterations/tests, but we need to keep an eye on this as the number of algorithms increases. At the moment, for the purposes of reviewing, I've been doing a sort of hand-wavey equivalent of (b): checking that only one or two test runs failed out of the 12 we do in our matrix of Python versions and operating systems.

WardBrian commented 1 year ago

I recently stumbled on https://github.com/pytest-dev/pytest-rerunfailures, which could be another approach for us (maybe only as a stopgap).
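If we did go that route, usage looks minimal; as far as I remember from its README it's either a command-line flag (`pytest --reruns 2`) or a per-test marker, roughly like this (the test name is made up):

```python
import pytest


@pytest.mark.flaky(reruns=2)          # retry this test up to 2 times before reporting a failure
def test_some_stochastic_property():  # hypothetical test name, just to show the marker
    ...
```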

roualdes commented 1 year ago

I see the appeal of pytest-rerunfailures, especially as a stopgap.

Brian's "hand-wavey" strategy seems good to me, especially if we couch it in stats-speak. Modify Bob's suggestion (b) to test for 90% coverage (or some number not too much smaller than 95%), e.g. require that at least 18 of 20 repeated tests pass, instead of testing the exact expectation as a boolean. I think this would allow more wiggle room while still protecting test power (rough sketch at the end of this comment).

Then apply pytest-rerunfailures with maybe one or two reruns, in case we didn't build in enough wiggle room and find ourselves frustrated by the tests.
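Concretely, the 18-of-20 idea could be a small helper like this (names and thresholds are just illustrative, not a concrete proposal for where it should live):

```python
def assert_mostly_passes(check, n_reps=20, min_passes=18):
    """Run a stochastic check n_reps times; require at least min_passes successes."""
    passes = 0
    for _ in range(n_reps):
        try:
            check()          # check() is expected to raise AssertionError on failure
            passes += 1
        except AssertionError:
            pass
    assert passes >= min_passes, f"only {passes} of {n_reps} replications passed"
```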