Add a strategy for new Numpy PRNGs

rsokl commented 3 years ago

This is going to be a somewhat sprawling issue. All of the topics here involve Hypothesis' approaches to making random code deterministic. I will happily close this and turn it into a collection of modular issues/PRs, but first I want to lay everything out and get @Zac-HD's input.

Weakrefs

(Addressed in #3135)

We should only make weak references to the generators that we manage (as well as other "register" functions that Hypothesis provides).

NumPy

NumPy has moved away from its old global random state (e.g. np.random.seed, np.random.uniform, etc.) in favor of a new RNG system that uses a combination of bit-generators and generators. This API is very different from those of global-state RNG systems. Presently, it is not clear how a user should have Hypothesis make their numpy.random code deterministic.

To me, the bare minimum would involve identifying the appropriate substitutes for seed, get_state, and set_state in terms of the new bit-generator/generator system, and providing a shim to make it trivial for users to register this new source of RNG.
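As a rough illustration of what such a shim might look like: a Generator's state can be read and written through its bit_generator.state property, which covers the get_state/set_state half. The class below is a minimal sketch, not an existing Hypothesis or NumPy API; the name GeneratorShim is hypothetical, and the re-seeding approach (copying the state of a freshly seeded bit generator of the same type) is an assumption about what a "seed" substitute could be.

```python
import copy

import numpy as np


class GeneratorShim:
    """Hypothetical adapter giving a numpy Generator seed/getstate/setstate methods."""

    def __init__(self, rng):
        self.rng = rng

    def seed(self, value):
        # Assumption: re-seed in place by copying the state of a freshly
        # seeded bit generator of the same type (e.g. PCG64).
        fresh = type(self.rng.bit_generator)(value)
        self.rng.bit_generator.state = fresh.state

    def getstate(self):
        # Deep-copy so that later draws cannot alter the saved snapshot.
        return copy.deepcopy(self.rng.bit_generator.state)

    def setstate(self, state):
        self.rng.bit_generator.state = state


rng = np.random.default_rng(123)
shim = GeneratorShim(rng)

saved = shim.getstate()
first = rng.random(3)
shim.setstate(saved)   # rewind the generator to the snapshot
second = rng.random(3)
print(np.array_equal(first, second))
```

Since hypothesis.register_random is documented as accepting objects with seed, getstate, and setstate methods, an adapter along these lines could presumably be registered directly, provided the caller keeps a reference to it.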

A much more ambitious goal is to still, magically, handle all of this for the user. The only thing that comes to mind is to have NumPy register the creation of new generators, and we then tap into that registry to manage those generators. I would not be surprised if NumPy (understandably) does not want to do this.

Some near-term To-Dos:

Become familiar with the new system, and assess if there are obvious substitutes for seed, get_state, and set_state for users to leverage
Post to NumPy's mailing list about our desire to make random tests behave deterministically -- under this new system -- and see if anyone has any ideas

Useful reference material

PyTorch

See if PyTorch is willing to add a plugin so that Hypothesis will manage their global generator like this (but with register_random instead of register_type_strategy).

Additionally, torch also supplies a Generator. I recall reading that PyTorch was planning to redesign things like DataLoaders to accept generators, which is similar to the new best practices for NumPy's RNG. Thus, any solution we cook up for the NumPy case should be designed to be future-compatible here as well.

Edit: I just realized that PyTorch actually uses Hypothesis for some of its tests. As far as I can tell, they do not use register_random in their test suite.

Zac-HD commented 2 years ago

Isn't the whole point of these new interfaces that users explicitly pass the generator object around?

If so, we only need to register the global PRNG that the generators are seeded off, and everything will work from there.

rsokl commented 2 years ago

Isn't the whole point of these new interfaces that users explicitly pass the generator object around?

Yep, that is correct!

we only need to register the global PRNG that the generators are seeded off

My understanding is that the generator objects are not seeded off of a global generator, and that they can only be seeded independently; I think being able to use a global PRNG would defeat the purpose of numpy's redesign. The reason why the new system expects folks to pass around generator objects is that those generator objects can be used/seeded without concern that, in some other portion of the code, the generator object is silently getting re-seeded.
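A small check of this independence, as a sketch: an explicitly seeded Generator should produce the same values no matter how the legacy global state was seeded, while an unseeded one draws fresh entropy from the OS each time. The helper names here are only for illustration.

```python
import numpy as np


def seeded_sample(global_seed, local_seed=42):
    # Seed the legacy global state, then draw from an explicitly seeded Generator.
    np.random.seed(global_seed)
    return np.random.default_rng(local_seed).random(3)


def unseeded_sample(global_seed):
    # Seed the legacy global state, then draw from an unseeded Generator.
    np.random.seed(global_seed)
    return np.random.default_rng().random(3)


# The explicitly seeded Generator ignores the global seed entirely.
print(np.array_equal(seeded_sample(0), seeded_sample(1)))  # True

# The unseeded Generator is not controlled by np.random.seed either,
# so two calls with the same global seed still differ.
print(np.array_equal(unseeded_sample(0), unseeded_sample(0)))
```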

Zac-HD commented 2 years ago

So what do we need to do then? I was thinking of monkeypatching np.random.default_rng() to use a known seed when passed None, instead of (or by) controlling the PRNG that seed would otherwise be drawn from.

If the user passes an explicitly-seeded PRNG, it should be pretty obvious what's happening when or if we raise Flaky.
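A minimal sketch of the monkeypatching idea, assuming it is acceptable to temporarily replace the np.random.default_rng attribute: the context manager below (a hypothetical name, not part of Hypothesis) substitutes a fixed seed only when the caller passes None, and leaves explicit seeds untouched.

```python
import contextlib

import numpy as np


@contextlib.contextmanager
def deterministic_default_rng(fixed_seed=0):
    """Hypothetical helper: make unseeded default_rng() calls use fixed_seed."""
    original = np.random.default_rng

    def patched(seed=None):
        # Only intervene when the caller asked for a random seed.
        return original(fixed_seed if seed is None else seed)

    np.random.default_rng = patched
    try:
        yield
    finally:
        # Always restore the real function, even if the body raised.
        np.random.default_rng = original


with deterministic_default_rng(fixed_seed=7):
    a = np.random.default_rng().random(3)
    b = np.random.default_rng().random(3)

print(np.array_equal(a, b))
```

One limitation of this sketch: code that ran `from numpy.random import default_rng` before the patch was applied holds a reference to the original function and would not be affected.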

rsokl commented 2 years ago

When making this post, my thought was that we would identify the appropriate substitutes for seed, get_state, and set_state in terms of the new bit-generator/generator system, and provide a shim to make it trivial for users to register their new sources of RNG. I still think that this is a good path forward, although I'll be interested if folks from the NumPy mailing list have other ideas.

I am hoping to eventually find some time to loop back and hit some of the To-Dos that I laid out in my original post. It is just a matter of me scrounging up time to do so.

rsokl commented 2 years ago

Oh! We could also make a strategy in hypothesis.extra.numpy that hands a user a generator that they can pass to their test code/other strategies, and that we manage for them (this would still involve our figuring out the seed/get_state/set_state substitutes)! This probably is an even more convenient and obvious (and easy to document) solution for users.

Zac-HD commented 2 years ago

Based on a quick conversation, we plan to:

add a new strategy npst.rngs() (todo better name), which will basically be st.builds(np.random.default_rng, st.integers()) with a nicer repr - much like st.randoms(use_true_random=True).
have our seed-and-restore logic monkeypatch default_rng() in order to use a constant seed instead of a random seed, much like we set the state for global Random instances (or use a drawn seed with st.random_module(), etc.). People should use the former, but it's important that we give a nice user experience even when they don't follow best practices.
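The first item of this plan can be sketched downstream in a few lines. This is an illustration only: the name rngs is a placeholder taken from the comment above, no nicer repr is implemented, and the seed strategy is restricted to non-negative integers because np.random.default_rng rejects negative seeds.

```python
import numpy as np
from hypothesis import given, strategies as st


def rngs():
    """Placeholder strategy: Generators seeded from a drawn non-negative integer."""
    # Assumption: min_value=0, since default_rng raises on negative seeds.
    return st.builds(np.random.default_rng, st.integers(min_value=0))


@given(rngs())
def test_uniform_samples_in_unit_interval(rng):
    # The Generator is created from a seed Hypothesis drew, so a failing
    # example can be replayed with the same seed.
    x = rng.random(5)
    assert ((0 <= x) & (x < 1)).all()
```

Because the seed is part of the drawn example, the test code receives a Generator it can pass along explicitly, which matches the usage pattern the new NumPy API encourages.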

matteoacrossi commented 9 months ago

Are there any plans to address this issue?

Zac-HD commented 9 months ago

Hypothesis is an all-volunteer project, and so far people have been volunteering on other issues instead.

If you're interested in helping out, I'm very happy to support that through advice, code review, and so on 😊

matteoacrossi commented 9 months ago

I would love to contribute but I don't know the internal workings of hypothesis. I was looking at #3510, is that a good starting point?

Zac-HD commented 9 months ago

Yep, that's a great place to start!

I think this should be a pretty self-contained change. It'd be perfectly feasible to implement this strategy downstream; we want to provide it in hypothesis.extra.numpy to make users' lives easier rather than because it needs internals 🙂
