Support for Array API

Hello folks, I want to ascertain whether Hypothesis is interested in having generalised "library-agnostic" strategies for the Array API libraries (NumPy, TensorFlow, PyTorch, MXNet, JAX, Dask & CuPy are listed as primary stakeholders). If so I would be up to implement such strategies and open a PR in a few weeks, but I would need guidance.

I've been developing such strategies at honno/hypothesis-array-api with heavy reference to hypothesis.extra.numpy and the related internal test suite. These strategies have no dependencies and just assume an Array API-compliant module has been monkey-patched to the variable array_module.

No library fully adopts the standard right now (NumPy is getting close, see numpy/numpy#18585), but using only some key parts of the API should get us a powerful arrays() strategy. In areas of non-compliance I've been throwing errors on missing required attributes/methods, and warning users when we can still generate some things. For example, PyTorch doesn't support all the unsigned integers specified in the Array API (only uint8), so if a user uses the unsigned_integers_dtypes() strategy with no arguments, only the uint8 dtype will be generated and the user is warned that the dtypes uint16, uint32 & uint64 are not available.
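To illustrate the warn-but-continue behaviour described above, here is a minimal sketch using only the standard library. The function name available_unsigned_dtypes and the fake_torch stand-in are hypothetical and are not taken from honno/hypothesis-array-api; the sketch returns the dtype objects rather than a Hypothesis strategy.

```python
import warnings
from types import SimpleNamespace

# Unsigned integer dtype names required by the Array API standard.
UINT_NAMES = ("uint8", "uint16", "uint32", "uint64")


def available_unsigned_dtypes(xp):
    """Return the unsigned dtypes that `xp` exposes (hypothetical helper).

    Raises if none are available; warns if only some are missing.
    """
    found = [getattr(xp, name) for name in UINT_NAMES if hasattr(xp, name)]
    missing = [name for name in UINT_NAMES if not hasattr(xp, name)]
    label = getattr(xp, "__name__", repr(xp))
    if not found:
        raise AttributeError(f"{label} exposes no unsigned integer dtypes")
    if missing:
        warnings.warn(
            f"{label} does not provide: {', '.join(missing)}", UserWarning
        )
    return found


# Stand-in for a library that only ships uint8 (an assumption for the demo).
fake_torch = SimpleNamespace(__name__="fake_torch", uint8="uint8")
```

Calling available_unsigned_dtypes(fake_torch) returns a one-element list and emits a single UserWarning naming uint16, uint32 and uint64; a real strategy would presumably sample from the returned list.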

My biggest concern is how users should tell Hypothesis which Array API library to use. My current plan is to have the array module as an optional kwarg in the strategies, which, if not specified, defaults to a global variable set by a register_array_module() function.
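A rough sketch of how that plan could look, assuming a module-level registry. Only the name register_array_module comes from the text above; resolve_array_module, the _state dict and the stand-in modules are hypothetical, and a plain function stands in for a strategy.

```python
from types import SimpleNamespace

# Hypothetical registry holding the default array module.
_state = {"array_module": None}


def register_array_module(xp):
    """Set the default array module used when none is passed explicitly."""
    _state["array_module"] = xp


def resolve_array_module(array_module=None):
    """Prefer the explicit kwarg, else fall back to the registered module."""
    if array_module is not None:
        return array_module
    if _state["array_module"] is None:
        raise RuntimeError("no array module was passed or registered")
    return _state["array_module"]


def zeros_of_length(n, *, array_module=None):
    """Stand-in for a strategy: builds an array with the resolved module."""
    xp = resolve_array_module(array_module)
    return xp.asarray([0] * n)


# Stand-in modules so the sketch runs without any array library installed.
lib_a = SimpleNamespace(asarray=lambda obj: ("lib_a", obj))
lib_b = SimpleNamespace(asarray=lambda obj: ("lib_b", obj))
```

After register_array_module(lib_a), a bare zeros_of_length(2) uses lib_a, while zeros_of_length(2, array_module=lib_b) overrides the default for that call only.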

Do note that the limited feature set of the API means some functionality from hypothesis.extra.numpy could not be achieved through a library-agnostic approach. Additionally, helpful properties of the NumPy strategies, such as array_shapes() not accepting dimensions above 32 due to NumPy's limits, would either require some checks at runtime or be a nicety scrapped altogether... maybe a purely library-agnostic approach first would let us determine whether library-specific checks could be included nicely or not. My first thought is just keeping the numpy extra as-is and having an arrays submodule or something similar be standalone.

So yeah, I'm interested to hear if these strategies could see a future inside Hypothesis, and otherwise I'd generally appreciate input. My priority is to make honno/hypothesis-array-api feature complete and emulate hypothesis.extra.numpy concepts like fill values in arrays(), and then if appropriate I'll work on a PR.

Please ask me any questions or if you need clarification on something! I'll cc @asmeurer as they tasked me to create library-agnostic Array API strategies to extend the use of Hypothesis in the Array API's compliance suite data-apis/array-api-tests and may have some ideas.

For some more context on this: @Zac-HD and I discussed this a bit at some point last year, when I started working on the array API test suite. At the time Zac was open to the idea, but it hasn't yet been implemented. I have since developed quite a bit of the array API test suite, which uses hypothesis extensively. However, the parts of the suite that generate arrays currently only generate constant arrays, because the arrays() strategy hard-codes NumPy. We do not want NumPy to be a dependency of the test suite (unfortunately it currently is, because we use the mutually_broadcastable_shapes strategy). I also was not able to just copy and modify the arrays code into the test suite because of licence differences. So at present, making the arrays and mutually_broadcastable_shapes strategies independent of any particular array library, and not importing NumPy unless NumPy is the array library being used, would directly help the array API test suite. But more broadly, support for this would allow people to use hypothesis with a large number of popular libraries like PyTorch, TensorFlow, JAX, CuPy, Dask, etc.

For those strategies that would be used in the array API test suite (arrays() in particular), we need to be careful not to use any APIs that aren't part of the array API specification, as that would defeat the whole purpose of using it in the array API test suite. The good news here is that, for the dtypes and indexing strategies, the array API test suite does not use the ones in hypothesis.extra.numpy at all. This is because the array API spec has a very limited set of dtypes and specifies a very limited subset of required indexing semantics, so I have instead built very carefully handcrafted strategies that exactly match the array API spec. So outside of basically arrays(), a more pragmatic approach may be needed for the time being, given that no library presently supports the array API specification 100%. This may include, for instance, special-casing behaviors and APIs for specific libraries. The array API specification also has nothing to say about several things in the current hypothesis.extra.numpy module, e.g., string dtypes are currently not mentioned at all in the array API spec. It may make sense to limit those to just NumPy for now.

I'd be very happy to ship a (e.g.) hypothesis.extra.array_api module - the standard is a very exciting development, and I'd love Hypothesis to have great support and help library maintainers and consumers to adopt it.

I'm a little concerned about stability, in that the standard doesn't seem quite final yet, nor do we have independent implementations of the standard. This can easily be addressed, though: shipping an external hypothesis-array-api package will be more flexible for users than an explicitly-experimental module in Hypothesis, as it avoids forcing Hypothesis updates just to get a working version of these strategies in future.
Unfortunately I don't have time for ongoing unpaid engagement on the project. Happy to review once you think it's ready, or to set up a consulting thing, but the reason that nothing happened since Aaron and I last spoke is that any time I get free from my PhD and other work is going to be spent fixing bugs rather than adding new features.
I think we should keep independent strategies for NumPy and the generic array API; they're actually pretty different when we look at the details of everything from dtypes or endianness to (scalar) array shapes.
I'm not a fan of the global array_module constant; it seems likely that this would make differential testing of multiple modules pretty awkward. Perhaps a function get_strategies_namespace(array_module), returning a SimpleNamespace or similar of functions-returning-strategies with the array_module bound in?
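The namespace idea in the last point could look roughly like the sketch below. Only get_strategies_namespace and the use of SimpleNamespace come from the suggestion above; the helper names and the two entries in the namespace are hypothetical, and the bound functions return plain lists of dtypes rather than real Hypothesis strategies so that the sketch stays dependency-free.

```python
from functools import partial
from types import SimpleNamespace

INT_NAMES = ("int8", "int16", "int32", "int64")
FLOAT_NAMES = ("float32", "float64")


def _dtypes(xp, names):
    """Collect the dtypes from `names` that the module `xp` actually has."""
    return [getattr(xp, name) for name in names if hasattr(xp, name)]


def _integer_dtypes(xp):
    return _dtypes(xp, INT_NAMES)


def _floating_dtypes(xp):
    return _dtypes(xp, FLOAT_NAMES)


def get_strategies_namespace(xp):
    """Return a namespace of functions with the array module bound in."""
    return SimpleNamespace(
        integer_dtypes=partial(_integer_dtypes, xp),
        floating_dtypes=partial(_floating_dtypes, xp),
    )
```

Because each namespace carries its own module, two libraries can be used side by side in one test (for example ns_a = get_strategies_namespace(lib_a) and ns_b = get_strategies_namespace(lib_b)) without touching any global state, which is what differential testing needs.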

CC @rsokl; I know you're busy but probably also interested.

Glad to hear this could see a future inside Hypothesis :)

Regarding stability, having an external package coexist sounds good. My impression has been that array creation via asarray(), and how it allows for nested sequences of Python builtins, is rather critical for an arrays() strategy and is fortunately well agreed upon, but there will be odd uncertainties like data-apis/array-api#152 which warrant a flexible external package.
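As a sketch of why asarray() with nested sequences is enough for an arrays()-like strategy: flat generated elements can be nested to the target shape in pure Python and then handed to the array module. The helpers nest and build_array are hypothetical names, and the sketch assumes the module's asarray accepts a dtype keyword, as the Array API standard specifies.

```python
from math import prod


def nest(flat, shape):
    """Reshape a flat list into nested lists matching `shape`.

    An empty shape yields the single scalar element.
    """
    if prod(shape) != len(flat):
        raise ValueError(f"cannot fit {len(flat)} elements into shape {shape}")
    if len(shape) == 0:
        return flat[0]
    step = prod(shape[1:])  # number of elements in each sub-block
    return [
        nest(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


def build_array(xp, flat, shape, dtype=None):
    """Create an array using only asarray() on nested Python builtins."""
    return xp.asarray(nest(flat, shape), dtype=dtype)


print(nest([0, 1, 2, 3, 4, 5], (2, 3)))  # [[0, 1, 2], [3, 4, 5]]
```

With this approach the only thing required of the array library is asarray() itself, so the element values can come from ordinary Python-level generation.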

And yeah, the array_module constant is awkward. I will play around with a "register mechanism" with a get_strategies_namespace()-ish function and get feedback from folks like @asmeurer who would be using these Array API strategies.

I'll be figuring out the implementation details for now externally but I will get to work on a hypothesis.extra.array_api PR at some point... maybe I'll have something ready for review end of August. I'll of course be watching this issue if there is any more input until then.

HypothesisWorks / hypothesis

Support for Array API #3037