data-apis / consortium-feedback

A repository for discussions related to, and for giving feedback on, the Consortium

Proposal: a high-level test suite to check equivalence of array modules #4

Closed Zac-HD closed 2 years ago

Zac-HD commented 4 years ago

Hi all - @rsokl and I saw the launch announcement a few days ago, and as frequent array users we're very excited.  

That's not why I'm here though: we want to propose that the Consortium provide a high-level test suite for the interoperable or standardised parts of an array API.

We're core developers of the Hypothesis library for property-based testing (or structured/semantic fuzzing). We've done a lot of work on NumPy support and found a bunch of bugs along the way (with writing fixes being the limiting factor), and I wrote up some advice as a paper for SciPy 2020, including the key points for the Consortium.
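To give a flavour of what this looks like in practice, here is a toy property-based test against NumPy itself (an illustration of mine, not an excerpt from the paper):

```python
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra import numpy as npst


@given(npst.arrays(
    dtype=np.float64,
    shape=npst.array_shapes(),
    elements=st.floats(allow_nan=False),
))
def test_sort_is_idempotent(x):
    # Property: sorting an already-sorted, flattened array changes nothing.
    once = np.sort(x, axis=None)
    assert np.array_equal(np.sort(once), once)
```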

We're both keen to see the Consortium succeed, and happy to help out with testing if that would be welcome.

asmeurer commented 4 years ago

Yes, we do plan on having a test suite that corresponds to the spec. I am working on it, and I plan on using Hypothesis for at least parts of it. The in-progress test suite is still in a private repo, but I'll ping you when it is made public.

Actually I did have one concern, which maybe you can help with. We want to be able to generate example arrays for testing. But the test suite has some constraints:

- it can't depend on NumPy (or any other specific array library), since it has to run against an arbitrary array module; and
- the functions it uses to build and manipulate arrays have to be ones the spec itself provides.

So we can't use hypothesis.extra.numpy because it depends on NumPy, and even if it allowed parameterizing the module, we can't control what NumPy functions it uses. I also can't copy the code from it because it is MPL licensed. For now, my plan is to implement a basic random array strategy using the built-in hypothesis primitives (we also only care about basic int and float dtypes, so this isn't a huge deal). This won't be as fancy as what hypothesis.extra.numpy does with sparse arrays, so hopefully it still shrinks and catches corner cases properly. The majority of things that are in the spec are elementwise functions, so it shouldn't really matter for those I'd imagine.
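A minimal sketch of such a strategy, assuming the module under test (called `xp` here) provides only the spec's `asarray()` and `reshape()`; everything else is core Hypothesis, and the size limits are placeholders:

```python
from math import prod

from hypothesis import strategies as st


def basic_arrays(xp, max_dims=3, max_side=4):
    """A NumPy-free array strategy built from core Hypothesis primitives.

    Assumes only that ``xp`` provides asarray() and reshape(), which any
    spec-conforming array module should.
    """
    shapes = st.lists(
        st.integers(0, max_side), min_size=0, max_size=max_dims
    ).map(tuple)
    return shapes.flatmap(
        lambda shape: st.lists(
            st.floats(allow_nan=False, allow_infinity=False),
            min_size=prod(shape),
            max_size=prod(shape),
        ).map(lambda flat: xp.reshape(xp.asarray(flat), shape))
    )
```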

Zac-HD commented 4 years ago

Sounds good - I look forward to seeing it!

> So we can't use hypothesis.extra.numpy because it depends on NumPy, and even if it allowed parameterizing the module, we can't control what NumPy functions it uses. I also can't copy the code from it because it is MPL licensed.
I think that upstream support would probably be easier than you think, and supporting other array libraries is definitely the kind of thing that @rsokl and I would be interested in. Maybe we could set up a call to discuss what you'd need?

Hypothesis' usage of Numpy:

```
$ grep -oEi "np\.([0-9A-Za-z\._]+)" -- extra/numpy.py | sort | uniq -c | sort -r
     19 np.dtype                         # mostly type annotations
      4 np.zeros
      3 np.matmul.signature              # in a docstring
      2 np.newaxis
      2 np.ndarray                       # type annotations
      2 np.issubdtype                    # for integer array indices
      1 np.void.type                     # in comment
      1 np.signedinteger                 # for integer array indices
      1 np.putmask
      1 np.prod
      1 np.lib.function_base._SIGNATURE  # secondary error message handling
      1 np.isnan
      1 np.integer                       # for integer array indices
      1 np.int8                          # docstring
      1 np.full
      1 np.float                         # docstring
      1 np.empty                         # comment explaining we prefer `np.full()`
```

So if we exclude some of the blacker magic for integer_array_indices() and the dtype-generation strategies, I think we only really need putmask(), isnan(), zeros(), and full(). If compatibility is more important than performance, we could literally just create arrays, assign elements to indices, and use the stdlib math module for the rest.
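For example, a compatibility-over-performance stand-in for full() could build nested lists and hand them to the module under test in one go (a rough sketch; the helper names are made up, and only `asarray()` is assumed):

```python
import math


def portable_full(xp, shape, fill_value):
    # Stand-in for np.full(): build nested lists, then one asarray() call.
    # The inner lists are aliased, but asarray() copies the data anyway.
    data = fill_value
    for side in reversed(shape):
        data = [data] * side
    return xp.asarray(data)


def portable_isnan(value):
    # Scalar NaN check via the stdlib, avoiding np.isnan().
    return math.isnan(value)
```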

Is there any documentation I could look at to get a sense for dtypes in this new world? Everything else is going to be straightforward if occasionally fiddly.

TomAugspurger commented 4 years ago

In case it's helpful, pandas has done something similar with our extension array interface. https://pandas.pydata.org/docs/development/extending.html#testing-extension-arrays

Pandas defines the base test classes, and downstream libraries inherit from them:

```python
from pandas.tests.extension import base


class TestConstructors(base.BaseConstructorsTests):
    pass
```

Downstream libraries are then responsible for providing data as pytest fixtures (which may be generated by hypothesis).
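A downstream fixture might look something like this (illustrative only; `MyArray` and `mylib` are placeholders for the library's ExtensionArray subclass and package):

```python
import pytest

from mylib import MyArray  # placeholder: the downstream ExtensionArray


@pytest.fixture
def data():
    # The pandas extension tests expect a length-100 array of valid values.
    return MyArray._from_sequence(list(range(100)))
```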

Zac-HD commented 4 years ago

> data as pytest fixtures (which may be generated by hypothesis)

I'm definitely on board with using Hypothesis to generate data, but piping the generated data through pytest fixtures breaks both shrinking and replay of previous failing examples, since Hypothesis can only shrink or replay values it draws inside the test function itself. You still get the lovely API for describing random data, but lose two killer features... fortunately it's easy to design alternative mechanisms which don't have that problem :grin:

e.g. use st.data(), so you can define the array module as a class attribute - this could look something like:

```python
# For downstream users
import torch


class TorchArrayTests(ArrayModuleTests):
    array_module = torch


# Implementation sketch
import abc

import numpy as np
from hypothesis import given, strategies as st


class ArrayModuleTests(abc.ABC):
    @property
    @abc.abstractmethod
    def array_module(self):
        """The array module to compare to Numpy."""
        raise NotImplementedError

    def __init__(self):
        # Magic introspection and test-method-generation code here.
        # Results look something like:
        @given(st.data())
        def test_op(self, data):
            args = magic_get_args_for(self.array_module, "op", data)
            assert np.op(*args) == self.array_module.op(*args)  # for example
```

asmeurer commented 4 years ago

> Maybe we could set up a call to discuss what you'd need?

Yes, let's set up a call. Can you email me at asmeurer@quansight.com and we can coordinate?

Zac-HD commented 2 years ago

Closing this issue because https://github.com/data-apis/array-api-tests exists 🎉