data-apis / array-api-tests

Test suite for the PyData Array APIs standard
https://data-apis.org/array-api-tests/
MIT License

Faster default runs #246

Closed · honno closed this 5 months ago

honno commented 5 months ago

Some changes that significantly speed up the run of the test suite:

honno commented 5 months ago

(Seems my special case changes broke some things, couldn't figure it out now so will have a look next week :sweat_smile:)

asmeurer commented 5 months ago

Need to update the README for the new max-examples default.

honno commented 5 months ago

Nothing should be controversial now, so I'll merge since it'd be nice to get this in.

asmeurer commented 5 months ago

Hmm, the tests run suspiciously fast now (under a minute). Are we really still testing everything as rigorously as we used to? I'm all for faster runs, but we also want to make sure that actual failures are caught, ideally without having to rerun the tests multiple times.

honno commented 5 months ago

> Hmm, the tests run suspiciously fast now (under a minute). Are we really still testing everything as rigorously as we used to? I'm all for faster runs, but we also want to make sure that actual failures are caught, ideally without having to rerun the tests multiple times.

I think we're striking a nice balance. The CI job is an extreme example, as array_api_strict/NumPy is fairly Hypothesis-friendly and so the tests run quickly, plus we have a skips file for any failures. Something like jax.experimental.array_api still takes ~7 mins to run locally, whereas array_api_strict takes 20 seconds... that is a rather big jump tbf, but admittedly JAX has more errors too, which really contribute to the timings.

asmeurer commented 5 months ago

Is this primarily coming from vectorization? If we're really running this fast, I don't see the need to reduce the default number of examples.

honno commented 5 months ago

> Is this primarily coming from vectorization?

Vectorization definitely helps, but it's also that we're now reducing the max examples for the tests which aren't vectorized, plus only using a single example for the special case tests.
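
For illustration, a minimal sketch of the vectorization idea, assuming the suite batches many values into each Hypothesis example (the test and property here are my own, not the suite's actual code):

```python
from hypothesis import given
from hypothesis.extra.numpy import arrays
import numpy as np

@given(arrays(dtype=np.float64, shape=100))
def test_abs_nonnegative(x):
    # one Hypothesis example now exercises 100 values at once, so far
    # fewer examples are needed for the same value coverage
    result = np.abs(x)
    assert np.all((result >= 0) | np.isnan(result))
```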

> If we're really running this fast, I don't see the need to reduce the default number of examples.

I think the slowest adopter[^1] should be our lowest common denominator for how we set default runs of the test suite, i.e. JAX taking ~7 mins (which even then is longer than ideal for a default run IMO). Then we can leave it to users (and ourselves) to bump the max examples for CI jobs; see the sketch at the end of this comment.


On a general note about examples: in my experience it's often only a few simple examples that cause failures, so the super interesting examples that come with a larger max aren't that rewarding for the significant time they add.

[^1]: Not to say JAX is slow, more so how it interplays with how the test suite works. The biggest factor is that error reporting is slow, because Hypothesis stresses the code it's testing to make cleaner, reproducible examples.
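
One hedged sketch of what "bump the max examples for CI jobs" could look like using plain Hypothesis settings profiles (the profile names and environment variable here are illustrative, not part of the test suite):

```python
import os
from hypothesis import settings

# a quick profile for default/local runs, a larger budget for CI
settings.register_profile("quick", max_examples=20)
settings.register_profile("ci", max_examples=200)
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "quick"))
```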

rgommers commented 5 months ago

Agreed with @honno's comments above: the default number of examples should be optimized for fast test suite execution time, and having a high number of examples really isn't very interesting anyway. I'd honestly set it to 1, not 20. 1 is regular CI: "is everything implemented and does it all work". 20/200/inf is mostly repeating the same thing over and over, and for when it does actually add value, that is fuzz testing. 7 min is still slow for regular CI usage.

asmeurer commented 5 months ago

Very strong disagree. A very low number of examples does not actually "run the tests" in the way that people think it does, and is very misleading. If you look at the tests in the test suite, each test is testing 10 or 20 different things. It's very hard for a single, randomly selected example to be representative enough to check everything. Consider that different "examples" mean different combinations of things like shapes and dtypes. Testing one example, as Ralf suggests, would mean just checking an array with one of the supported dtypes, some shape, and some values.

I tested running test_add (a function with a relatively simple set of possible inputs) with one example, and the example arrays hypothesis generated were `(array(0., dtype=uint8), array(0., dtype=uint8))`. Furthermore, I ran the exact same command (`ARRAY_API_TESTS_MODULE=numpy.array_api pytest --max-examples=1 -s -k test_add array_api_tests/test_operators_and_elementwise_functions.py --disable-warnings`) multiple times, and it generated that exact same example every single time.

This is how hypothesis works. It generates interesting examples, where "interesting" means either corner cases (like this one) or general cases (like a "random" array). But it can only do this if you give it a chance to. There's a reason the hypothesis default is 100 examples per run.
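
A toy illustration of the budget effect using plain hypothesis (not the suite's tests): with max_examples=1 you usually only ever see the simplest input, while the default budget of 100 mixes corner cases with more varied data.

```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=1)
@given(st.lists(st.integers()))
def test_reverse_tiny_budget(xs):
    # with a budget of 1, this will typically only ever see the
    # simplest input, e.g. the empty list
    assert list(reversed(list(reversed(xs)))) == xs

@settings(max_examples=100)  # hypothesis's own default budget
@given(st.lists(st.integers()))
def test_reverse_default_budget(xs):
    assert list(reversed(list(reversed(xs)))) == xs
```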

If we set max-examples to 1, test_add would not test broadcasting, behavior on floating-point dtypes, complex number support, support for other integer dtypes, or even that add(x, y) is actually performing addition. It would be like having a single test:

```python
def test_add():
    x = array(0., dtype=uint8)
    y = array(0., dtype=uint8)
    assert_all(add(x, y), array(0, dtype=uint8))
```

This would obviously not be a very good test for the add function in the standard.
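
By contrast, a property-based sketch (hypothetical, not the suite's actual test_add) lets hypothesis vary dtype, shape, and values across examples:

```python
from hypothesis import given
from hypothesis.extra.numpy import arrays, array_shapes
import hypothesis.strategies as st
import numpy as np

@given(data=st.data())
def test_add_commutative(data):
    # each example draws a fresh dtype/shape/value combination
    dtype = data.draw(st.sampled_from([np.uint8, np.int32, np.float64]))
    shape = data.draw(array_shapes(max_dims=3))
    x = data.draw(arrays(dtype=dtype, shape=shape))
    y = data.draw(arrays(dtype=dtype, shape=shape))
    np.testing.assert_array_equal(np.add(x, y), np.add(y, x))
```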

> I think the slowest adopter should be our lowest common denominator for how we set default runs of the test suite

Strong disagree here too. We should optimize for the common case, not the worst case. JAX can manually set their example count lower in their CI if they want to; it's easy to do that with the flags we've provided. But the default behavior should be something reasonable for most libraries, which aren't going to set that flag (and probably don't even realize it's there).

I agree that we need to make the CI times reasonable, but the other improvements like vectorization have done this enough that I don't see the need to lower the default to 20 any more. With the latest version of the test suite, NumPy 2.0 runs in 16 seconds with max-examples set to 100. If that's not fast enough for you, I really don't know what to say. Really, if anything, we should take this as an opportunity to increase the number of examples that are run by default and improve the default test coverage. Again, tests that don't get covered in the first (or tenth or hundredth) run of the tests are very common. Just look at how often I have to update the XFAILs file for array-api-compat because some rare error only pops up sometimes on CI: https://github.com/data-apis/array-api-compat/commits/main/numpy-1-21-xfails.txt

Note that I am only talking about the normal tests here. I am fine with keeping the special case tests at a lower number of examples. Although even there we should be careful that some special cases are not fully covered by one example, and we should be cautious of the apparent hypothesis behavior of giving zero variability when max-examples is set to 1.

leofang commented 5 months ago

I do not believe a rigorous test suite would take less than 1 min to run. This is simply impossible. Please do not over-optimize. I second what Aaron said.

rgommers commented 5 months ago

> If you look at the tests in the test suite, each test is testing 10 or 20 different things.

Well okay, that is then not going to be possible with 1 example; I didn't know that (and it doesn't have to be that way; it looks like it's a choice made per test by the test author). I think this is a little problematic for a test-suite design if you care about runtime, because it also means that 20 examples isn't enough: you have no guarantee that the first 20 examples test the 20 different things. I guess it becomes a stochastic thing: 25 is probably still not enough to reliably test everything; 30 or 40 has a high probability of being comprehensive, etc.
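
As a back-of-envelope check of that stochastic point (my own numbers; hypothesis's generation is deliberately non-uniform and corner-case-seeking, so real coverage should be better than this suggests): if a test exercises k equally likely behaviours, the chance that n independent uniform draws hit all of them is an inclusion-exclusion sum.

```python
from math import comb

def p_all_covered(k: int, n: int) -> float:
    # inclusion-exclusion over the behaviours that are never drawn
    return sum((-1) ** i * comb(k, i) * ((k - i) / k) ** n
               for i in range(k + 1))

for n in (20, 40, 100):
    print(f"n={n:3d}: P(all 20 behaviours covered) = {p_all_covered(20, n):.3f}")
```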

> I do not believe a rigorous test suite would take less than 1 min to run. This is simply impossible.

This is definitely not impossible. The test suite for numpy runs in O(100 sec), and tests (a) 10x more functions, (b) more computationally expensive functions on average, and (c) actually needs to compare numerical accuracy (unlike this test suite, where there's no expected answer that must be met). So if the test suite was written with similar techniques/efficiency as for numpy, then testing numpy's implementation shouldn't take more than 5-10 seconds.

> Again, tests that don't get covered in the first (or tenth or hundredth) run of the tests are very common. Just look at how often I have to update the XFAILs file for array-api-compat because some rare error only pops up sometimes on CI

Yes indeed. This is a problem. This is not regular test suite usage. Instead, this is fuzz testing. Random failures like that should not happen for regular regression testing.

> and we should be cautious of the apparent hypothesis behavior of giving zero variability when max-examples is set to 1.

Zero variability is what you want in regular usage. "Flaky tests" are an issue for pretty much every large library. I know hypothesis doesn't make it easy to distinguish between these two fundamentally different flavors of testing (regression vs. fuzz), but both matter, and libraries typically want the former in their repos.
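
For what it's worth, hypothesis does expose a knob along these lines (derandomize is real hypothesis API; the profile names are my own): derandomize=True gives reproducible, regression-style runs, while the default reseeds on every run for fuzz-style variety.

```python
from hypothesis import settings

# regression flavour: the same examples every run, no flaky surprises
settings.register_profile("regression", derandomize=True)
# fuzz flavour: fresh random examples each run, with a bigger budget
settings.register_profile("fuzz", max_examples=1000)
settings.load_profile("regression")
```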

asmeurer commented 4 months ago

Ralf, it sounds like you are more discussing the merits of property-based testing in general. I don't know if it's really worth my time to try to convince you of its value. The fact is, hypothesis tests are the type of tests we chose to use for the array-api-tests test suite. It's my personal experience that you get much more value out of hypothesis testing if you embrace its paradigm instead of trying to fight it or shoehorn it into a traditional testing paradigm.

I've made a PR resetting the default max-examples back to 100: https://github.com/data-apis/array-api-tests/pull/254. You can see at https://github.com/data-apis/array-api-strict/pull/34 that I set max-examples to double that (200), and the whole CI runs in a minute (that includes installing dependencies). 100 is the default value that the hypothesis developers themselves chose, so I think we shouldn't change it, especially to something lower, unless there's a good reason.

asmeurer commented 4 months ago

Regarding special-cases, is hypothesis used at all now? I ran

```
ARRAY_API_TESTS_MODULE=array_api_strict pytest "array_api_tests/test_special_cases.py::test_unary[asin(x_i > 1) -> NaN]" -s --hypothesis-verbosity=verbose --disable-warnings
```

(to pick a random example), and hypothesis didn't output anything. It might be worth continuing to use hypothesis at least for special cases that refer to multiple values, like this one. And also, as I said, ensuring that it actually varies the example across runs.

honno commented 4 months ago

> Regarding special-cases, is hypothesis used at all now?

Yes, but not in the typical @given-decorated test case way; it's just used to declaratively define the special cases, and then only one example is pulled. Something I'll write an issue on to discuss.