HIPS / autograd

Efficiently computes derivatives of NumPy code.

feat: migrate to `nox`, run tests in parallel, revamp test infra #632

Closed: agriyakhetarpal closed this pull request 1 month ago

agriyakhetarpal commented 3 months ago

Description

This PR makes several changes to the testing infrastructure:

  1. Migration from tox's DSL to Pythonic nox, with three sessions: one for linting, one for tests, and one that builds a source distribution + wheel and validates them (a rough sketch of the test session is included after this list). This is in line with the earlier tox environments, with a few minor changes and one major one: SciPy is now installed as a dependency while testing instead of having a separate environment for it. The linting session hasn't been enabled yet, because it will raise several warnings that need fixing throughout the codebase.
  2. The workflow run now cancels pending/running jobs when new commits are pushed, thereby saving time on repeated pushes. This is enabled through the concurrency: key.
  3. The original testing configuration (as it was before this PR) is retained, without needing separate tox -e test and tox -e test-scipy environments. Instead, the PyPy tests for macOS and Windows are excluded from the GHA matrix, because SciPy does not yet provide wheels for PyPy and installing it from source is lengthy (and, while still easy, ever so slightly trickier on Windows and macOS than on Linux; we could set up a Miniconda environment to test them if needed).
  4. uv is used in the nox sessions and in CI, via https://github.com/yezz123/setup-uv. We don't need any extra caching for our dependencies, because it is faster to download them than to restore them from a cache.
  5. A [test] set of extras has been added, with a minimal set of test dependencies.
  6. We now use pytest-xdist to run tests in parallel, which brings the test run down to ~4 seconds on my machine from ~35 seconds in serial mode, a ~9x speedup! We see similar gains in CI.
  7. I noticed that some tests failed intermittently, which suggested that the tests were not deterministic. I added a fixed NumPy random seed as a pytest fixture that runs before every test. This can be improved further if we refactor the tests in the future.
  8. nox -s nightly-tests uses uv's environment variables to download nightly versions of NumPy (and SciPy where applicable) and installs them before testing. Jobs for this have been added to the CI matrix.
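
For reference, here is a rough sketch of what the parallelised test session could look like in noxfile.py; the session name, dependency pins, and options are illustrative and may not match the PR exactly:

import nox

# Sketch of the test session: install autograd with its [test] extras plus
# SciPy, then run pytest in parallel via pytest-xdist.
@nox.session
def tests(session):
    session.install("-e", ".[test]")
    session.install("scipy")
    session.run("pytest", "-n", "auto", *session.posargs)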

There are some more changes needed across the codebase, such as fixing some missing coverage and either using Codecov or just an HTML coverage report (see https://hynek.me/articles/ditch-codecov-python/), so that the coverage data the tests collect is actually put to use. Some of the missing coverage comes from Python 2-era code, though, so it might be better to remove that entirely once we get to linting and autoformatting the entire codebase.

Closes #631; related to #630.

agriyakhetarpal commented 3 months ago

I now see why the PyPy tests were taking so long: SciPy is being built from the source distribution because it doesn't publish wheels for PyPy, while NumPy does, of course (so I've answered my own question). To keep CI time low, I can add another nox session for PyPy (edit: or I'll just keep them in the same session, but maybe add a separate job for better separation of logs?).

agriyakhetarpal commented 3 months ago

It looks like some of the tests have not been entirely deterministic, and failures due to precision differences between Intel and M-series macOS devices were being reported (only in CI, though; I didn't see any of them locally). Setting a fixed random seed seems to have resolved the problem. This PR enables testing on both Intel and ARM macOS platforms, which we can keep until the Intel runners reach their EoL date.
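
For context, the seed fixture is conceptually along these lines (a minimal sketch; the actual fixture name and seed value in the PR may differ):

import numpy as np
import pytest

# Autouse fixture: fix NumPy's global random seed before every test so that
# randomly generated inputs are reproducible across runs and platforms.
@pytest.fixture(autouse=True)
def fixed_numpy_seed():
    np.random.seed(0)  # placeholder seed value
    yield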

agriyakhetarpal commented 2 months ago

Just added some changes that I had missed; this is ready for review whenever you have the time, @fjosw and @j-towns! In 19dab2a00845f4d4a3b86bb66eaedfac8227ef2f I separated out the SciPy tests so that they run only on non-PyPy runners, which keeps CI faster. The pre-commit check will fail until #634 is merged, so we can safely ignore it.

agriyakhetarpal commented 2 months ago

I have also added tests against the NumPy and SciPy nightlies. Both of them support only Python 3.10–3.13, which explains the failures across platforms for Python 3.8 and 3.9. We will be dropping 3.8 in October anyway, but in the meantime we could explore ways to make these jobs fail more gracefully, or limit our nightly testing to the Python versions the nightlies support (or just the latest Python version). Please let me know your thoughts here.
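
If we go with the latter, limiting the nightly session to supported interpreters could look roughly like this (the session name, version list, and index URL below are illustrative):

import nox

# Sketch: run the nightly session only on Python versions that have nightly
# wheels, and override NumPy/SciPy with pre-release builds after installing
# the package with its test extras.
@nox.session(name="nightly-tests", python=["3.10", "3.11", "3.12", "3.13"])
def nightly_tests(session):
    session.install("-e", ".[test]")
    session.install(
        "--pre",
        "--upgrade",
        "--extra-index-url",
        "https://pypi.anaconda.org/scientific-python-nightly-wheels/simple",
        "numpy",
        "scipy",
    )
    session.run("pytest", "-n", "auto")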

agriyakhetarpal commented 2 months ago

Thanks for going through the changes, @fjosw! nox is essentially the same as tox, but I hope the Pythonic API makes it at least slightly easier to read and maintain your own sessions!

I'm not sure I have a definitive answer for how robust the nightly tests will be: I think it depends a lot on our availability to land fixes at the right time, since breakages will be critical for our users. At the same time, if we add an upper pin on NumPy, it could, in rare circumstances, conflict with our ability to run the nightly tests. For example, if we pin NumPy to <2.4 and NumPy 2.4.0-dev wheels are published to the nightly indices, we won't be able to install those wheels because dependency resolution will fail with an unsatisfied requirement. If we bump the pin to <2.5, we run the risk of a new NumPy release landing on PyPI later and breaking our code (if we don't notice and fix it in time). It is a very low-risk situation overall, though: we are both actively making changes to the repository, and NumPy makes it easier for us by deprecating things with warnings at least two releases before they are removed and by urging downstream maintainers to try the (one, and in some cases two) release candidates when they become available.

However, without a thorough refactoring of the original code, we can expect problems in the coming months and years. For example, if something breaks in, say, the NumPy 2.3.0 nightly wheels while NumPy 2.2.0 is the stable release, we'll need to add code patterns similar to this:

if _np.lib.NumpyVersion(_np.__version__) >= "2.3.0":
    # some logic
elif _np.lib.NumpyVersion(_np.__version__) >= "2.2.0":
    # some other logic
else:
    # yet some more logic

which is more of a code smell than a pattern, and will surely not be maintainable for us. At the same time, we don't want a lot of red to appear in our CI (and not too much if it does). Moreover, it is too early to support NumPy>2, as we discussed in https://github.com/HIPS/autograd/pull/628#pullrequestreview-2255629293.
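
If such branches do become unavoidable, one way to at least contain them might be to centralise the version check in a single helper module, so call sites branch on a descriptive flag rather than repeating the comparison everywhere. This is only a hypothetical sketch; the module and flag names below are made up and not part of this PR:

# Hypothetical helper module, e.g. autograd/_numpy_compat.py (name is a placeholder).
import numpy as _np

NUMPY_VERSION = _np.lib.NumpyVersion(_np.__version__)

# Call sites would import and branch on flags like this one instead of
# repeating the version comparison inline.
NUMPY_AT_LEAST_2_3 = NUMPY_VERSION >= "2.3.0"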

With that extra context, I'm open to more suggestions on how we can improve our testing, since I'm out of ideas :) We still need to add some testing against the minimum supported NumPy version (1.X) and decide on a lower bound, which I feel can be solved using uv pip install --resolution lowest (see https://docs.astral.sh/uv/concepts/resolution/#lower-bounds); a rough sketch of what that session could look like is below. I will do that in another PR, of course.
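
This is only a sketch of the idea, assuming a recent nox that supports uv as its backend so that session.install forwards the --resolution flag to uv pip install; the session name and options are illustrative:

import nox

# Sketch: resolve dependencies to their lowest versions that satisfy the
# declared lower bounds, so the oldest supported NumPy actually gets tested.
@nox.session(venv_backend="uv")
def minimum_versions(session):
    session.install("--resolution", "lowest", "-e", ".[test]")
    session.run("pytest", "-n", "auto")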

I just triggered CI again with the updated config, and while I can disable the nightly tests for Python 3.8 and 3.9 since there are no nightly wheels for them (we'll drop 3.8 in October anyway), I wonder if you have suggestions on the actual test failures? It could be a NumPy bug or a bug in our own codebase, but I'm not sure which. I thought it was flaky and re-triggered the test in CI, but that did not help; it fails with the same values.

agriyakhetarpal commented 1 month ago

In d54683477e790a0943abdedefa88245ebdb5eec0, I've disabled the nightly NumPy tests for now, since you've already approved the PR and nothing else is blocking it. I'll merge this as is and open a new PR to track re-enabling the nightly tests and to look into the PyPy Windows failure there. An answer to the comment above about our policy here would be appreciated, and thank you for the review, @fjosw!