agriyakhetarpal closed this 1 month ago
I now see why the PyPy tests were taking a long time – SciPy is being built from the source distribution because it doesn't publish wheels for PyPy, while NumPy does, of course (hence I'm answering my own question). To keep CI time low, I can add another `nox` session for PyPy (edit: or I'll just keep them in the same session, but maybe add a separate job for better separation of logs?).
It looks like some of the tests have not been entirely deterministic: failures from precision differences between macOS Intel and M-series devices were being reported (only in CI, though – I didn't face any of them locally). Setting a fixed random seed seems to have resolved the problem. This PR enables testing against both Intel and ARM macOS platforms, which we can keep until Intel runners reach their EoL date.
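As a minimal illustration (not code from this PR) of why pinning the seed helps: two runs that seed NumPy's global RNG identically produce identical draws, removing one source of run-to-run nondeterminism in tolerance-based comparisons.

```python
import numpy as np

# Seed, draw, then re-seed and draw again: both runs see the same stream.
np.random.seed(42)
first = np.random.rand(5)

np.random.seed(42)
second = np.random.rand(5)

print(np.array_equal(first, second))  # True
```

(Seeding doesn't guarantee bit-identical floating point across architectures, but it does make each platform's runs reproducible, which is what the flaky comparisons needed.)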
Just added some additional changes that I had missed – this is ready for review whenever you have the time, @fjosw and @j-towns! I separated out the SciPy tests in 19dab2a00845f4d4a3b86bb66eaedfac8227ef2f so that they run only on non-PyPy runners, giving us faster CI. The pre-commit check will fail until #634 gets merged, so we can safely ignore that.
I have also added tests against NumPy and SciPy nightlies. Both of them support Python 3.10–3.13, which explains the failure across platforms for Python 3.8 and 3.9. We will be dropping 3.8 sometime soon in October anyway, but in the meantime we could explore ways to make these jobs fail more gracefully, or limit our nightly testing to the supported Python versions (or just the latest one). Please let me know your thoughts here.
Thanks for going through the changes, @fjosw! After all, `nox` is essentially the same as `tox`, but the Pythonic API should make it at least slightly easier to read and maintain your own sessions, I hope!
I'm not sure I have a good answer for how robust the nightly tests will be – I think it depends a lot on our availability to land fixes at the right time, since breakages will be critical to our users. At the same time, if we implement an upper pin on NumPy, it could, in rare circumstances, conflict with our ability to run the nightly tests, too. Say we pin NumPy to `<2.4` and NumPy 2.4.0-dev wheels get published in the nightly indices: we won't be able to install those wheels, because dependency resolution will fail with an unsatisfied requirement. If we bump the pin to `<2.5`, we run the risk of a new version of NumPy being released on PyPI sometime later and breaking our code (if we aren't able to notice and fix it in time). It is a low-risk situation overall, though, since we are both actively making changes to the repository; plus, NumPy makes it easier for us by deprecating things with warnings at least two releases before they get removed, and we as downstream maintainers are urged by them to try out the (one, and in some cases, two) release candidates when they are available.
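To see concretely why a `<2.4` pin excludes the 2.4.0-dev nightlies, here is a small check using the `packaging` library: PEP 440 treats a dev release as a pre-release of 2.4.0, and an exclusive `<2.4` bound rejects pre-releases of the bound version itself, even when pre-releases are otherwise allowed.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

dev = Version("2.4.0.dev0")

# `<2.4` rejects pre-releases of 2.4.0 itself, even with prereleases enabled...
print(SpecifierSet("<2.4").contains(dev, prereleases=True))  # False

# ...while a bumped `<2.5` bound admits them.
print(SpecifierSet("<2.5").contains(dev, prereleases=True))  # True
```

This is the exact trade-off described above: the looser bound lets the nightly resolve, at the cost of also admitting a future stable 2.4.x that we haven't tested yet.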
However, without a robust refactoring of the original code, we can expect that there will be problems in the coming months and years. For example, if we see that something breaks, say, in NumPy 2.3.0 in their nightly wheels when we are in the NumPy 2.2.0 stable stage, we'll need to add some code patterns similar to this:
```python
if _np.lib.NumpyVersion(_np.__version__) >= "2.3.0":
    # some logic
elif _np.lib.NumpyVersion(_np.__version__) >= "2.2.0" and _np.lib.NumpyVersion(_np.__version__) < "2.3.0":
    # some other logic
else:
    # yet some more logic
```
which are more code smells than code patterns, and will surely not be maintainable for us. At the same time, we don't want a lot of red to appear in our CI either (and not too much when it does). Moreover, it is too early to support NumPy >2, as we discussed in https://github.com/HIPS/autograd/pull/628#pullrequestreview-2255629293.
With that extra context, I'm open to more suggestions on how we can improve our testing, since I'm out of ideas :) We still need to add some testing against the minimum supported NumPy version (1.X) and decide on a lower bound, which I feel can be solved using `uv pip install --resolution lowest` (see https://docs.astral.sh/uv/concepts/resolution/#lower-bounds). I will be doing that in another PR, of course.
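For reference, `--resolution lowest` only exercises a meaningful floor if lower bounds are actually declared in the metadata; a hypothetical `pyproject.toml` fragment (the bound shown is a placeholder, not a decided value):

```toml
[project]
dependencies = [
    # With `uv pip install --resolution lowest`, uv installs exactly this
    # version, so CI can verify we still work against the declared floor.
    "numpy >= 1.17",  # placeholder lower bound, to be decided
]
```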
I just triggered CI again with the updated config, and while I can disable the current nightly tests for Python 3.8 and 3.9 since they don't have wheels (we'll drop 3.8 soon in October, anyway), I wonder if you have suggestions on the actual test failures? It could be a NumPy bug or a bug in our own codebase, but I'm not sure which. I thought it was flaky and re-triggered that test in CI, but that did not seem to help; it fails with the same values.
In d54683477e790a0943abdedefa88245ebdb5eec0, I've disabled the nightly NumPy tests for now, since you've already approved the PR and nothing else is blocking it. I'll merge this for now, and open a new PR to track re-enabling the nightly tests and to look at the PyPy Windows failure there. A response to the above comment about the proposed policies would be appreciated, and thank you for the review here, @fjosw!
Description
This PR adds several changes to the testing infrastructure:
- Migrates from `tox`'s DSL to Pythonic `nox`, with three sessions: one for linting, one for tests, and one to build a source distribution + wheel and validate them. This is in line with the earlier `tox` environments, with a few minor changes, and a major one in that SciPy is installed as a dependency while testing instead of having a separate environment for it. The linting session hasn't been activated yet, because that will raise several warnings to fix throughout the codebase.
- Adds a `concurrency:` key.
- Merges `tox -e test` and `tox -e test-scipy`. The tests for macOS and Windows for PyPy are excluded from the GHA matrix instead, because SciPy does not yet provide wheels for PyPy and installing it from source is lengthy (and, while still easy, ever so slightly trickier on Windows and macOS compared to Linux – we could set up a Miniconda environment to test them if needed).
- `uv` is used in the `nox` sessions and in CI, via https://github.com/yezz123/setup-uv. We don't need any extra caching for our dependencies, because it is faster to download them than to retrieve them from a cache.
- A `[test]` set of extras has been added, with a minimal set of test dependencies.
- Uses `pytest-xdist` to run tests in parallel, which completes the tests in 4 seconds on my machine, compared to ~35 seconds in serial mode. This is a ~9x speedup! We see similar gains in CI.
- `nox -s nightly-tests` uses environment variables from `uv` to download nightly versions of NumPy (and SciPy where applicable) and installs them before testing. Jobs have been included for this purpose in the CI matrix.

There are some more changes needed across the codebase, such as fixing some missing coverage and either using Codecov or just an HTML coverage report (see https://hynek.me/articles/ditch-codecov-python/), so that the coverage tracking from the tests is actually made use of. Some of the missing coverage came from Python 2-era code, though, so it might be good to remove that entirely once we get to linting and autoformatting the entire codebase.
Closes #631; related to #630.