patrickzib closed this issue 8 months ago.
Another example: [screenshot]
Yup, this is way too long to be practical. I would say 10 minutes is closer to a comfortable upper limit, but we are far away from that.
I suspect that the way tests are run is partly responsible for the horrendous execution time, since the test config has 1400+ LOC, imports everything, and dispatches estimators to test suites (or something like that). One performance penalty here is that all numba functions are compiled, which in turn makes the test collection step very long. The testing monstrosity is also very hard to modify and maintain, since it is partly centralized and partly distributed across individual modules.
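A cheap way to confirm this would be to time collection alone, since numba compilation triggered at import time is paid before a single test runs. A hypothetical CI step (just one way to measure it, not something we have today):

```yaml
# Hypothetical sketch: measure pure collection time, where import-time numba
# compilation is paid, before any test actually executes.
- name: Time test collection only
  run: time python -m pytest --collect-only -q
```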
I don't know if I am able to offer a strategy to deal with that, but I would probably distribute the tests to subpackages or individual modules and move away from the centralized runner. The aim would be to enable running tests at different granularities, e.g. only the classification subpackage or a single module file. I don't think this is easy to do now.
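For illustration only, a hypothetical GitHub Actions sketch of per-subpackage granularity; the subpackage names, the `[dev]` extra, and the layout are assumptions rather than the current setup:

```yaml
# Hypothetical sketch: one CI job per subpackage, assuming tests are moved
# next to the code they cover. Subpackage names and paths are illustrative.
jobs:
  test-subpackage:
    strategy:
      matrix:
        subpackage: [classification, regression, clustering]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: python -m pip install ".[dev]"  # assumed dev extra
      - name: Run only one subpackage's tests
        run: python -m pytest aeon/${{ matrix.subpackage }}
```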
Let me know if there are better alternatives.
The testing is a monster, and personally I would like to scrap it all and go back to something closer to what it was like before, which was based on scikit-learn. There are also a lot of possibly unnecessary tests, and a confusion between interface testing and use-case testing. Testing edge cases in general CI is not feasible IMO, and having different testing parameters seems to be a half-hearted attempt to do just that. I would like levels of testing.
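One way such levels could look, sketched under the assumption that a `slow` marker gets registered and applied to the expensive use-case tests (none of this exists today):

```yaml
# Hypothetical sketch of "levels of testing": PRs run only the fast interface
# tests, while a nightly scheduled run covers everything. The "slow" marker
# is an assumption and would have to be registered and applied first.
on:
  pull_request:
  schedule:
    - cron: "0 3 * * *"   # nightly full run
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: python -m pip install ".[dev]"  # assumed dev extra
      - name: Fast interface tests on PRs
        if: github.event_name == 'pull_request'
        run: python -m pytest -m "not slow"
      - name: Full suite nightly
        if: github.event_name == 'schedule'
        run: python -m pytest
```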
Linux is the worst now, at 3 hours?
As discussed in the last meeting, pytest-testmon (https://github.com/sktime/sktime/issues/2530) may be a good inclusion. It seems like there are some items which could cause issues with it, however:

- pytest-xdist
- the matrix design used to split off tests

I'm not exactly sure which of these, or the combination, is causing the trouble. There may also be some fundamental issues with the test set in play that would need to be fixed (see #271).
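For concreteness, a hypothetical fragment of how pytest-testmon could be wired in; the cache key and path are assumptions, and the potential conflict with pytest-xdist's worker split is exactly the open question above:

```yaml
# Hypothetical steps: persist testmon's database between CI runs so that only
# tests affected by changed code are re-run on the next commit.
- uses: actions/cache@v3
  with:
    path: .testmondata
    key: testmon-${{ runner.os }}-${{ github.ref }}
- name: Run only tests affected by recent changes
  run: python -m pytest --testmon
```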
The original author, @tarpas, commented with some interesting ideas in a recent sktime discussion on the same topic (https://github.com/sktime/sktime/discussions/4453#discussioncomment-5584302).
How often will a test which passes on Python 3.7 and 3.11 fail on Python 3.10? I think there might be an opportunity to free up some capacity here: don't test 3.8-3.10 on every commit, but only on important ones (or periodically).
This would lower the number of jobs required by quite a lot, allowing for more concurrent PRs. I am not sure how it would interact with the matrix design, though; it may just make individual runs take longer on their own.
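A hypothetical sketch of the reduced per-commit matrix, with version numbers taken from the comment above; the full matrix would move to a separate scheduled workflow:

```yaml
# Hypothetical sketch: each commit tests only the min and max Python versions.
# A separate workflow with `on: schedule:` would cover the full
# ["3.7", "3.8", "3.9", "3.10", "3.11"] matrix periodically.
on:
  pull_request:
jobs:
  test:
    strategy:
      matrix:
        python-version: ["3.7", "3.11"]   # extremes only, per commit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - run: python -m pip install ".[dev]"  # assumed dev extra
      - run: python -m pytest
```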
To improve latency (time to failure) you could then use pytest-split. This may be a good alternative to pytest-xdist, though I don't know if it will interact any better with pytest-testmon.
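A hypothetical sketch of what that could look like; the group count of 4 is arbitrary, and pytest-split balances the groups by stored test durations when a durations file is available:

```yaml
# Hypothetical sketch: partition the suite into 4 groups of roughly equal
# duration and run them as parallel jobs, so time-to-failure is bounded by
# the slowest group rather than by the whole suite.
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        group: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: python -m pip install ".[dev]" pytest-split
      - name: Run one quarter of the suite
        run: python -m pytest --splits 4 --group ${{ matrix.group }}
```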
An option could be to have two sets of tests, a minimal set and a full set:

- Minimal set: pytest-testmon, no code coverage, min and max Python versions only (for now 3.8 and 3.10, maybe even just one). May need extra changes for compatibility with pytest-testmon, i.e. no matrix design.
- Full set: the matrix design, plus an additional run which disables numba to capture code coverage on all tests (currently numba does not play well with it); see the sketch below.

This of course is just one approach, and either way we would have to tackle the extreme test time currently seen in the test suite (i.e. #154).
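The coverage-with-numba-disabled run could look roughly like this; NUMBA_DISABLE_JIT is numba's own switch for falling back to pure Python, while the package name and report format here are assumptions:

```yaml
# Hypothetical step for the full set: with the JIT disabled, coverage.py can
# trace the Python source of functions that numba would normally compile.
- name: Coverage run with numba disabled
  env:
    NUMBA_DISABLE_JIT: "1"
  run: python -m pytest --cov=aeon --cov-report=xml
```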
We are down to about 30 minutes now, and it seems fine to me at the moment. I will close this; we can then open more specific testing/CI issues. This would be fantastic, though: https://github.com/aeon-toolkit/aeon/pull/849
CI is slow on some platforms, taking up to 132 minutes. Among the slowest tests is sktime/forecasting/tests/test_all_forecasters.py::TestAllForecasters.

[Screenshot taken from test-unix (3.9, macOS-11)]
To start with: can we do something about these forecasting tests? They take >1000 seconds.
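To see where that time goes, pytest's built-in duration report is a cheap first step; a hypothetical CI step using the path from the comment above:

```yaml
# Hypothetical step: report the 25 slowest test phases in the forecasting
# module, to find which estimators dominate the >1000 s total.
- name: Profile slowest forecasting tests
  run: >
    python -m pytest sktime/forecasting/tests/test_all_forecasters.py
    --durations=25
```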