patrickzib closed this issue 8 months ago.
Another example: [screenshot]
Yup, this is way too long to be practical. I would say 10 minutes is closer to a comfortable upper limit, but we are far away from that.
I suspect that the way tests are run is partly responsible for the horrendous execution time, since the test config has 1400+ LOC, imports everything, and dispatches estimators to test suites (or something like that). One performance penalty here is that all numba functions are compiled, which in turn makes the test collection step very long. The testing monstrosity is also very hard to modify and maintain, since it is partly centralized and partly distributed across individual modules.
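A cheap way to confirm this would be to time collection alone, since numba compilation triggered at import time is paid before a single test runs. A hypothetical CI step (just one way to measure it, not something we have today):

```yaml
# Hypothetical sketch: measure pure collection time, where import-time numba
# compilation is paid, before any test actually executes.
- name: Time test collection only
  run: time python -m pytest --collect-only -q
```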
I don't know if I am able to offer a strategy to deal with that, but I would probably distribute the tests to subpackages or individual modules and move away from the centralized runner. The aim would be to enable running tests at different granularities, e.g. only the classification subpackage or a single module file. I don't think this is easy to do now.
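For illustration only, a hypothetical GitHub Actions sketch of per-subpackage granularity; the subpackage names, the `[dev]` extra, and the layout are assumptions rather than the current setup:

```yaml
# Hypothetical sketch: one CI job per subpackage, assuming tests are moved
# next to the code they cover. Subpackage names and paths are illustrative.
jobs:
  test-subpackage:
    strategy:
      matrix:
        subpackage: [classification, regression, clustering]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: python -m pip install ".[dev]"  # assumed dev extra
      - name: Run only one subpackage's tests
        run: python -m pytest aeon/${{ matrix.subpackage }}
```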
Let me know if there are better alternatives.
The testing is a monster, and personally I would like to scrap it all and go back to something closer to what it was like before, which was based on scikit-learn. There are also a lot of possibly unnecessary tests, and a confusion between interface testing and use-case testing. Testing edge cases in general CI is not feasible IMO, and having different testing parameters seems to be a half-hearted attempt to do just that. I would like levels of testing.
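One way such levels could look, sketched under the assumption that a `slow` marker gets registered and applied to the expensive use-case tests (none of this exists today):

```yaml
# Hypothetical sketch of "levels of testing": PRs run only the fast interface
# tests, while a nightly scheduled run covers everything. The "slow" marker
# is an assumption and would have to be registered and applied first.
on:
  pull_request:
  schedule:
    - cron: "0 3 * * *"   # nightly full run
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: python -m pip install ".[dev]"  # assumed dev extra
      - name: Fast interface tests on PRs
        if: github.event_name == 'pull_request'
        run: python -m pytest -m "not slow"
      - name: Full suite nightly
        if: github.event_name == 'schedule'
        run: python -m pytest
```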
Linux is the worst now, at 3 hours?
As discussed in the last meeting, pytest-testmon (https://github.com/sktime/sktime/issues/2530) may be a good inclusion. It seems like there are some items which could cause issues with it, however:

- pytest-xdist
- the matrix design used to split off tests

I'm not exactly sure which of these, or the combination, is causing the trouble. There may also be some fundamental issues with the test set in play that would need to be fixed (see #271).
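For concreteness, a hypothetical fragment of how pytest-testmon could be wired in; the cache key and path are assumptions, and the potential conflict with pytest-xdist's worker split is exactly the open question above:

```yaml
# Hypothetical steps: persist testmon's database between CI runs so that only
# tests affected by changed code are re-run on the next commit.
- uses: actions/cache@v3
  with:
    path: .testmondata
    key: testmon-${{ runner.os }}-${{ github.ref }}
- name: Run only tests affected by recent changes
  run: python -m pytest --testmon
```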
The original author, @tarpas, commented with some interesting ideas in a recent sktime discussion on the same topic (https://github.com/sktime/sktime/discussions/4453#discussioncomment-5584302).
How often will a test which passes on Python 3.7 and 3.11 fail on Python 3.10? I think there might be an opportunity to free up some capacity here: don't test 3.8-3.10 on every commit, but only on important ones (or periodically).
This would lower the number of jobs required by quite a lot, allowing for more concurrent PRs. I am not sure how it would interact with the matrix design, though; it may just make individual runs take longer on their own.
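A hypothetical sketch of the reduced per-commit matrix, with version numbers taken from the comment above; the full matrix would move to a separate scheduled workflow:

```yaml
# Hypothetical sketch: each commit tests only the min and max Python versions.
# A separate workflow with `on: schedule:` would cover the full
# ["3.7", "3.8", "3.9", "3.10", "3.11"] matrix periodically.
on:
  pull_request:
jobs:
  test:
    strategy:
      matrix:
        python-version: ["3.7", "3.11"]   # extremes only, per commit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - run: python -m pip install ".[dev]"  # assumed dev extra
      - run: python -m pytest
```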
To improve latency (time to failure) you could then use pytest-split. This may be a good alternative to pytest-xdist, though I don't know if it will interact any better with pytest-testmon.
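A hypothetical sketch of what that could look like; the group count of 4 is arbitrary, and pytest-split balances the groups by stored test durations when a durations file is available:

```yaml
# Hypothetical sketch: partition the suite into 4 groups of roughly equal
# duration and run them as parallel jobs, so time-to-failure is bounded by
# the slowest group rather than by the whole suite.
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        group: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: python -m pip install ".[dev]" pytest-split
      - name: Run one quarter of the suite
        run: python -m pytest --splits 4 --group ${{ matrix.group }}
```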
An option could be to have two sets of tests, a minimal set and a full set:

- Minimal set: pytest-testmon, no code coverage, min and max Python versions only (for now 3.8 and 3.10, maybe even just one). May need extra changes for compatibility with pytest-testmon, i.e. no matrix design.
- Full set: the matrix design, plus an additional run which disables numba to capture code coverage on all tests (currently numba does not play well with it); see the sketch below.

This of course is just one approach, and either way we would have to tackle the extreme test time currently seen in the test suite (i.e. #154).
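The coverage-with-numba-disabled run could look roughly like this; NUMBA_DISABLE_JIT is numba's own switch for falling back to pure Python, while the package name and report format here are assumptions:

```yaml
# Hypothetical step for the full set: with the JIT disabled, coverage.py can
# trace the Python source of functions that numba would normally compile.
- name: Coverage run with numba disabled
  env:
    NUMBA_DISABLE_JIT: "1"
  run: python -m pytest --cov=aeon --cov-report=xml
```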
We are down to about 30 minutes now, and it seems fine to me at the moment. I will close this; we can then open more specific testing/CI issues. This would be fantastic, though: https://github.com/aeon-toolkit/aeon/pull/849
CI is slow on some platforms, taking up to 132 minutes. Among the slowest tests is sktime/forecasting/tests/test_all_forecasters.py::TestAllForecasters.

[Screenshot taken from test-unix (3.9, macOS-11)]
To start with: can we do something about these forecasting tests? They take >1000 seconds.
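To see where that time goes, pytest's built-in duration report is a cheap first step; a hypothetical CI step using the path from the comment above:

```yaml
# Hypothetical step: report the 25 slowest test phases in the forecasting
# module, to find which estimators dominate the >1000 s total.
- name: Profile slowest forecasting tests
  run: >
    python -m pytest sktime/forecasting/tests/test_all_forecasters.py
    --durations=25
```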