ai2cm / fv3net

explore the FV3 data for parameterization
MIT License

fv3fit unit tests are slow #2057

Open oliverwm1 opened 1 year ago

oliverwm1 commented 1 year ago

fv3fit unit tests are slow. They take about 10min to run on my Mac and more like 15-20min on CI. Development would be easier if the tests ran faster.

In #2055 I added output showing the duration of the slowest tests. For the fv3fit tests run on CI (see here), they are:

========================== slowest 30 test durations ===========================
261.53s call     external/fv3fit/tests/training/test_autoencoder.py::test_autoencoder
135.89s call     external/fv3fit/tests/training/test_graph.py::test_train_graph_network[UNet]
31.59s call     external/fv3fit/tests/training/test_train.py::test_train_default_model_on_identity[precipitative]
22.87s call     external/fv3fit/tests/emulation/test_train_microphysics.py::test_training_entry_integration
21.99s call     external/fv3fit/tests/training/test_train.py::test_train_default_model_on_nonstandard_identity[precipitative]
17.43s call     external/fv3fit/tests/training/test_train_novelty_detection.py::test_train_novelty_default_correct_output[ocsvm_novelty_detector]
16.32s call     external/fv3fit/tests/training/test_train.py::test_train_default_model_on_nonstandard_identity[dense]
15.78s call     external/fv3fit/tests/training/test_train_novelty_detection.py::test_train_novelty_default_extreme_novelties[ocsvm_novelty_detector]
15.17s call     external/fv3fit/tests/training/test_train.py::test_train_default_model_on_identity[dense]
11.90s call     external/fv3fit/tests/training/test_train.py::test_train_with_same_seed_gives_same_result[precipitative]
11.29s call     external/fv3fit/tests/training/test_train.py::test_dump_and_load_default_maintains_prediction[precipitative]
11.25s call     external/fv3fit/tests/training/test_main.py::test_cli[False-True-convolutional]
10.86s call     external/fv3fit/tests/training/test_main.py::test_cli[True-False-convolutional]
10.42s call     external/fv3fit/tests/training/test_main.py::test_cli[False-True-precipitative]
10.39s call     external/fv3fit/tests/training/test_cyclegan.py::test_cyclegan_runs_without_errors
10.38s call     external/fv3fit/tests/training/test_main.py::test_cli[True-False-precipitative]
10.30s call     external/fv3fit/tests/training/test_autoencoder.py::test_autoencoder_overfit
10.03s call     external/fv3fit/tests/training/test_main.py::test_cli[False-False-convolutional]
9.72s call     external/fv3fit/tests/training/test_main.py::test_cli[True-False-dense]
9.61s call     external/fv3fit/tests/training/test_main.py::test_cli[False-False-precipitative]
8.89s call     external/fv3fit/tests/training/test_main.py::test_cli[False-True-dense]
8.88s call     external/fv3fit/tests/training/test_main.py::test_cli[False-False-dense]
8.72s call     external/fv3fit/tests/training/test_train.py::test_train_with_same_seed_gives_same_result[dense]
7.94s call     external/fv3fit/tests/training/test_main.py::test_cli[False-False-random_forest]
7.75s call     external/fv3fit/tests/emulation/test_models.py::test_saved_model_jacobian
7.71s call     external/fv3fit/tests/training/test_main.py::test_cli[False-True-random_forest]
7.45s call     external/fv3fit/tests/training/test_main.py::test_cli[True-False-random_forest]
6.81s call     external/fv3fit/tests/test_out_of_sample.py::test_out_of_sample_identity_same_output_when_in_sample
6.57s call     external/fv3fit/tests/training/test_train.py::test_dump_and_load_default_maintains_prediction[transformed]
6.21s call     external/fv3fit/tests/training/test_train.py::test_predict_does_not_mutate_input[precipitative]

Some individual tests are very slow (test_autoencoder!) and others (e.g. test_cli) are probably run for too many different parameters.
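
For anyone reproducing this locally, this kind of report comes from pytest's built-in --durations option; something like the following (exact test path assumed) prints the 30 slowest tests:

pytest --durations=30 external/fv3fit/tests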

Related: #2040

oliverwm1 commented 1 year ago

I'm surprised to see some tests from test_train_novelty_detection in here. I can work on speeding those up or eliminating them.

nbren12 commented 1 year ago

Many of these tests are marked as slow, so they can be skipped with

pytest -m 'not slow'

This runs in about 80s on my VM, though coverage drops significantly for certain parts of the code base.
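
For context, a minimal sketch of how the slow marker is typically defined and applied (hypothetical test name; fv3net's actual marker registration may live in its pytest config):

# registered once in pytest.ini / setup.cfg, roughly:
#   [pytest]
#   markers =
#       slow: marks tests as slow (deselect with -m "not slow")

import pytest

@pytest.mark.slow
def test_train_graph_network_unet():
    ...  # long-running training test, skipped by `pytest -m 'not slow'`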

Here are the slowest 'not slow' tests:

=================================================================================================================================================================== slowest 10 test durations ===================================================================================================================================================================
16.85s call     external/fv3fit/fv3fit/pytorch/cyclegan/test_cyclegan.py::test_cyclegan_runs_without_errors
2.22s call     external/fv3fit/tests/test_random_forest.py::test_random_forest_predict_dim_size[outputs0]
2.10s call     external/fv3fit/tests/keras/test_pure_keras.py::test_PureKerasDictPredictor_dump_load
1.93s call     external/fv3fit/tests/test_spectral_normalization.py::test_keras
1.52s call     external/fv3fit/tests/keras/test_convolutional_network.py::test_convolutional_network_build_standard_input_gives_standard_output
0.98s call     external/fv3fit/tests/emulation/test_models.py::test_save_and_reload_transformed_model
0.81s call     external/fv3fit/tests/emulation/test_keras.py::test_train_loss_integration
0.80s call     external/fv3fit/tests/emulation/test_keras.py::test_train
0.80s call     external/fv3fit/tests/keras/test_upstream_apis.py::test_train_keras_with_dict_output[False-False]
0.78s call     external/fv3fit/tests/emulation/layers/test_architectures.py::test_dense_local_is_local[8]
mcgibbon commented 1 year ago

When I first added the slow markers, I marked every test over 1s long; we should definitely add the cyclegan test, and possibly the others, to that mark.

The somewhat ludicrous test time for autoencoder and cyclegan is something we could talk about, @oliverwm1. These run much faster on my machine, but apparently only because I have M1 acceleration. Solutions include:

nbren12 commented 1 year ago

One issue to keep in mind is that coverage drops a lot when excluding the slow tests (after adding a slow marker for the cyclegan test). Here is the coverage by subdirectory:

fv3fit/emulation 90%
fv3fit/pytorch   48%
fv3fit/keras     76%
fv3fit/_shared   87%
fv3fit/data      63%
fv3fit/sklearn   63%

coverage.tar.gz
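
For reference, a per-package report like this can be produced with pytest-cov and then aggregated by subdirectory, e.g. (exact paths assumed):

pytest -m 'not slow' --cov=fv3fit external/fv3fit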

oliverwm1 commented 1 year ago

Thanks @mcgibbon @nbren12. Agreed that running the 'not slow' tests is a good idea for local dev. But as Noah points out, we are relying on the slow tests for a lot of our coverage. Can we be more judicious about these tests (i.e. maintain similar coverage with fewer slow tests) or speed them up? I do think the long test time on CI is increasing friction for reviewing/merging PRs, but I would also be a bit nervous about only running the slow tests infrequently.

mcgibbon commented 1 year ago

I find that, almost all of the time, a test which only runs the training code but doesn't check how well it trained will catch bugs introduced into the training code (and technically it has identical coverage). The "skill" tests only need to be run when those training functions are edited, so we could treat them as a trigger-and-main test, similar to the argo integration tests.
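
As a rough, self-contained illustration of that split, using a toy keras model rather than the real fv3fit training functions:

import numpy as np
import pytest
import tensorflow as tf

def _train_identity(epochs: int) -> tf.keras.Model:
    # Toy stand-in for a training function: fit y = x with a single Dense layer.
    x = np.random.uniform(size=(64, 4)).astype(np.float32)
    model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(4)])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05), loss="mse")
    model.fit(x, x, epochs=epochs, verbose=0)
    return model

def test_training_runs_and_predicts():
    # Fast smoke test: only exercises the training code path, no skill assertion.
    model = _train_identity(epochs=1)
    assert model.predict(np.zeros((1, 4), dtype=np.float32)).shape == (1, 4)

@pytest.mark.slow
def test_training_learns_identity():
    # Skill test: only needs to run when the training code itself changes.
    model = _train_identity(epochs=300)
    x = np.random.uniform(size=(8, 4)).astype(np.float32)
    np.testing.assert_allclose(model.predict(x), x, atol=0.1)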

I did try hard to decrease the run time of the PyTorch training tests, but PyTorch is generally very slow on CPU because of how it runs Python code on every training epoch, and there's only so much we can do about it. The architectures implemented also just need more data/epochs to train.

nbren12 commented 1 year ago

Would it be worth writing more unit tests of e.g. the model objects, @mcgibbon? Top-level training tests can be hard to make fast, though I agree that they do tend to find bugs easily.

mcgibbon commented 1 year ago

There's a tradeoff with that solution: it's hard to come up with these tests, and if you rely extensively on unit tests of components of the model-building / training functions, you will frequently need to refactor or update tests when you're only changing internal implementation details of the model, including style refactors. I don't know if it can even be done in this case; the best I could think of for sub-testing CycleGAN was to include the autoencoder, which is already expensive to train.

The best answer may just be "training global models is expensive and tests of this will be expensive".

nbren12 commented 1 year ago

Seems like the generator and discriminator could be unit tested relatively easily: e.g. that they return the right shape, produce O(1) outputs, that the gradient of the output with respect to the input is nonzero, etc. Presumably these are being saved as artifacts, so one would want to maintain some kind of backwards compatibility. These types of tests are fast and usually give a nice amount of coverage even if they aren't fully comprehensive.
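
A minimal, self-contained sketch of that kind of fast check, using a toy stand-in network (the real fv3fit CycleGAN generator and its expected shapes would be imported instead):

import torch
import torch.nn as nn

def _toy_generator() -> nn.Module:
    # Stand-in for the generator; the same checks apply to the real class.
    return nn.Sequential(
        nn.Conv2d(2, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 2, kernel_size=3, padding=1),
    )

def test_generator_output_shape_and_magnitude():
    gen = _toy_generator()
    x = torch.randn(1, 2, 16, 16)
    y = gen(x)
    assert y.shape == x.shape            # returns the right shape
    assert torch.isfinite(y).all()       # no NaNs/infs
    assert y.abs().mean() < 10.0         # roughly O(1) outputs

def test_generator_gradient_is_nonzero():
    gen = _toy_generator()
    x = torch.randn(1, 2, 16, 16, requires_grad=True)
    gen(x).sum().backward()
    assert x.grad is not None and x.grad.abs().sum() > 0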

The glue/config/factory/training-function code can be a bit trickier to test effectively than the basic architectures/models, but often the typechecker can find those kinds of integration bugs.