Migrate unit tests to pytest?

kalekundert commented 10 months ago

While I was working on #78, the test suite caused me a lot of trouble. There were two main problems:

Running the tests takes a really long time. I ended up writing some scripts to run the tests on my institution's cluster, but even then ≈50 of the tests (individual test functions, not test files) run up against a 4h time limit. The whole suite takes >34h to run, not counting any tests that were killed for exceeding the time limit. There are also a handful of tests that require >32 GB of RAM!
There are ≈30 tests that are either broken or flaky. This makes it a bit scary to develop new features, because you're not sure which test failures you're actually responsible for. And of course, even if you realize that you're not responsible for a particular failure, you have no way of knowing if you broke something relating to these tests. I do think this relates back to the first problem, because if it were practical to run the whole test suite for each commit, there would surely be fewer broken tests.

Migrating to pytest wouldn't solve the underlying issue, but it has a few features that I think would help a lot. The biggest is the ability to "mark" tests. There are two marks that I'd want to use extensively. The first is xfail, which indicates that a test is currently broken and expected to fail. The failures are still reported, so you know they're happening, but they don't affect whether or not the whole suite is considered to pass or fail. This means that you can require new contributions to not break any currently working tests, while not having to fix all the broken/flaky tests up front. The second mark is slow, which would indicate that a test takes more than a few seconds to complete. With this information, GitHub could be configured to automatically run all the "not slow" tests for each PR, which would help protect against breakages. The slow tests could either be gradually made fast enough to include in the automated test suite, or just be run manually when there's some reason to.

Pytest also supports running tests in parallel across multiple processes, via the pytest-xdist plugin. This alone wouldn't be enough to make the tests run in a reasonable amount of time, but it would help. Finally, pytest makes it easier to share code between tests (because it doesn't require every test to belong to a class) and has features like parametrization that can cut down on boilerplate. These last two things wouldn't make the tests run any faster, but would make them easier to write.

Is this something you'd be in favor of? I'm not asking for you to do anything yourself, I'm more asking for permission to do something like this the next time I want to make a non-trivial change to the code. (With no guarantees that I'll ever actually get around to it.)

In case it's of any interest, here's the test report I generated for the most recent commit (94380ef):

RUNTIME: 1 day, 10:31:15

TOTAL:   864
PASS:    771
FAIL:     39
KILLED:   54

FAILURES:
  AssertionError: ((24, 24, 10, 10, 10), 9)
      nn.test_convolution.TestConvolution.test_padding_modes_r3conv

  AssertionError: False is not true...
      group.test_combine_subgroups.TestSubgroups.test_o3
      group.test_subgroups.TestSubgroups.test_restrict_so3_o2

  AssertionError: Induction from general representations is not supported yet
      nn.test_pooling.TestPooling.test_induced_norm_pooling

  AssertionError: The error found during equivariance check with element...
      nn.test_r3upsampling.TestUpsampling.test_cyclic_even_nearest
      nn.test_r2upsampling.TestUpsampling.test_cyclic_even_nearest
      nn.test_r3upsampling.TestUpsampling.test_o3_nearest
      nn.test_r2upsampling.TestUpsampling.test_cyclic_odd_nearest
      nn.test_r3upsampling.TestUpsampling.test_dihedral_even_nearest
      nn.test_r3upsampling.TestUpsampling.test_ico_nearest
      nn.test_r3upsampling.TestUpsampling.test_cone_nearest
      nn.test_r2upsampling.TestUpsampling.test_dihedral_odd_bilinear
      nn.test_r3upsampling.TestUpsampling.test_dihedral_nearest
      nn.test_tensorproduct.TestTensor.test_with_spatial_dimns
      nn.test_r3upsampling.TestUpsampling.test_so2_nearest
      nn.test_r3upsampling.TestUpsampling.test_so3_trilinear
      nn.test_r2upsampling.TestUpsampling.test_cyclic_odd_bilinear
      nn.test_r3upsampling.TestUpsampling.test_dihedral_odd_nearest
      nn.test_r3upsampling.TestUpsampling.test_inv_nearest
      nn.test_r3upsampling.TestUpsampling.test_so3_nearest
      nn.test_r3upsampling.TestUpsampling.test_cube_nearest
      nn.test_r2upsampling.TestUpsampling.test_so2_nearest
      nn.test_r2upsampling.TestUpsampling.test_dihedral_even_nearest
      nn.test_r2upsampling.TestUpsampling.test_dihedral_odd_nearest
      nn.test_r2upsampling.TestUpsampling.test_o2_nearest
      nn.test_convolution.TestConvolution.test_ico_sparse

  NotImplementedError
      group.test_subgroups.TestSubgroups.test_restrict_ico_dn
      group.test_subgroups.TestSubgroups.test_restrict_ico_tetra
      group.test_combine_subgroups.TestSubgroups.test_ico
      group.test_subgroups.TestSubgroups.test_restrict_octa_dn
      group.test_subgroups.TestSubgroups.test_restrict_octa_tetra

  RuntimeError: Found no NVIDIA driver on your system...
      nn.test_amp.TestMixedPrecision.test_r2conv_uniform
      nn.test_amp.TestMixedPrecision.test_r3pointconv_non_uniform
      nn.test_amp.TestMixedPrecision.test_r2conv_non_uniform
      nn.test_amp.TestMixedPrecision.test_gnorm
      nn.test_amp.TestMixedPrecision.test_r3conv_non_uniform
      nn.test_amp.TestMixedPrecision.test_fourier
      nn.test_amp.TestMixedPrecision.test_fieldnorm

  RuntimeError: could not construct a memory descriptor using a format tag
      nn.test_export.TestExport.test_R2ConvTransposed

The upsampling equivariance tests are the ones that are flaky. The mixed precision tests probably aren't really failures, I just wasn't running on nodes with GPUs.

To end on a more positive note, I should say that I really appreciate how extensive the test suite is. There are so many academic projects out there with not enough tests, but this is the only one I know of with too many tests! It's a good problem to have, I think.

Gabri95 commented 10 months ago

Hi @kalekundert

Thanks a lot for opening this issue! This is indeed quite annoying and I often end up waiting a long time to rerun all tests before pushing new features.

I am not familiar with pytest but it seems very convenient from what you say, so I support adopting this solution!

Let me comment briefly on the failed tests:

all mixed precision tests indeed fail since you are not running on GPU
some test-restrict fail since some of those subgroups have indeed not been implemented. These tests are expected to fail and should be marked as such
TestUpsampling is a tricky one since it's only approximately equivariant. Nearest upsampling is expected to fail in general. Bilinear usually work, but it is still quite noisy so it might fail in a few random cases.
nn.test_convolution.TestConvolution.test_ico_sparse this checks the equivariance of convolution with a non-bandlimited basis, so it is expected to have a rather large error. This should also be marked
nn.test_pooling.TestPooling.test_induced_norm_pooling this is indeed a feature not implemented yet, although it is not particularly interesting. It is expected to fail

The following failures are suspicious, though. I will take a look, thanks for reporting that!

nn.test_tensorproduct.TestTensor.test_with_spatial_dimns
group.test_combine_subgroups.TestSubgroups.test_o3, group.test_subgroups.TestSubgroups.test_restrict_so3_o2
nn.test_convolution.TestConvolution.test_padding_modes_r3conv

To end on a more positive note, I should say that I really appreciate how extensive the test suite is. There are so many academic projects out there with not enough tests, but this is the only one I know of with too many tests! It's a good problem to have, I think.

Ahah nice, thanks, I really appreciate this comment! 😄

psteinb commented 10 months ago

I personally would favor a port to pytest too. Let me know, if any help would be needed.

psteinb commented 10 months ago

Another aspect I like with vanilla pytest is, that tests can be marked. This way, I can isolate GPU-only tests from short runtime tests. The latter are usually the default and will be run for every commit (for the sake of rapid feedback). The GPU unit tests I then schedule to run once a day or once a week.

See https://docs.pytest.org/en/latest/how-to/mark.html

kalekundert commented 10 months ago

I don't have any immediate plans to work on this, so feel free to start working on it yourself if you want to. I actually don't think it'd be that big a change. Pytest can run unittest-style tests, so you wouldn't have to rewrite any code. Just add some marks and configure CI to run pytest.

If you do want to work on this, one thing that might be useful is this JSON file I compiled. It's just a dictionary where the keys are test names and the values are how long it takes for that test to run, in seconds. My intention was to use this to quickly mark all the "slow" tests. Some tests will be missing from this dictionary. Those are likely, but not necessarily, the ones that took >4h to complete.

QUVA-Lab / escnn

Migrate unit tests to pytest? #79