too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0

On both our V100 (Intel Cascade Lake) and A100 (AMD Milan) systems (both RHEL 8.4 currently), I'm seeing too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0.

On both systems, I get the error "Too many failed tests (437), maximum allowed is 400", with:

WARNING: 285 test failures, 152 test errors (out of 86678):
distributions/test_constraints (2 failed, 128 passed, 2 skipped, 2 warnings)
distributed/fsdp/test_distributed_checkpoint (2 total tests, failures=2)
distributed/fsdp/test_fsdp_apply (3 total tests, failures=3)
distributed/fsdp/test_fsdp_input (2 total tests, failures=2)
distributed/fsdp/test_fsdp_meta (14 total tests, failures=14)
distributed/fsdp/test_fsdp_misc (9 total tests, failures=9)
distributed/fsdp/test_fsdp_mixed_precision (90 total tests, failures=88)
distributed/fsdp/test_fsdp_state_dict (61 total tests, failures=61)
distributed/fsdp/test_fsdp_summon_full_params (73 total tests, failures=65)
distributions/test_distributions (219 total tests, failures=1)
test_autograd (484 total tests, failures=1, skipped=16, expected failures=2)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)
test_jit_cuda_fuser (147 total tests, errors=1, skipped=19)
test_jit_legacy (2661 total tests, failures=12, errors=8, skipped=133, expected failures=7)
test_jit_profiling (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)
test_ops_gradients (6968 total tests, errors=1, skipped=3597, expected failures=85)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=3, errors=40, skipped=51)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=131)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)
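
For reference, the 437 in the error message is the sum of the reported failures and errors (285 + 152). The sketch below shows one way to re-derive such totals from the per-suite lines. It is not the PyTorch easyblock's actual parser; the function name and regular expressions are assumptions made for illustration. It only counts unittest-style lines and skips pytest-style ones such as distributions/test_constraints.

```python
import re

# One unittest-style summary line, e.g.
# "test_jit (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)"
SUITE_RE = re.compile(r"^(\S+) \(\d+ total tests([^)]*)\)$")
# Matches "failures=N" and "errors=N", but not "expected failures=N"
COUNT_RE = re.compile(r"(?<!expected )\b(failures|errors)=(\d+)")


def sum_failed(summary_lines):
    """Return (failures, errors) summed over unittest-style summary lines."""
    totals = {"failures": 0, "errors": 0}
    for line in summary_lines:
        match = SUITE_RE.match(line.strip())
        if match is None:
            continue  # e.g. pytest-style "(2 failed, 128 passed, ...)" lines
        for kind, count in COUNT_RE.findall(match.group(2)):
            totals[kind] += int(count)
    return totals["failures"], totals["errors"]


# A few lines copied from the report above
sample = [
    "distributions/test_constraints (2 failed, 128 passed, 2 skipped, 2 warnings)",
    "distributed/fsdp/test_fsdp_meta (14 total tests, failures=14)",
    "test_jit (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)",
    "test_package (131 total tests, errors=46, skipped=23)",
]
failures, errors = sum_failed(sample)
print(failures, errors, failures + errors)
```

Fed all unittest-style lines of the report above, the same counting should reproduce the 285 failures and 152 errors in the WARNING line, which suggests the two pytest-style failures in distributions/test_constraints are not part of that total.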

That is significantly more than what @casparvl and @smoors observed in #15924, so I'm a bit puzzled here... (That said, I guess not all test reports there used the enhanced PyTorch easyblock from https://github.com/easybuilders/easybuild-easyblocks/pull/2803, which counts failing tests correctly.)

@Flamefire Do some of these failing tests happen to ring a bell for you? In #15924 you mentioned that you have some patches lined up for PyTorch 1.12.x (but perhaps we need to get #16453 and #16484 merged first?).

@boegel At your request, here are the results on our system with 4x A100 per node and Intel CPUs:

== 2022-10-30 05:06:46,214 pytorch.py:344 WARNING 41 test failures, 152 test errors (out of 86678):
distributions/test_constraints (2 failed, 128 passed, 2 skipped, 2 warnings)
distributed/_shard/sharded_tensor/test_sharded_tensor (58 total tests, errors=1)
distributions/test_distributions (219 total tests, failures=1)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2661 total tests, failures=12, errors=7, skipped=127, expected failures=7)
test_jit_cuda_fuser (147 total tests, errors=1, skipped=18)
test_jit_legacy (2661 total tests, failures=12, errors=8, skipped=125, expected failures=7)
test_jit_profiling (2661 total tests, failures=12, errors=7, skipped=127, expected failures=7)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=3, errors=40, skipped=47)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=129)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)
test_torch (853 total tests, failures=1, skipped=65)

On our other cluster, the build node has 4x Titan V GPUs, and the test suite produced:

== 2022-10-30 17:34:45,721 pytorch.py:344 WARNING 39 test failures, 146 test errors (out of 86183):
distributions/test_distributions (219 total tests, failures=1, skipped=5)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2656 total tests, failures=12, errors=7, skipped=174, expected failures=7)
test_jit_legacy (2656 total tests, failures=12, errors=8, skipped=169, expected failures=7)
test_jit_profiling (2656 total tests, failures=12, errors=7, skipped=174, expected failures=7)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=2, errors=36, skipped=75)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=133)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)
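
To see which suites account for the gap between two such reports, the suite names can be compared as sets. This is a minimal sketch; the helper names are hypothetical and the sample lines are a small subset copied from the reports in this issue.

```python
import re

# Suite name is everything before the first " (" on a summary line
SUITE_RE = re.compile(r"^(\S+) \(")


def failing_suites(summary_lines):
    """Return the set of suite names mentioned in a list of summary lines."""
    return {m.group(1) for m in map(SUITE_RE.match, summary_lines) if m}


def only_in_first(first, second):
    """Suites reported as failing in `first` but not in `second`."""
    return sorted(failing_suites(first) - failing_suites(second))


# Subset of the first report (V100/A100 systems)
report_a = [
    "distributed/fsdp/test_fsdp_meta (14 total tests, failures=14)",
    "test_autograd (484 total tests, failures=1, skipped=16, expected failures=2)",
    "test_fx (924 total tests, errors=10, skipped=190, expected failures=6)",
]
# Subset of the 4x A100 report
report_b = [
    "test_fx (924 total tests, errors=10, skipped=190, expected failures=6)",
    "test_torch (853 total tests, failures=1, skipped=65)",
]

print(only_in_first(report_a, report_b))
print(only_in_first(report_b, report_a))
```

On the full reports, this kind of comparison makes it easy to spot that the distributed/fsdp suites show up only in the first report.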

@Flamefire Do some of these failing tests happen to ring a bell for you? In https://github.com/easybuilders/easybuild-easyconfigs/pull/15924 you mentioned that you have some patches lined up for PyTorch 1.12.x (but perhaps we need to get https://github.com/easybuilders/easybuild-easyconfigs/pull/16453 and https://github.com/easybuilders/easybuild-easyconfigs/pull/16484 merged first?).

Yes: PyTorch 1.12 is not compatible with Python 3.10 yet, so most of the test failures are real and caused by that incompatibility.

#16453 fixes a number of genuine failures, mostly related to PPC but also a few others. #16484 (where I'm still working on the last 2 tests) includes the patches from #16453 and adds patches that fix compatibility with Python 3.10 (and also CUDA 11.7).

On that topic: I liked the old way of reporting failing test suites/files (e.g. "test_jit_profiling") better. While working on the above 2 PRs I noticed that many sub-test failures in the same file can be fixed by a single patch, so IMO that output was more useful for investigating and manually reproducing failures. We can also exclude whole test suites/files with the EC param I added long ago.
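
As an illustration of excluding whole test files, here is a hedged easyconfig fragment. It assumes the EC param meant here is the PyTorch easyblock's `excluded_tests`, which maps an architecture string to a list of test files, with the empty string applying to all architectures. The two entries are only examples picked from the failing suites reported in this issue, not a recommendation to skip them.

```python
# Fragment of a PyTorch easyconfig (easyconfigs use Python syntax).
# Assumption: `excluded_tests` is the parameter referred to above.
excluded_tests = {
    # '' is taken to mean "all architectures"
    '': [
        # illustrative picks from the failing suites in this issue
        'distributed/fsdp/test_fsdp_meta',
        'distributed/fsdp/test_fsdp_state_dict',
    ],
}
```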

So I would:

* merge #16453, which is ready
* drop PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb, which has serious bugs, in favor of the 1.12.1 one, which I'm about to finish

@Flamefire How come @casparvl isn't seeing a whole bunch of those errors though, if they largely boil down to incompatibilities with Python 3.10? We're also not seeing those errors for the CPU-only installations of PyTorch/1.12.0-foss-2022a:

== 2022-11-22 19:33:17,511 pytorch.py:344 WARNING 39 test failures, 147 test errors (out of 86167):
distributions/test_distributions (219 total tests, failures=1, skipped=5)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2656 total tests, failures=12, errors=7, skipped=174, expected failures=7)
test_jit_legacy (2656 total tests, failures=12, errors=8, skipped=169, expected failures=7)
test_jit_profiling (2656 total tests, failures=12, errors=7, skipped=174, expected failures=7)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=2, errors=37, skipped=75)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=133)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)

That shows a very similar result to what @casparvl shared, yet for our GPU installations I'm getting way more failing tests... @casparvl Did you do the installation with a single GPU available (in a Slurm job), or with all 4 GPUs available for running the tests?

The CPU version may avoid running into the code intended for Python 3.10. Also, since some tests depend on the number of GPUs, there may be more or fewer such failures. For the rest I'd need more info, but I'd suggest trying out the version I'm currently fixing once it is ready and checking whether the failures are gone before we spend time guessing.

FWIW: This is from my working document listing the failing tests (i.e. files) and the patch that fixes each:

    * distributed/fsdp/test_fsdp_pure_fp16      - PyTorch-1.11.0_fix-fsdp-fp16-test.patch
    * distributed/rpc/cuda/test_tensorpipe_agent - ?
    * distributed/rpc/test_tensorpipe_agent     - PyTorch-1.12.1_fix-use-after-free-in-tensorpipe-agent.patch
    * distributions/test_constraints
    * distributions/test_distributions          - PyTorch-1.12.1_fix-test_wishart_log_prob.patch
    * test_ao_sparsity                          - PyTorch-1.12.1_skip-ao-sparsity-test-without-fbgemm.patch
    * test_cpp_extensions_aot_no_ninja          - PyTorch-1.12.1_fix-cuda-gcc-version-check.patch
    * test_cpp_extensions_jit                   - PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch
    * test_fx                                   - PyTorch-1.12.1_python-3.10-compat.patch
    * test_jit_cuda_fuser                       - PyTorch-1.12.1_fix-TestCudaFuser.test_unary_ops.patch
    * test_jit_legacy                           - PyTorch-1.12.1_python-3.10-annotation-fix.patch
    * test_jit_profiling                        - PyTorch-1.12.1_python-3.10-annotation-fix.patch
    * test_jit                                  - PyTorch-1.12.1_python-3.10-annotation-fix.patch
    * test_model_dump                           - PyTorch-1.10.0_fix-test-model_dump.patch
    * test_nn                                   - PyTorch-1.12.1_fix-vsx-loadu.patch
    * test_ops_gradients                        - PyTorch-1.12.1_skip-failing-grad-test.patch
    * test_ops                                  - PyTorch-1.12.1_increase-tolerance-test_ops.patch
    * test_optim                                - PyTorch-1.12.1_increase-test-adadelta-tolerance.patch
    * test_quantization                         - PyTorch-1.12.1_python-3.10-annotation.patch
    * test_reductions                           - PyTorch-1.12.1_python-3.10-compat.patch
    * test_sort_and_select                      - PyTorch-1.12.1_python-3.10-compat.patch
    * test_sparse                               - PyTorch-1.12.1_python-3.10-compat.patch
    * test_tensor_creation_ops                  - PyTorch-1.12.1_python-3.10-compat.patch
    * test_torch                                - PyTorch-1.12.1_fix-TestTorch.test_to.patch
    * test_unary_ufuncs                         - PyTorch-1.12.1_fix-vsx-vector-funcs.patch
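
A list like this can be turned into a quick lookup to check which failing suites from a report already have a patch. The sketch below uses only a subset of the entries above, and the function name is hypothetical. Suites with no patch listed, or not in the list at all, end up in the "unpatched" group.

```python
# Subset of the working list above: failing test file -> patch
# (None where no patch is listed)
PATCH_FOR = {
    "distributions/test_constraints": None,
    "distributions/test_distributions": "PyTorch-1.12.1_fix-test_wishart_log_prob.patch",
    "test_fx": "PyTorch-1.12.1_python-3.10-compat.patch",
    "test_jit": "PyTorch-1.12.1_python-3.10-annotation-fix.patch",
    "test_tensor_creation_ops": "PyTorch-1.12.1_python-3.10-compat.patch",
}


def split_by_patch(failing, patch_for=PATCH_FOR):
    """Split failing suites into {suite: patch} and a list of suites without a known patch."""
    patched = {name: patch_for[name] for name in failing if patch_for.get(name)}
    unpatched = [name for name in failing if not patch_for.get(name)]
    return patched, unpatched


# Failing suites taken from the reports earlier in this issue
failing = ["test_fx", "test_jit", "test_package", "distributions/test_constraints"]
patched, unpatched = split_by_patch(failing)
print(patched)
print(unpatched)
```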

too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0 #16733