NVIDIA / modulus-sym

Framework providing Pythonic APIs, algorithms, and utilities to be used with Modulus core to physics-inform model training, as well as higher-level abstractions for domain experts
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: CUDA Graph capture failures during multi-node DDP runs #47

Closed. ktangsali closed this issue 1 year ago.

ktangsali commented 1 year ago

Version

1.1.0

On which installation method(s) does this occur?

Docker

Describe the issue

Multi-node runs fail during CUDA Graph capture due to NCCL watchdog thread errors. The error logs look like the following:

[E ProcessGroupNCCL.cpp:830] [Rank 10] NCCL watchdog thread terminated with exception: CUDA error: operation not permitted when stream is capturing
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7fe97f5b295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7fe97f56b69d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fe994fd7e12 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x90 (0x7fe90b6dca20 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fe90b6e1708 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11b (0x7fe90b6e602b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x94 (0x7fe90b6e63d4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc2b3 (0x7fe94fcb22b3 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94b43 (0x7fe9966a8b43 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a00 (0x7fe99673aa00 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 10] NCCL watchdog thread terminated with exception: CUDA error: operation not permitted when stream is capturing
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This is most likely due to the issue described here: https://github.com/pytorch/pytorch/pull/104487#issuecomment-1638665876
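
For context, here is a minimal sketch of the kind of warmup-then-capture sequence in which this race can surface. It is not the actual Modulus trainer code: the process group is assumed to be initialized already (e.g. via torchrun), and a bare all_reduce stands in for the full DDP training step.

import time
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already run on every rank.
device = torch.device("cuda", torch.cuda.current_device())
buf = torch.ones(1024, device=device)

# Warmup on a side stream: these iterations enqueue NCCL work that the
# ProcessGroupNCCL watchdog is still polling with CUDA event queries.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        dist.all_reduce(buf)
torch.cuda.current_stream().wait_stream(side_stream)

# If capture begins while the watchdog is still querying events for the
# warmup work, its CUDA calls land on the capturing stream and raise
# "operation not permitted when stream is capturing".
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    dist.all_reduce(buf)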

A current workaround is to add a time delay between the warmup iterations and the start of capture, giving the NCCL watchdog time to clean up outstanding work before capture begins (see the sketch below). This workaround will no longer be required once the PyTorch base container version used for Modulus is updated to 23.07.
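
Continuing the sketch above, the workaround amounts to draining the device and pausing between warmup and capture, roughly along these lines (the one-second delay is illustrative, not the exact value used in Modulus):

# Workaround sketch: give the NCCL watchdog time to finish retiring the
# warmup work before graph capture starts.
torch.cuda.synchronize()   # drain all warmup work on the device
time.sleep(1.0)            # illustrative delay; lets the watchdog's polling
                           # of the completed warmup work settle

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    dist.all_reduce(buf)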

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

Other/Misc.

No response