Version
1.1.0
On which installation method(s) does this occur?
Docker
Describe the issue
Multi-node runs fail during CUDA graph capture due to NCCL watchdog thread errors. The error logs look similar to the following:
This is most likely due to the following issue: https://github.com/pytorch/pytorch/pull/104487#issuecomment-1638665876
A current workaround is to add a time delay between the warmup and the start of capture, allowing the NCCL watchdogs to clean up work before the capture begins. This workaround will not be required after the PyTorch base container version used for Modulus is updated to 23.07.
Minimum reproducible example
No response
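Although no reproducible example is attached, the delay workaround described in the issue might be sketched roughly as follows. This is a minimal illustration, not Modulus code: the helper names, the warmup loop shape, and the two-second delay are assumptions chosen for clarity.

```python
import time


def delay_before_capture(delay_s=2.0):
    # The workaround itself: pause between warmup and capture so NCCL
    # watchdog threads can clean up outstanding collective work before
    # CUDA graph capture begins. Returns the measured elapsed time.
    start = time.monotonic()
    time.sleep(delay_s)
    return time.monotonic() - start


def capture_graph(model, static_input, warmup_iters=3, delay_s=2.0):
    # Hypothetical capture routine; requires a CUDA-enabled PyTorch build,
    # so torch is imported lazily here.
    import torch

    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup_iters):  # warmup runs on a side stream
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)
    torch.cuda.synchronize()

    delay_before_capture(delay_s)  # workaround: let NCCL watchdogs drain

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):  # capture one forward pass for replay
        static_output = model(static_input)
    return graph, static_output
```

The delay length is empirical; a couple of seconds is typically enough for the watchdog threads to finish, but the right value may depend on job size.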
Relevant log output
No response
Environment details
No response
Other/Misc.
No response