NVIDIA / modulus-sym

Framework providing Pythonic APIs, algorithms, and utilities to be used with Modulus core for physics-informed model training, as well as higher-level abstractions for domain experts
https://developer.nvidia.com/modulus
Apache License 2.0
165 stars, 68 forks

Hot fix NCCL CUDA Graphs bug #48

Closed ktangsali closed 1 year ago

ktangsali commented 1 year ago

Modulus Pull Request

Description

Closes #47

The workaround adds a time delay between the warmup steps and the start of graph capture, giving the NCCL watchdog enough time to finish its clean-up work. The fix was stress-tested against the FPGA example and passed 20 out of 20 runs.
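The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `warmup_then_capture`, its parameters, and the default values (including the delay length, which the description does not state) are all hypothetical. The step and capture functions are passed in as plain callables so the sketch stays framework-agnostic.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def warmup_then_capture(
    warmup_step: Callable[[], None],
    capture: Callable[[], T],
    n_warmup: int = 3,       # hypothetical default, not taken from the PR
    delay_s: float = 1.0,    # hypothetical default; the PR does not state the delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run eager warmup steps, pause, then start graph capture."""
    # 1. Warmup: run the step eagerly a few times before capturing.
    for _ in range(n_warmup):
        warmup_step()

    # 2. Delay: pause so background work left over from warmup
    #    (in this PR, the NCCL watchdog's clean-up) can finish
    #    before capture begins.
    sleep(delay_s)

    # 3. Capture: only now start recording the graph.
    return capture()
```

In a PyTorch training loop, `warmup_step` would presumably be the eager training step and `capture` a function that records the same step inside a `torch.cuda.graph(...)` context. A fixed sleep is a workaround rather than a synchronization guarantee, which is likely why the description reports repeated stress testing.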

Checklist

Dependencies

ktangsali commented 1 year ago

/blossom-ci