NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

How to locate the hanging node? #1324

Open Puzzzle7 opened 4 months ago

Puzzzle7 commented 4 months ago

Currently, when I encounter an NCCL timeout error, locating the hanging node is quite time-consuming. Does NCCL provide a feature for identifying which node is hanging? If not, could you suggest ideas for implementing one in NCCL?

kiskra-nvidia commented 4 months ago

NCCL does not have any such feature, but it is something we are currently investigating.