NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Hanging up, and the occupation of some GPUs is 100% #1033

Open lxianl455 opened 11 months ago

lxianl455 commented 11 months ago

I'm training a model on multiple 3090 GPUs. It starts normally, but the program hangs at the end of an epoch: there is no more output, some GPUs sit at 100% utilization while others are at 0%, and the job only ends when NCCL times out. This is what the output showed at the end:

[screenshots of the console output at the hang]

It looks like some GPUs exited early, causing synchronization to fail. There should be no problem with the code itself: it runs fine on the dataset before data augmentation, but the problem occurs when running on the augmented dataset, much like https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out. I tried to export NCCL_P2P_DISABLE=1, but it didn't help.
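
Since the hang only appears with the augmented dataset, one thing worth checking (assuming the training loop uses PyTorch DistributedDataParallel, which the post does not state) is whether every rank iterates over the same number of batches: if augmentation or filtering changes the per-rank dataset size, the fast ranks reach the next collective while the slow ranks are still in the loop, which produces exactly this end-of-epoch hang. A minimal sketch of such a check, plus the NCCL debug variables that make the hang easier to localize (the helper name is hypothetical):

```python
import os
import torch
import torch.distributed as dist

# NCCL debug variables must be set before init_process_group so the
# communicator logs its setup and collective activity.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL")

def check_batch_counts(dataloader):
    """Hypothetical helper: gather len(dataloader) from every rank and print it.

    If the counts differ, the ranks with fewer batches finish the epoch early
    while the remaining ranks block inside their next collective.
    """
    local = torch.tensor([len(dataloader)], dtype=torch.long, device="cuda")
    counts = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(counts, local)
    if dist.get_rank() == 0:
        print("batches per rank:", [int(c.item()) for c in counts])
```

If the counts really do differ, trimming every rank's loop to the minimum count, or wrapping the loop in the Join context manager from torch.distributed.algorithms.join, is a common way to keep the ranks in lockstep.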

KaimingOuyang commented 10 months ago

Can you provide the output of nvidia-smi topo -m and a backtrace taken while the program hangs? What NCCL version are you using?
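
For collecting that information: nvidia-smi topo -m is run directly in a shell on the node, and native (NCCL-level) stacks can be captured with gdb attached to the hung process. For the Python-level backtrace, a small faulthandler hook (a sketch, not part of NCCL) added to the training script will dump every thread's stack on demand while the job is stuck:

```python
import faulthandler
import signal

# Dump the Python stack of every thread when the process receives SIGUSR1,
# e.g. run `kill -USR1 <pid>` from another shell while the job is hanging.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump the stacks automatically if the process is still alive
# after `timeout` seconds, and keep repeating at that interval.
faulthandler.dump_traceback_later(timeout=1800, repeat=True)
```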