NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Invalid Argument when running nccl-test on a single machine with multiple GPUs (H800) #1338

Closed Jason3900 closed 5 months ago

Jason3900 commented 5 months ago

I'm using NCCL version 2.21.5+cuda12.4 with nvidia-driver 550.54.15 and the same version of nvidia-fabricmanager. When I run nccl-test on a single machine, I get an "Invalid argument" error:

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Error logs:

......
node2:242431:242479 [5] transport/nvls.cc:158 NCCL WARN Cuda failure 1 'invalid argument'

node2:242431:242475 [1] transport/nvls.cc:158 NCCL WARN Cuda failure 1 'invalid argument'
node2:242431:242475 [1] NCCL INFO transport/nvls.cc:330 -> 1
node2:242431:242475 [1] NCCL INFO init.cc:1277 -> 1

node2:242431:242477 [3] transport/nvls.cc:158 NCCL WARN Cuda failure 1 'invalid argument'

node2:242431:242474 [0] transport/nvls.cc:158 NCCL WARN Cuda failure 1 'invalid argument'
node2:242431:242475 [1] NCCL INFO init.cc:1548 -> 1
node2:242431:242474 [0] NCCL INFO transport/nvls.cc:330 -> 1
node2:242431:242475 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
node2:242431:242474 [0] NCCL INFO init.cc:1277 -> 1

I saw that there are similar issues, and I've tried the suggested solutions (rebooting, restarting nvidia-fabricmanager), but none of them worked.
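
For triage, it can help to list which bracketed indices in the log reported the failure from transport/nvls.cc. The snippet below is a minimal sketch: it uses a few lines copied from the log above as sample input, and the temporary file and the `failed_gpus` variable are names chosen only for illustration. In practice, point `LOG` at the saved NCCL_DEBUG=INFO output instead.

```shell
# Sample lines copied from the log above (illustrative input only).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
node2:242431:242479 [5] transport/nvls.cc:158 NCCL WARN Cuda failure 1 'invalid argument'
node2:242431:242475 [1] transport/nvls.cc:158 NCCL WARN Cuda failure 1 'invalid argument'
node2:242431:242475 [1] NCCL INFO transport/nvls.cc:330 -> 1
node2:242431:242477 [3] transport/nvls.cc:158 NCCL WARN Cuda failure 1 'invalid argument'
node2:242431:242474 [0] transport/nvls.cc:158 NCCL WARN Cuda failure 1 'invalid argument'
EOF

# Keep only WARN lines raised from transport/nvls.cc, extract the number in
# square brackets, then de-duplicate and join the results on one line.
failed_gpus=$(grep 'transport/nvls.cc' "$LOG" | grep 'NCCL WARN' \
  | sed -E 's/^[^[]*\[([0-9]+)\].*/\1/' | sort -n | uniq | paste -sd' ' -)

echo "bracketed indices with NVLS warnings: $failed_gpus"
rm -f "$LOG"
```

If every index in the run appears in the result, the failure is likely system-wide rather than tied to one device, which is consistent with a fabric manager problem.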

kiskra-nvidia commented 5 months ago

Such problems are typically caused by the fabric manager not having been restarted correctly. Please consult the following guide: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf (the sequence you need is at the end of Section 2.2).
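
Before and after following the guide, a quick sanity check of the fabric manager state may be useful. The sketch below is not the restart sequence from Section 2.2. It only compares version strings and queries the service and GPU fabric state when the tools are present. `versions_match` is a hypothetical helper, the two version values are the ones reported in this issue, and the assumption that `nvidia-smi -q` prints a fabric section applies to recent drivers on NVSwitch systems.

```shell
# Hypothetical helper: succeeds only when both version strings are
# non-empty and identical.
versions_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Versions as reported in this issue; replace with the values from your system.
driver_ver="550.54.15"
fm_ver="550.54.15"

if versions_match "$driver_ver" "$fm_ver"; then
  version_check="match"
else
  version_check="mismatch"
fi
echo "driver vs fabric manager version: $version_check"

# Service state, queried only when systemctl is available.
if command -v systemctl >/dev/null 2>&1; then
  echo "nvidia-fabricmanager service: $(systemctl is-active nvidia-fabricmanager 2>/dev/null)"
fi

# Per-GPU fabric state, queried only when nvidia-smi is available.
# Assumption: the driver reports a "Fabric" section in the -q output.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -q | grep -i -A 2 'fabric' || true
fi
```

A mismatch, or a service that is not active, suggests the fabric manager is not in the expected state and that the documented sequence is worth applying.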

Jason3900 commented 5 months ago

Appreciate it! I followed the instructions you provided and successfully solved the problem.