Closed Jason3900 closed 5 months ago
Typically such problems are caused by the fabric manager not being restarted correctly. Please consult the following: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf (the sequence you need is at the end of Section 2.2).
Appreciate it! I followed the instructions you provided and successfully solved the problem.
I'm using NCCL version 2.21.5+cuda12.4 with nvidia-driver 550.54.15 and the same version of nvidia-fabricmanager. I ran nccl-tests on a single machine and got an "Invalid argument" error.
Error logs:
I saw that there are similar issues and I've tried the suggested solutions (rebooting, restarting nvidia-fabricmanager), but none of them worked.
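Since the post relies on the driver and nvidia-fabricmanager being the same version, here is a minimal sketch of how that could be double-checked in Python. The helper names (`normalize`, `versions_match`, `driver_version`) are hypothetical, and stripping a trailing `-N` package revision is an assumption about how the fabric manager version string might be reported by a package manager. How you obtain the fabric manager version is left to you, since it depends on how it was installed.

```python
import subprocess


def normalize(version: str) -> str:
    # Assumption: a package revision such as "-1" may trail the version
    # string (e.g. "550.54.15-1"); drop it before comparing.
    return version.strip().split("-")[0]


def versions_match(driver: str, fabric_manager: str) -> bool:
    # The two components are treated as matching only if every
    # dot-separated field is identical.
    return normalize(driver).split(".") == normalize(fabric_manager).split(".")


def driver_version() -> str:
    # Ask nvidia-smi for the driver version; one line is printed per GPU,
    # so the first line is enough.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()
```

For example, `versions_match(driver_version(), "550.54.15")` would compare the running driver against a fabric manager version you looked up yourself. A match does not rule out the restart-order problem described in the NVIDIA guide; it only removes one possible cause.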