NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Not all GPUs have NVLink; the communicated data is all incorrect #1423

Open zhaowujin opened 1 week ago

zhaowujin commented 1 week ago

I have a machine with eight GPUs. Only GPUs 6 and 7 have an NVLink connection between them; the other cards do not. Only GPUs 6 and 7 communicate correctly, and the data communicated between the other GPUs is all incorrect. I think that in this situation NCCL should detect the topology automatically and use PCIe. The detailed errors are as follows:

Image

GPUs 6 and 7 produce correct results.

Image

The results for all other GPUs are incorrect.
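
One way to narrow down a symptom like this is to rerun the failing job with NCCL's peer-to-peer transport turned off and compare the results. The sketch below only builds the environment for such a run. `NCCL_P2P_DISABLE` and `NCCL_DEBUG` are documented NCCL environment variables; the helper name `nccl_diagnostic_env` is hypothetical, and note that disabling P2P also disables NVLink transfers, so this is a diagnostic step rather than a fix.

```python
import os


def nccl_diagnostic_env(base_env=None):
    """Return a copy of the environment with NCCL P2P transport disabled.

    `base_env` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Documented NCCL variable: disable direct GPU-to-GPU (P2P) transport,
    # which covers both NVLink and PCIe peer access.
    env["NCCL_P2P_DISABLE"] = "1"
    # Documented NCCL variable: log which transport each connection uses.
    env["NCCL_DEBUG"] = "INFO"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` together with whatever command reproduces the problem. If the data becomes correct with P2P disabled, the PCIe peer-to-peer path is the likely culprit.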

sjeaugey commented 1 week ago

Could it be an ACS issue?

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs

zhaowujin commented 1 week ago

Does that mean I must either connect all 8 GPUs through NVLink or connect all 8 GPUs through PCIe?

sjeaugey commented 1 week ago

No, I was only asking you to check whether ACS is enabled (if you are running on bare metal). If it is, try disabling it and see whether that fixes the issue.
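
The ACS check suggested above can be sketched as follows. The NCCL troubleshooting page linked earlier describes looking for `SrcValid+` on `ACSCtl` lines in `sudo lspci -vvv` output; this is a minimal parser for that output, assuming the usual `lspci` layout in which device header lines start in the first column and capability lines are indented. The function name `devices_with_acs_enabled` is hypothetical.

```python
def devices_with_acs_enabled(lspci_output):
    """Return the header lines of PCI devices whose ACSCtl shows SrcValid+.

    `lspci_output` is the text printed by `lspci -vvv` (run as root so the
    capability lists are fully readable).
    """
    flagged = []
    current = None  # header line of the device currently being parsed
    for line in lspci_output.splitlines():
        if line and not line[0].isspace():
            # Non-indented line: start of a new device entry.
            current = line.strip()
        elif current is not None and "ACSCtl:" in line and "SrcValid+" in line:
            # Indented capability line reporting ACS source validation enabled.
            flagged.append(current)
    return flagged
```

To use it, capture the output of `sudo lspci -vvv` (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass the text to the function. An empty result suggests ACS is not enabled on any listed device.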

zhaowujin commented 1 week ago

In the end, I disabled the IOMMU in the BIOS. For the specific steps, see https://github.com/pytorch/pytorch/issues/84803. Thank you!
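
For anyone wanting to confirm the IOMMU state from a running Linux system, a rough heuristic is to look at `/sys/kernel/iommu_groups`, which the kernel populates with numbered groups when an IOMMU is active. This is a minimal sketch under that assumption; the helper name `iommu_groups` is hypothetical, and an empty result is only a hint, not proof, that the IOMMU is off.

```python
import os


def iommu_groups(path="/sys/kernel/iommu_groups"):
    """List IOMMU group names under `path`.

    Returns an empty list when the directory is missing or empty, which
    suggests that no IOMMU is active on this system.
    """
    try:
        return sorted(os.listdir(path))
    except FileNotFoundError:
        return []
```

Calling `iommu_groups()` before and after changing the BIOS setting should show the list going from non-empty to empty if the IOMMU was actually disabled.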