Open zhaowujin opened 1 week ago
Could it be an ACS issue?
Does that mean I can only choose between connecting all 8 GPUs through NVLink and connecting all 8 GPUs through PCIe?
No, I was only asking you to check whether ACS is enabled (if you are running on bare metal) and, if it is, to try disabling it and see whether that fixes the issue.
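For anyone who wants to run that check, here is a minimal sketch. It assumes a Linux host where `lspci -vvv` (run as root) prints an `ACSCtl:` line for each device that has the ACS capability, and it treats any flag ending in `+` (for example `SrcValid+`) as a sign that ACS may be enabled. The function names are my own and are not part of any NCCL or PyTorch tooling.

```python
import subprocess


def devices_with_acs_enabled(lspci_text):
    """Return PCI addresses whose ACSCtl line has at least one flag set ('+')."""
    enabled = []
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Device header lines start in column 0, e.g. "00:01.0 PCI bridge: ..."
            current = line.split()[0]
        elif "ACSCtl:" in line and current is not None:
            flags = line.split("ACSCtl:", 1)[1].split()
            if any(flag.endswith("+") for flag in flags):
                enabled.append(current)
    return enabled


def run_lspci():
    """Capture `lspci -vvv` output; usually needs root to show capabilities."""
    result = subprocess.run(
        ["lspci", "-vvv"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Example use on the affected machine (not executed here):
# print(devices_with_acs_enabled(run_lspci()))
```

An empty list suggests ACS is not active on any bridge. If addresses are listed, those are the bridges to look at when disabling ACS.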
In the end, I disabled IOMMU in the BIOS. For the specific steps, see https://github.com/pytorch/pytorch/issues/84803 . Thank you!
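To confirm from the OS that the BIOS change took effect, a rough check is to count the entries under `/sys/kernel/iommu_groups`, which on Linux is typically populated only while an IOMMU is active. This is a sketch under that assumption; the function name is hypothetical, and a count of zero is a hint rather than proof.

```python
import os


def iommu_group_count(root="/sys/kernel/iommu_groups"):
    """Count IOMMU groups exposed by the kernel; 0 suggests the IOMMU is off."""
    try:
        return sum(1 for entry in os.scandir(root) if entry.is_dir())
    except FileNotFoundError:
        # The directory is absent on kernels without IOMMU support.
        return 0


# Example use on the affected machine (not executed here):
# print("IOMMU groups:", iommu_group_count())
```

A non-zero count after the BIOS change would suggest the IOMMU is still enabled, for example through a kernel command-line option.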
I have a machine with eight GPUs. Only GPUs 6 and 7 have an NVLink connection between them; the other cards have none. Only GPUs 6 and 7 communicate correctly, and the data exchanged between all the other GPUs is incorrect. I think that in this situation the topology should be detected automatically and PCIe should be used where there is no NVLink. The detailed errors are as follows:
GPUs 6 and 7 are normal.
All other GPUs are incorrect.
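A symptom like this can be reproduced outside of NCCL with a direct device-to-device copy test. The sketch below assumes PyTorch with CUDA is installed; `find_bad_pairs` and `torch_copy_is_correct` are hypothetical helper names, and this is not the test that produced the output above. It copies a known tensor from each GPU to every other GPU and reports the ordered pairs where the data arrives corrupted.

```python
from itertools import permutations


def find_bad_pairs(num_devices, copy_is_correct):
    """Return ordered (src, dst) pairs for which copy_is_correct(src, dst) is False."""
    return [
        (src, dst)
        for src, dst in permutations(range(num_devices), 2)
        if not copy_is_correct(src, dst)
    ]


def torch_copy_is_correct(src, dst, numel=1 << 20):
    """Copy a known tensor from GPU src to GPU dst and compare both on the host."""
    import torch  # imported lazily so find_bad_pairs works without PyTorch

    # Values below 2**24 are exactly representable in float32.
    x = torch.arange(numel, dtype=torch.float32, device=f"cuda:{src}")
    y = x.to(f"cuda:{dst}")
    return torch.equal(x.cpu(), y.cpu())


# Example use on the affected machine (not executed here):
# import torch
# print(find_bad_pairs(torch.cuda.device_count(), torch_copy_is_correct))
```

On the machine described here, one would expect every pair except (6, 7) and (7, 6) to be reported before the fix, and an empty list afterwards.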