SusuXu opened this issue 2 years ago
Here is some environment info. System: Ubuntu 18.04, CUDA: 11.7, NCCL: 2.14.3, GPU: 4x NVIDIA RTX A6000.
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X SYS SYS SYS 0-47 N/A
GPU1 SYS X SYS SYS 0-47 N/A
GPU2 SYS SYS X SYS 0-47 N/A
GPU3 SYS SYS SYS X 0-47 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
It turns out to be a bandwidth issue. After we set NCCL_P2P_DISABLE=1, the test can run through, but with very low bandwidth. We are using an AMD EPYC 7402P CPU; could that be an issue?
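For reference, a run like this can be reproduced with the all_reduce_perf binary from nccl-tests; the binary name and the default ./build path below are assumptions, since the exact test command is not shown in this thread:

# force all inter-GPU traffic through host memory instead of P2P
# -g 4 runs on all four GPUs; -b/-e/-f sweep message sizes from 8 bytes to 1 GB
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 4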
From what you describe, it would seem that GPU Direct is not functional through the CPU. You may want to disable the IOMMU by adding "iommu=pt" to the kernel boot arguments.
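On Ubuntu this is typically done through GRUB; a minimal sketch, assuming the stock /etc/default/grub layout:

# add iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
sudo update-grub
sudo reboot
# after the reboot, confirm the parameter is active
cat /proc/cmdline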
Without P2P, all traffic goes through CPU memory. On AMD CPUs, the performance of memory accesses from the GPU is heavily impacted by the NPS setting in the BIOS (NUMA nodes per socket). I'd advise trying NPS set to 4 and seeing if performance is better.
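Whether an NPS change actually took effect can be checked from Linux after the reboot; a quick sanity check (numactl is assumed to be installed):

# with NPS4 on this single-socket EPYC, Linux should report 4 NUMA nodes
lscpu | grep -i numa
numactl --hardware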
Thanks! Setting amd_iommu=off helps address the communication problem, though the average bandwidth is still low (~6 GB/s).
Setting NPS to 4 slightly increases the bandwidth by 1GB/s.
Thanks for your help!
the average bandwidth is still low (~6GB/s).
Do you mean the "Average BW" printed at the end? This has no meaning if you run with -b 8 -e 1G -f 2, as it would average bandwidths over all sizes, and bandwidth does not make sense for very small operations as they are purely latency-bound.
What is the BusBW of large operations, i.e. what is the value to which the BusBW is converging as sizes increase?
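One way to look only at the large-message regime is to restrict the size sweep; a sketch, again assuming all_reduce_perf from nccl-tests:

# sweep only large messages (128 MB to 1 GB) and read the busbw column
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 128M -e 1G -f 2 -g 4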
This is the error message when I try to run the test: we got a CUDA error and a NET error, as below. The test finally runs into a deadlock. It runs well when using only one GPU, but fails in the two-GPU case. We have set CUDA_HOME to point to /usr/local/cuda, but it keeps throwing an error about not being able to find CUDA.
@sjeaugey
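If the failure really is about locating CUDA, it is worth checking that the toolkit is visible both at build time and at run time; a minimal sketch, assuming the toolkit lives in /usr/local/cuda and the test binaries come from an nccl-tests checkout (directory name assumed):

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# rebuild the tests against that toolkit; add NCCL_HOME=... for a non-default NCCL install
make -C nccl-tests CUDA_HOME=$CUDA_HOME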