kingder opened this issue 3 years ago
This is weird. Your nvidia-smi topo -m shows mlx5_0 and GPUs 0/1 to be on the same PCI switch, but the NCCL topology shows the GPU-NIC communication needs to go through the CPU, which explains why we get bad performance.
It would be interesting to run again on all 8 GPUs of each node and see if NCCL can use GPU Direct RDMA with GPUs close to the NIC.
Also note NCCL cannot properly detect the PCI Gen4 speeds and reverts to Gen3. This is probably because you are running an old kernel/distro (e.g. Ubuntu 16). It might not change anything in the end for this particular situation, but it could be a problem in some corner cases.
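In case it is useful, the negotiated PCIe speed can be cross-checked directly from lspci (the device addresses below are placeholders, not taken from this system); Gen3 reports 8GT/s and Gen4 reports 16GT/s:

# Show the negotiated link speed/width for the NIC and a GPU (bus addresses are examples)
sudo lspci -s 13:00.0 -vvv | grep LnkSta:
sudo lspci -s af:00.0 -vvv | grep LnkSta: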
Thanks for the reply.
Yeah, we use CentOS 7.6 with kernel 3.10.0-957.el7.x86_64
After searching the issues and the NCCL troubleshooting guide, I can see two potential problems:
1. We haven't enabled GPU Direct RDMA.
2. ACS is enabled (as seen in the lspci -vvv output).
Could these two be the main reasons for the bad performance?
Below is the output of running on all 8 GPUs of each node: log.txt
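For reference, a rough way to double-check both points on a node could be something like this (the GPU Direct RDMA module name depends on how the driver was packaged, nv_peer_mem for the legacy package vs. nvidia_peermem for newer drivers):

# 1. Is a GPU Direct RDMA kernel module loaded?
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'

# 2. Is ACS enabled on any PCI bridge? Lines with SrcValid+ mean ACS is active.
sudo lspci -vvv | grep ACSCtl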
Ah, right, that is likely the reason why NIC-GPU distance is shown as PHB in the NCCL topology: if GPU Direct RDMA is not available, we will have to go through the CPU for NIC-GPU transfers, hence we show PHB. I misread the topology; indeed the GPU and NIC are connected through a PCI switch (PCI/13000).
Also disabling ACS will probably help performance.
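If it helps, the usual way to turn ACS off (along the lines of what the NCCL troubleshooting documentation describes) is via setpci; a rough sketch, assuming a pciutils version that understands the ECAP_ACS name, and noting the change does not survive a reboot:

# Disable ACS on every PCI device that exposes the capability (run as root; not persistent across reboots)
for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
    # skip devices without an ACS capability
    setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1 || continue
    setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done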
So the lack of GPU Direct RDMA is the main reason for the poor performance, right?
I'm not very familiar with the topology, so correct me if I'm wrong: here NIC-GPU shows PXB, so enabling GDR is required. If NIC-GPU showed PIX instead, would GDR still be a must?
Also, should we disable IOMMU as well? We previously encountered hangs / slowness when running p2pBandwidthLatencyTest on a machine with 8 GPUs, no NVLink, and IOMMU enabled; after disabling IOMMU the test worked fine. This time we have NVLink and both IOMMU and ACS enabled; p2pBandwidthLatencyTest / nccl-tests work fine on a single machine, but are slow across nodes.
Yes the lack of GPU Direct RDMA is the reason for the low performance.
PXB or PIX means that GPU and NIC are connected through PCI Switches, in which case GPU Direct RDMA is a must.
Disabling IOMMU/ACS is important in general for PCI communication. It won't matter for NVLink, but it becomes important for the networking part because that goes through PCI.
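As a rough, platform-dependent sketch (not specific to this cluster), disabling the IOMMU usually means adding a kernel boot parameter and rebooting; afterwards a NCCL_DEBUG=INFO run should show GDRDMA on the transport lines:

# On Intel platforms: add intel_iommu=off (or iommu=pt) to GRUB_CMDLINE_LINUX in /etc/default/grub,
# then regenerate the config and reboot (paths are for CentOS 7 with legacy BIOS boot):
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# After the reboot, NCCL_DEBUG=INFO logs should contain lines like "via NET/IB/0/GDRDMA"
# when GPU Direct RDMA is actually being used.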
Thanks! After enabling GDR, the performance between 2 machines is much better; we get about 12 GB/s average bus bandwidth:
#
#                                                     out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                           (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum    402.9    2.60    4.88  4e-07    401.9    2.61    4.89  4e-07
     2097152        524288     float     sum    544.7    3.85    7.22  4e-07    499.3    4.20    7.88  4e-07
     4194304       1048576     float     sum    797.1    5.26    9.87  4e-07    795.3    5.27    9.89  4e-07
     8388608       2097152     float     sum   1385.5    6.05   11.35  4e-07   1383.2    6.06   11.37  4e-07
    16777216       4194304     float     sum   2241.0    7.49   14.04  4e-07   2228.7    7.53   14.11  4e-07
    33554432       8388608     float     sum   4186.7    8.01   15.03  4e-07   4196.3    8.00   14.99  4e-07
    67108864      16777216     float     sum   7382.4    9.09   17.04  4e-07   7385.8    9.09   17.04  4e-07
   134217728      33554432     float     sum    14310    9.38   17.59  4e-07    13969    9.61   18.02  4e-07
   268435456      67108864     float     sum    27716    9.69   18.16  4e-07    27177    9.88   18.52  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 12.8822
#
But when tested on 512 GPUs (64 nodes), the average bus bandwidth dropped to ~7 GB/s:
#                                                     out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                           (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum   1794.9    0.58    1.17  6e-06    784.0    1.34    2.67  6e-06
     2097152        524288     float     sum   1099.9    1.91    3.81  6e-06   1099.3    1.91    3.81  6e-06
     4194304       1048576     float     sum   1653.8    2.54    5.06  6e-06   1671.6    2.51    5.01  6e-06
     8388608       2097152     float     sum   2703.0    3.10    6.19  6e-06   2685.9    3.12    6.23  6e-06
    16777216       4194304     float     sum   4868.5    3.45    6.88  6e-06   4865.6    3.45    6.88  6e-06
    33554432       8388608     float     sum   8930.8    3.76    7.50  6e-06   8860.5    3.79    7.56  6e-06
    67108864      16777216     float     sum    17292    3.88    7.75  6e-06    17657    3.80    7.59  6e-06
   134217728      33554432     float     sum    27960    4.80    9.58  9e-06    27576    4.87    9.72  9e-06
   268435456      67108864     float     sum    42719    6.28   12.54  9e-06    42176    6.36   12.70  9e-06
# Out of bounds values : 0 OK
# Avg bus bandwidth : 6.81374
#
Does this sound normal to you?
The average over different sizes does not make a lot of sense, so you should run up to 4G and see what peak BW you can achieve. Also, on two nodes we have a special case which doesn't reflect the NIC bandwidth.
So, to really see what your network is capable of, you should run with NCCL_ALGO=RING and up to 4G, and take the maximum bandwidth.
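For example, the run could look roughly like this (the hostfile name and launcher flags are placeholders; the exact MPI options depend on your setup):

# Allreduce from 8 bytes up to 4 GB, Ring algorithm forced, 1 GPU per rank, 512 ranks
mpirun -np 512 -hostfile hosts \
    -x NCCL_ALGO=RING -x NCCL_DEBUG=WARN \
    ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1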
Hi, I have a similar problem to #307: two machines in a cluster connected with 200 Gb/sec InfiniBand. ibstatus:
ib_send_bw shows:
nvidia-smi topo -m shows:
but nccl-tests only achieves about 3 GB/s, which is far below the link bandwidth.
Attached is the detailed log of the command NCCL_NET_GDR_READ=1 NCCL_DEBUG_SUBSYS=GRAPH NCCL_DEBUG=INFO ./build/all_reduce_perf -g 1 -b 1M -e 64M -f 2: log.txt
Any ideas on what could be going wrong here?