weberxie opened this issue 3 years ago
Upgraded the NCCL version to 2.8.3 and re-ran the tests; the result is:
Updated the test results with GPUDirect RDMA enabled:
So, the last problem is: why was the performance so bad when GPUDirect RDMA was disabled?
Without GPUDirect RDMA, data from/to the GPU has to go through system memory on its way to/from the NIC. Hence the PCIe uplink and downlink to/from system memory would each have to be traversed twice, lowering the performance.
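A quick way to confirm which path was actually taken is to grep the NCCL debug log (a sketch, assuming the log was produced with NCCL_DEBUG=INFO as in the command further below; the exact log strings vary between NCCL versions):

$ grep -E "GDRDMA|GPU Direct RDMA" nccl.log.rank0.txt
$ lsmod | grep -E "nvidia_peermem|nv_peer_mem"

Transport-setup lines mentioning GDRDMA indicate that GPUDirect RDMA was used for that connection; if no peer-memory module is loaded, NCCL stages transfers through host memory.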
[0] NCCL INFO CPU/0 (1/2/-1)
[0] NCCL INFO + SYS[5000.0] - CPU/1
[0] NCCL INFO + PCI[12.0] - PCI/1000
[0] NCCL INFO + PCI[12.0] - PCI/6000
[0] NCCL INFO + PCI[12.0] - PCI/C000
[0] NCCL INFO + PCI[12.0] - GPU/E000 (0)
[0] NCCL INFO + NVL[252.0] - NVS/0
[0] NCCL INFO + PCI[12.0] - PCI/F000
[0] NCCL INFO + PCI[12.0] - PCI/11000
[0] NCCL INFO + PCI[12.0] - GPU/13000 (1)
[0] NCCL INFO + NVL[252.0] - NVS/0
[0] NCCL INFO + PCI[12.0] - PCI/3000
[0] NCCL INFO + PCI[12.0] - NIC/5000
[0] NCCL INFO **+ NET[25.0] - NET/0 (603cb0003723f04/1/25.000000)**
[0] NCCL INFO + PCI[12.0] - PCI/3D000
[0] NCCL INFO + PCI[12.0] - PCI/45000
[0] NCCL INFO + PCI[12.0] - PCI/48000
[0] NCCL INFO + PCI[12.0] - GPU/4A000 (2)
[0] NCCL INFO + NVL[252.0] - NVS/0
[0] NCCL INFO + PCI[12.0] - PCI/4C000
[0] NCCL INFO + PCI[12.0] - PCI/4E000
[0] NCCL INFO + PCI[12.0] - GPU/50000 (3)
[0] NCCL INFO + NVL[252.0] - NVS/0
[0] NCCL INFO CPU/1 (1/2/-1)
[0] NCCL INFO + SYS[5000.0] - CPU/0
[0] NCCL INFO + PCI[12.0] - PCI/7D000
[0] NCCL INFO + PCI[12.0] - PCI/8F000
[0] NCCL INFO + PCI[12.0] - PCI/91000
[0] NCCL INFO + PCI[12.0] - GPU/93000 (4)
[0] NCCL INFO + NVL[252.0] - NVS/0
[0] NCCL INFO + PCI[12.0] - PCI/95000
[0] NCCL INFO + PCI[12.0] - PCI/97000
[0] NCCL INFO + PCI[12.0] - GPU/99000 (5)
[0] NCCL INFO + NVL[252.0] - NVS/0
[0] NCCL INFO + PCI[12.0] - PCI/C5000
[0] NCCL INFO + PCI[12.0] - PCI/C7000
[0] NCCL INFO + PCI[12.0] - PCI/C9000
[0] NCCL INFO + PCI[12.0] - GPU/CB000 (6)
[0] NCCL INFO + NVL[252.0] - NVS/0
[0] NCCL INFO + PCI[12.0] - PCI/CC000
[0] NCCL INFO + PCI[12.0] - PCI/CE000
[0] NCCL INFO + PCI[12.0] - GPU/D0000 (7)
[0] NCCL INFO + NVL[252.0] - NVS/0
[0] NCCL INFO ==========================================
That essentially shows that you are bottlenecked by the PCIe Gen3 link connecting your CPU and NIC. However, I would expect the perf to be better than what you reported above.
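As a back-of-envelope check (nominal numbers I am assuming, not measurements from your log): the topology above reports PCI[12.0], i.e. roughly 12 GB/s per PCIe Gen3 x16 hop, while NVLink shows 252 GB/s and the 200Gb NIC works out to 25 GB/s, so the slowest hop on the inter-node path is the Gen3 link:

# assumed nominal values: NVLink 252 GB/s, PCIe Gen3 x16 ~12 GB/s, 200 Gb/s NIC = 200/8 GB/s
$ echo "NIC: $((200 / 8)) GB/s, PCIe Gen3 x16: ~12 GB/s -> expect a ~12 GB/s bus bandwidth ceiling"

In other words, the CPU-to-NIC PCIe link, not the NIC itself, sets the ceiling.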
any update or solution to improve the performance on A100 with NVLink?
> any update or solution to improve the performance on A100 with NVLink?
What performance issue? With GPUDirect RDMA enabled the above performance looks good to me.
Check whether the GPUDirect RDMA module is loaded; the command is:
$ lsmod | grep peer
nvidia_peermem 16384 0
If you don't have nvidia_peermem you can load it with this command:
$ modprobe nvidia_peermem
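To have the module loaded automatically after a reboot (a sketch assuming a systemd-based distro; adjust if your system manages module auto-loading differently):

$ echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia_peermem.conf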
Environment:
2 nodes with A100 GPUs, intra-connected with PCIe Gen4 and NVLink, and inter-connected with a 200Gb RoCE network.
the NCCL version is: 2.7.8
the CUDA version is: 11.0
the result of ib_write_bw is about 180 Gb/s,
the GPU topology is:
the result of the ibstatus command is:
the nccl-tests command is:
mpirun -np 16 --allow-run-as-root -bind-to none -map-by slot --mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 -x NCCL_SOCKET_IFNAME=^lo,docker0 -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=ALL -x NCCL_IB_HCA=mlx5_1:1,^mlx5_0 --hostfile /data1/hostfile.txt --mca btl openib,self,vader /data1/nccl-tests/build/all_reduce_perf -b 32 -e 128M -f 2
the result of nccl-tests on 2 nodes is:
The NCCL log is attached: nccl.log.rank0.txt
however, the result of nccl-tests on 1 node is:
So, could anyone help me figure out why the performance is so bad? Thanks in advance!