NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Got very low performance of nccl-tests on A100 with NVLink over 200Gb RoCE network #62


weberxie commented 3 years ago

Environment:

2 nodes with A100 GPUs; within each node the GPUs are connected via PCIe Gen 4 and NVLink, and the nodes are interconnected over a 200Gb RoCE network.

the NCCL version is: 2.7.8

the CUDA version is: 11.0

the result of ib_write_bw is about 180 Gb/s,

the GPU topology is: [screenshot: Screen Shot 2020-12-23 at 7 12 03 PM]

the result of ibstatus command is:

[screenshot: Screen Shot 2020-12-23 at 7 11 25 PM]

the nccl-tests command is:

mpirun -np 16 --allow-run-as-root -bind-to none -map-by slot \
    --mca pml ob1 \
    --mca btl_vader_single_copy_mechanism none \
    --mca btl_openib_cpc_include rdmacm \
    --mca btl_openib_rroce_enable 1 \
    --mca btl_tcp_if_exclude lo,docker0 \
    --mca orte_base_help_aggregate 0 \
    --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 \
    -x NCCL_SOCKET_IFNAME=^lo,docker0 \
    -x NCCL_IB_DISABLE=0 \
    -x LD_LIBRARY_PATH \
    -x NCCL_DEBUG=INFO \
    -x NCCL_DEBUG_SUBSYS=ALL \
    -x NCCL_IB_HCA=mlx5_1:1,^mlx5_0 \
    --hostfile /data1/hostfile.txt \
    --mca btl openib,self,vader \
    /data1/nccl-tests/build/all_reduce_perf -b 32 -e 128M -f 2

the result of nccl-tests on 2 nodes is:

[screenshot: Screen Shot 2020-12-23 at 7 15 29 PM]

The NCCL log is attached: nccl.log.rank0.txt

However, the result of nccl-tests on 1 node is:

[screenshot: Screen Shot 2020-12-23 at 7 23 38 PM]

So, could anyone help me figure out why the performance is so bad? Thanks in advance!

weberxie commented 3 years ago

Upgraded the NCCL version to 2.8.3 and re-ran the tests; the result is:

[screenshot: Screen Shot 2020-12-23 at 8 50 18 PM]

weberxie commented 3 years ago

Updated test results with GPUDirect RDMA enabled:

[screenshot: Screen Shot 2020-12-24 at 7 23 09 PM]

So, the last question is: why was the performance so bad when GPUDirect RDMA was disabled?

kwen2501 commented 3 years ago

Without GPUDirect RDMA, data moving between the GPU and the NIC has to be staged through system memory. The PCIe uplink and downlink to/from system memory therefore each have to be traversed twice, lowering the performance.
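A back-of-the-envelope sketch of that effect (the 12 GB/s figure is illustrative, roughly matching the PCI[12.0] hops reported in NCCL topology logs; the numbers are assumptions, not measurements):

```shell
# Without GPUDirect RDMA each byte crosses the host PCIe path twice
# (NIC -> system memory, then system memory -> GPU), so the usable
# bandwidth on that link is roughly halved.
link=12.0   # GB/s per direction, illustrative PCIe Gen3 x16 figure
awk -v bw="$link" 'BEGIN { printf "direct: %.1f GB/s, staged: %.1f GB/s\n", bw, bw / 2 }'
```

So even before protocol overheads, staging through host memory caps the per-NIC throughput at about half the PCIe link rate.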

seelam commented 3 years ago
[0] NCCL INFO CPU/0 (1/2/-1)
[0] NCCL INFO + SYS[5000.0] - CPU/1
[0] NCCL INFO + PCI[12.0] - PCI/1000
[0] NCCL INFO               + PCI[12.0] - PCI/6000
[0] NCCL INFO                             + PCI[12.0] - PCI/C000
[0] NCCL INFO                                           + PCI[12.0] - GPU/E000 (0)
[0] NCCL INFO                                                         + NVL[252.0] - NVS/0
[0] NCCL INFO               + PCI[12.0] - PCI/F000
[0] NCCL INFO                             + PCI[12.0] - PCI/11000
[0] NCCL INFO                                           + PCI[12.0] - GPU/13000 (1)
[0] NCCL INFO                                                         + NVL[252.0] - NVS/0
[0] NCCL INFO               + PCI[12.0] - PCI/3000
[0] NCCL INFO                             + PCI[12.0] - NIC/5000
[0] NCCL INFO                                           **+ NET[25.0] - NET/0 (603cb0003723f04/1/25.000000)**
[0] NCCL INFO + PCI[12.0] - PCI/3D000
[0] NCCL INFO               + PCI[12.0] - PCI/45000
[0] NCCL INFO                             + PCI[12.0] - PCI/48000
[0] NCCL INFO                                           + PCI[12.0] - GPU/4A000 (2)
[0] NCCL INFO                                                         + NVL[252.0] - NVS/0
[0] NCCL INFO               + PCI[12.0] - PCI/4C000
[0] NCCL INFO                             + PCI[12.0] - PCI/4E000
[0] NCCL INFO                                           + PCI[12.0] - GPU/50000 (3)
[0] NCCL INFO                                                         + NVL[252.0] - NVS/0
[0] NCCL INFO CPU/1 (1/2/-1)
[0] NCCL INFO + SYS[5000.0] - CPU/0
[0] NCCL INFO + PCI[12.0] - PCI/7D000
[0] NCCL INFO               + PCI[12.0] - PCI/8F000
[0] NCCL INFO                             + PCI[12.0] - PCI/91000
[0] NCCL INFO                                           + PCI[12.0] - GPU/93000 (4)
[0] NCCL INFO                                                         + NVL[252.0] - NVS/0
[0] NCCL INFO               + PCI[12.0] - PCI/95000
[0] NCCL INFO                             + PCI[12.0] - PCI/97000
[0] NCCL INFO                                           + PCI[12.0] - GPU/99000 (5)
[0] NCCL INFO                                                         + NVL[252.0] - NVS/0
[0] NCCL INFO + PCI[12.0] - PCI/C5000
[0] NCCL INFO               + PCI[12.0] - PCI/C7000
[0] NCCL INFO                             + PCI[12.0] - PCI/C9000
[0] NCCL INFO                                           + PCI[12.0] - GPU/CB000 (6)
[0] NCCL INFO                                                         + NVL[252.0] - NVS/0
[0] NCCL INFO               + PCI[12.0] - PCI/CC000
[0] NCCL INFO                             + PCI[12.0] - PCI/CE000
[0] NCCL INFO                                           + PCI[12.0] - GPU/D0000 (7)
[0] NCCL INFO                                                         + NVL[252.0] - NVS/0
[0] NCCL INFO ==========================================

That essentially shows that you are bottlenecked by the PCIe Gen3 link connecting your CPU and NIC. However, I would expect the performance to be better than what you reported above.
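The bracketed numbers in the topology log above are link bandwidths in GB/s, so the bottleneck on a path can be read off mechanically. A small sketch (the sample lines are copied from the log; the temp-file name is arbitrary):

```shell
# Extract the bracketed bandwidths (GB/s) from NCCL topology lines and
# print the slowest hop, i.e. the bottleneck on that path.
cat <<'EOF' > /tmp/topo_path.txt
+ PCI[12.0] - PCI/3000
+ PCI[12.0] - NIC/5000
+ NET[25.0] - NET/0
EOF
grep -o '\[[0-9.]*\]' /tmp/topo_path.txt | tr -d '[]' | sort -n | head -n 1
```

Here the 12.0 GB/s PCIe hops, not the 25.0 GB/s (200Gb) NET link, limit the GPU-to-NIC path.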

tonysy commented 3 years ago

Any update or solution to improve the performance on A100 with NVLink?

AddyLaddy commented 3 years ago

Any update or solution to improve the performance on A100 with NVLink?

What performance issue? With GPUDirect RDMA enabled the above performance looks good to me.

corrtia commented 2 years ago

Check whether the GPUDirect RDMA module is loaded; the command is:

$ lsmod | grep peer
nvidia_peermem         16384  0

If you don't have nvidia_peermem, you can load it with this command:

$ modprobe nvidia_peermem