NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Why NCCL doesn't use multiple ports in one nic for allreduce? #1010

Closed holmes313 closed 11 months ago

holmes313 commented 12 months ago

I have 2 servers, and each server has one GPU and one CX NIC, but the NIC has 2 ports. I expected NCCL to use both ports when running allreduce_perf, but the log shows NCCL only uses one NIC port for inter-node communication. How can we use multiple ports per GPU in this scenario? Thanks.

nvidia-smi topo -m output on server 1:

```
        GPU0  GPU1  NIC0  NIC1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0     X    SYS   NODE  NODE  0-27,56-83    0              N/A
GPU1    SYS    X    SYS   SYS   28-55,84-111  1              N/A
NIC0    NODE  SYS    X    PIX
NIC1    NODE  SYS   PIX    X
```

nvidia-smi topo -m output on server 2:

```
        GPU0  GPU1  NIC0  NIC1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0     X    SYS   NODE  NODE  0-27,56-83    0              N/A
GPU1    SYS    X    SYS   SYS   28-55,84-111  1              N/A
NIC0    NODE  SYS    X    PIX
NIC1    NODE  SYS   PIX    X
```

nccl-tests command:

```
mpirun -np 2 --host srvD1,srvD2 -genv NCCL_DEBUG=INFO ./build/alltoall_perf -b 128M -e 4G -f 2 -g 1
```

Part of the log output:

```
PowerEdge-R750xa-D1:100867:100936 [0] NCCL INFO Channel 00/0 : 1[17000] -> 0[17000] [receive] via NET/IB/0
PowerEdge-R750xa-D2:47569:47599 [0] NCCL INFO Channel 00/0 : 0[17000] -> 1[17000] [receive] via NET/IB/0
PowerEdge-R750xa-D1:100867:100936 [0] NCCL INFO Channel 01/0 : 1[17000] -> 0[17000] [receive] via NET/IB/0
PowerEdge-R750xa-D2:47569:47599 [0] NCCL INFO Channel 01/0 : 0[17000] -> 1[17000] [receive] via NET/IB/0
PowerEdge-R750xa-D1:100867:100936 [0] NCCL INFO Channel 00/0 : 0[17000] -> 1[17000] [send] via NET/IB/0
PowerEdge-R750xa-D2:47569:47599 [0] NCCL INFO Channel 00/0 : 1[17000] -> 0[17000] [send] via NET/IB/0
PowerEdge-R750xa-D1:100867:100936 [0] NCCL INFO Channel 01/0 : 0[17000] -> 1[17000] [send] via NET/IB/0
PowerEdge-R750xa-D2:47569:47599 [0] NCCL INFO Channel 01/0 : 1[17000] -> 0[17000] [send] via NET/IB/0
```

sjeaugey commented 12 months ago

What is the speed of each port, and what is the speed of the PCI link to the NIC? If NCCL finds that the PCI speed is too low and that using 2 ports won't improve performance, it won't use both ports: doing so would provide no performance benefit while consuming more GPU resources.

holmes313 commented 11 months ago

Hello sjeaugey, thanks for your reply. I am using one CX6 with two 200 Gb/s ports and PCIe 4.0. Will NCCL detect that 2 × 200 Gb/s exceeds the PCIe 4.0 bandwidth, and therefore only use one port for the transfer? And if I were using PCIe 5.0, NCCL would use both ports and split the message across them, right?

sjeaugey commented 11 months ago

Indeed, if the PCI bandwidth is Gen4 (considered to be 24 GB/s) then NCCL won't attempt to use both ports. If it is Gen5 (48 GB/s), NCCL will try to create more flows to use all ports.
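The arithmetic behind this answer can be sketched as follows. This is only an illustrative back-of-the-envelope check, not NCCL's actual topology-detection code; the function name and loop are hypothetical, but the figures (24 GB/s for Gen4, 48 GB/s for Gen5, 200 Gb/s ≈ 25 GB/s per port) come from the thread.

```python
# One 200 Gb/s port delivers ~25 GB/s, which already saturates a
# PCIe Gen4 x16 link (~24 GB/s), so a second port adds nothing there.

def ports_worth_using(pcie_gbytes: float,
                      port_gbytes: float = 200 / 8,
                      num_ports: int = 2) -> int:
    """Illustrative: how many NIC ports add bandwidth before the
    PCIe link becomes the bottleneck."""
    used = 0
    usable = 0.0
    for _ in range(num_ports):
        if usable >= pcie_gbytes:  # PCIe already saturated
            break
        usable += port_gbytes
        used += 1
    return used

print(ports_worth_using(24.0))  # Gen4: one 25 GB/s port covers 24 GB/s -> 1
print(ports_worth_using(48.0))  # Gen5: both ports needed to approach 48 GB/s -> 2
```

Under PCIe 5.0 the combined 50 GB/s of the two ports roughly matches the 48 GB/s link, which is why NCCL then creates extra flows to spread traffic over both ports.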