Closed (ProHuper closed this 3 weeks ago)
$ nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   SYS   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  SYS   PIX   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   PIX   SYS   SYS   SYS   48-95,144-191  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   48-95,144-191  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   PIX   SYS   SYS   48-95,144-191  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   PIX   SYS   48-95,144-191  1              N/A
NIC0  PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS   SYS
NIC1  SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS
NIC2  SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS
NIC3  SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   X     SYS   SYS
NIC4  SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   X     SYS
NIC5  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_4
  NIC3: mlx5_5
  NIC4: mlx5_6
  NIC5: mlx5_bond_0
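To make the GPU-to-NIC proximity in this matrix easier to read, here is a minimal Python sketch that lists, for each GPU, the NICs it reaches at PIX level. The GPU rows and the NIC legend are copied by hand from the output above; the function name `nics_at_level` is hypothetical and only illustrates how to read the matrix. It says nothing about how NCCL itself selects NICs.

```python
# GPU rows of the `nvidia-smi topo -m` matrix above, with the affinity
# columns dropped. Copied by hand for this sketch.
TOPO = """\
      GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5
GPU0  X    NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX  SYS  SYS  SYS  SYS  SYS
GPU1  NV18 X    NV18 NV18 NV18 NV18 NV18 NV18 SYS  SYS  SYS  SYS  SYS  SYS
GPU2  NV18 NV18 X    NV18 NV18 NV18 NV18 NV18 SYS  PIX  SYS  SYS  SYS  SYS
GPU3  NV18 NV18 NV18 X    NV18 NV18 NV18 NV18 SYS  SYS  SYS  SYS  SYS  SYS
GPU4  NV18 NV18 NV18 NV18 X    NV18 NV18 NV18 SYS  SYS  PIX  SYS  SYS  SYS
GPU5  NV18 NV18 NV18 NV18 NV18 X    NV18 NV18 SYS  SYS  SYS  SYS  SYS  SYS
GPU6  NV18 NV18 NV18 NV18 NV18 NV18 X    NV18 SYS  SYS  SYS  PIX  SYS  SYS
GPU7  NV18 NV18 NV18 NV18 NV18 NV18 NV18 X    SYS  SYS  SYS  SYS  PIX  SYS
"""

# From the "NIC Legend" section of the same output.
NIC_NAMES = {
    "NIC0": "mlx5_0", "NIC1": "mlx5_1", "NIC2": "mlx5_4",
    "NIC3": "mlx5_5", "NIC4": "mlx5_6", "NIC5": "mlx5_bond_0",
}


def nics_at_level(topo_text, level="PIX"):
    """Map each row label to the NIC columns whose cell equals `level`."""
    lines = [ln.split() for ln in topo_text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    result = {}
    for row in rows:
        label, cells = row[0], row[1:]
        result[label] = [
            col for col, cell in zip(header, cells)
            if col.startswith("NIC") and cell == level
        ]
    return result


if __name__ == "__main__":
    for gpu, nics in nics_at_level(TOPO).items():
        names = [NIC_NAMES[n] for n in nics] or ["none"]
        print(gpu, "->", ", ".join(names))
```

On this matrix, GPU0, GPU2, GPU4, GPU6 and GPU7 each have exactly one PIX NIC (mlx5_0, mlx5_1, mlx5_4, mlx5_5 and mlx5_6 respectively), while GPU1, GPU3 and GPU5 reach every NIC only over SYS.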
Two-node all-reduce test with 8 H100 GPUs per node, using 4 NICs: busbw is about 309 GB/s, but the theoretical busbw should be 360 GB/s.
$ mpirun --allow-run-as-root --hostfile hosts.txt --oversubscribe -x NCCL_ALGO=Tree -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_4,mlx5_5 -np 16 ./all_reduce_perf -b 2M -e 16G -f 2 -n 10 -g 1 -w 10
#
#                                                             out-of-place                      in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     2097152        524288     float     sum      -1    118.1   17.75   33.29      0    92.65   22.64   42.44      0
     4194304       1048576     float     sum      -1    104.8   40.01   75.03      0    105.4   39.78   74.59      0
     8388608       2097152     float     sum      -1    140.7   59.60  111.75      0    142.9   58.72  110.10      0
    16777216       4194304     float     sum      -1    231.9   72.33  135.62      0    237.8   70.56  132.29      0
    33554432       8388608     float     sum      -1    412.3   81.39  152.60      0    417.3   80.40  150.75      0
    67108864      16777216     float     sum      -1    663.5  101.14  189.64      0    672.7   99.76  187.05      0
   134217728      33554432     float     sum      -1   1168.2  114.89  215.42      0   1311.3  102.35  191.91      0
   268435456      67108864     float     sum      -1   2130.3  126.01  236.27      0   2130.6  125.99  236.23      0
   536870912     134217728     float     sum      -1   3611.0  148.68  278.77      0   3603.2  149.00  279.37      0
  1073741824     268435456     float     sum      -1   6793.3  158.06  296.36      0   6781.1  158.34  296.89      0
  2147483648     536870912     float     sum      -1    13184  162.89  305.41      0    13129  163.56  306.68      0
  4294967296    1073741824     float     sum      -1    25986  165.28  309.90      0    25893  165.87  311.01      0
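For readers comparing the algbw and busbw columns, the following Python sketch recomputes both from a row of the table. It assumes the all-reduce convention documented by nccl-tests, where busbw = algbw * 2(n-1)/n for n ranks; the function name `allreduce_busbw` is made up for this example, and the 360 GB/s target is simply the figure quoted in this report.

```python
def allreduce_busbw(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for an all-reduce measurement.

    Assumes the nccl-tests convention: algbw = size / time and
    busbw = algbw * 2 * (nranks - 1) / nranks.
    """
    # bytes per microsecond equals MB/s; divide by 1e3 to get GB/s.
    algbw = size_bytes / time_us / 1e3
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


if __name__ == "__main__":
    # Last out-of-place row of the 16-rank run above.
    algbw, busbw = allreduce_busbw(4294967296, 25986, nranks=16)
    print(f"algbw = {algbw:.2f} GB/s, busbw = {busbw:.2f} GB/s")

    # Fraction of the 360 GB/s target quoted in this report (about 0.86).
    print(f"fraction of target = {busbw / 360:.2f}")
```

With 16 ranks the factor is 2 * 15 / 16 = 1.875, which is why busbw is larger than algbw here. With only 2 ranks the factor is 1, so busbw equals algbw.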
Two-node all-reduce test with 1 H100 GPU per node, using 4 NICs: busbw is about 50 GB/s, but the theoretical busbw should be 200 GB/s.
$ mpirun --allow-run-as-root --hostfile hosts.txt --oversubscribe -x NCCL_ALGO=Tree -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_4,mlx5_5 -np 2 ./all_reduce_perf -b 2M -e 16G -f 2 -n 10 -g 1 -w 10
#
#                                                             out-of-place                      in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     2097152        524288     float     sum      -1    113.2   18.53   18.53      0    93.35   22.46   22.46      0
     4194304       1048576     float     sum      -1    154.4   27.16   27.16      0    153.3   27.37   27.37      0
     8388608       2097152     float     sum      -1    231.4   36.24   36.24      0    227.8   36.83   36.83      0
    16777216       4194304     float     sum      -1    420.5   39.90   39.90      0    419.9   39.95   39.95      0
    33554432       8388608     float     sum      -1    812.3   41.31   41.31      0    808.2   41.52   41.52      0
    67108864      16777216     float     sum      -1   1545.1   43.43   43.43      0   1561.3   42.98   42.98      0
   134217728      33554432     float     sum      -1   2973.1   45.14   45.14      0   2970.4   45.19   45.19      0
   268435456      67108864     float     sum      -1   5715.9   46.96   46.96      0   5676.1   47.29   47.29      0
   536870912     134217728     float     sum      -1    11146   48.17   48.17      0    11156   48.12   48.12      0
  1073741824     268435456     float     sum      -1    22062   48.67   48.67      0    21997   48.81   48.81      0
  2147483648     536870912     float     sum      -1    43733   49.10   49.10      0    43697   49.15   49.15      0
  4294967296    1073741824     float     sum      -1    87278   49.21   49.21      0    87197   49.26   49.26      0
  8589934592    2147483648     float     sum      -1   174121   49.33   49.33      0   174234   49.30   49.30      0
 17179869184    4294967296     float     sum      -1   347919   49.38   49.38      0   347833   49.39   49.39      0

The NCCL INFO log shows that GDR used only 1 NIC:
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48049 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:38570:38582 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:38570:38582 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:48036:48049 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:48036:48051 [0] NCCL INFO Connected all rings
qh100-gpu20:38570:38584 [0] NCCL INFO Connected all rings
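One way to check the "only 1 NIC" observation without reading every line is to collect the network device index from each `via NET/...` entry. The Python sketch below does that for a few lines copied from the log above. It assumes the number after the plugin name (the `0` in `NET/IBext/0/GDRDMA`) is NCCL's network device index; the regular expression and the function name `net_devices_per_host` are written for this example only.

```python
import re
from collections import defaultdict

# A few lines copied from the log above (one pair per host is enough here).
LOG = """\
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Connected all rings
"""

# Assumed line shape: "<host>:<pid>:<tid> ... Channel NN/M : ... [send|receive]
# via NET/<plugin>/<device index>[/GDRDMA]".
PATTERN = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ .*Channel (?P<chan>\d+)/\d+ : .*"
    r"\[(?P<dir>send|receive)\] via NET/(?P<plugin>[^/\s]+)/(?P<dev>\d+)"
    r"(?P<gdr>/GDRDMA)?"
)


def net_devices_per_host(log_text):
    """Return {hostname: set of network device indices seen in channel lines}."""
    devices = defaultdict(set)
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            devices[match["host"]].add(int(match["dev"]))
    return dict(devices)


if __name__ == "__main__":
    for host, devs in sorted(net_devices_per_host(LOG).items()):
        print(host, "uses network device indices", sorted(devs))
```

On the lines shown, both hosts report only device index 0 across their channels, which matches the observation that a single NIC carries all traffic.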