NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL test, Tree is slower than Ring #1368

Open wangdaw2023 opened 1 month ago

wangdaw2023 commented 1 month ago

We have a GPU cluster whose nodes each have 8 H100 GPUs and 4x400G RoCE NICs. I ran nccl-tests on this cluster using the same set of nodes, but I find the Tree bus bandwidth (150 GB/s) is lower than the Ring bus bandwidth (190 GB/s). From my understanding, the NCCL Ring and Tree bus bandwidths should be the same. Any suggestions?

NCCL test with Ring:

/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root -np 40 -H 100.64.40.36:8,100.64.40.39:8,100.64.40.41:8,100.64.40.42:8,100.64.40.43:8 --timestamp-output -x NCCL_SOCKET_IFNAME=bond0 -x NCCL_IB_HCA=mlx5_10:1,mlx5_11:1,mlx5_12:1,mlx5_13:1 -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=version -x NCCL_ALGO=Ring -x NCCL_PXN_DISABLE=0 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/root/nccl_apps/nccl220/lib:/usr/mpi/gcc/openmpi-4.1.7a1/lib /root/nccl_apps/nccl-test/all_reduce_perf -b 1M -e 1G -g 1 -f 2

Thu Jul 18 15:41:05 2024:
#        size         count    type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#         (B)    (elements)                              (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
      1048576        262144   float     sum      -1    354.9    2.95    5.76       0    345.7    3.03    5.92       0
      2097152        524288   float     sum      -1    353.7    5.93   11.56       0    350.4    5.98   11.67       0
      4194304       1048576   float     sum      -1    359.7   11.66   22.74       0    361.3   11.61   22.64       0
      8388608       2097152   float     sum      -1    371.5   22.58   44.03       0    373.3   22.47   43.81       0
     16777216       4194304   float     sum      -1    391.6   42.84   83.54       0    390.7   42.94   83.73       0
     33554432       8388608   float     sum      -1    450.6   74.46  145.20       0    451.6   74.29  144.87       0
     67108864      16777216   float     sum      -1    791.7   84.76  165.29       0    769.2   87.25  170.13       0
    134217728      33554432   float     sum      -1   1467.0   91.49  178.41       0   1467.4   91.47  178.36       0
    268435456      67108864   float     sum      -1   2892.0   92.82  181.00       0   2891.7   92.83  181.02       0
    536870912     134217728   float     sum      -1   5452.0   98.47  192.02       0   5450.0   98.51  192.09       0
   1073741824     268435456   float     sum      -1    10747   99.91  194.82       0    10868   98.80  192.66       0
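For reference, a short sketch of how nccl-tests derives the busbw column from algbw for AllReduce (per its PERFORMANCE.md): the same 2*(n-1)/n normalization factor is applied regardless of algorithm, so Ring and Tree busbw numbers are directly comparable. Here n = 40 ranks (5 nodes x 8 GPUs):

```python
def allreduce_busbw(algbw_gbs: float, nranks: int) -> float:
    """busbw = algbw * 2*(n-1)/n, the AllReduce correction factor
    used by nccl-tests (same factor for Ring and Tree)."""
    return algbw_gbs * 2 * (nranks - 1) / nranks

# Largest Ring measurement above: algbw 99.91 GB/s at 1 GiB, 40 ranks.
print(round(allreduce_busbw(99.91, 40), 2))  # -> 194.82, matching the reported busbw
```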

NCCL test with Tree:

/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root -np 40 -H 100.64.40.36:8,100.64.40.39:8,100.64.40.41:8,100.64.40.42:8,100.64.40.43:8 --timestamp-output -x NCCL_SOCKET_IFNAME=bond0 -x NCCL_IB_HCA=mlx5_10:1,mlx5_11:1,mlx5_12:1,mlx5_13:1 -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=version -x NCCL_ALGO=Tree -x NCCL_PXN_DISABLE=0 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/root/nccl_apps/nccl220/lib:/usr/mpi/gcc/openmpi-4.1.7a1/lib /root/nccl_apps/nccl-test/all_reduce_perf -b 1M -e 1G -g 1 -f 2

Thu Jul 18 15:40:10 2024:
#        size         count    type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#         (B)    (elements)                              (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
      1048576        262144   float     sum      -1    120.2    8.72   17.01       0    119.2    8.80   17.16       0
      2097152        524288   float     sum      -1    140.3   14.95   29.14       0    139.6   15.02   29.29       0
      4194304       1048576   float     sum      -1    207.6   20.20   39.39       0    208.7   20.10   39.20       0
      8388608       2097152   float     sum      -1    240.0   34.96   68.17       0    233.7   35.89   69.98       0
     16777216       4194304   float     sum      -1    339.5   49.41   96.35       0    342.8   48.95   95.45       0
     33554432       8388608   float     sum      -1    549.9   61.02  118.99       0    548.4   61.18  119.30       0
     67108864      16777216   float     sum      -1    975.2   68.81  134.18       0    969.0   69.26  135.05       0
    134217728      33554432   float     sum      -1   1810.1   74.15  144.59       0   1876.5   71.53  139.48       0
    268435456      67108864   float     sum      -1   3744.0   71.70  139.81       0   3844.9   69.82  136.14       0
    536870912     134217728   float     sum      -1   7649.7   70.18  136.85       0   7584.8   70.78  138.03       0
   1073741824     268435456   float     sum      -1    13972   76.85  149.85       0    13953   76.95  150.06       0

echobinarybytes commented 1 month ago

Some hints here: https://github.com/NVIDIA/nccl/issues/471#issuecomment-789088941

wangdaw2023 commented 1 month ago

From that post: "a flat ring (high latency, best bandwidth) and a tree (low latency, ok bandwidth)."

So it makes sense that Tree is slower than Ring here, since Ring achieves the best bandwidth.

In my understanding, then, training LLMs at scale with Ring gives the best bandwidth for exchanging gradients/activations between ranks, which suits large data buckets, while Tree is preferred when latency matters and the data buckets are small.
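That trade-off can be sketched with a toy alpha-beta cost model. This is NOT NCCL's actual tuner; the alpha (per-step latency) and beta (per-byte time) constants below are made-up illustrative values, and the step counts are the textbook ones for a ring AllReduce and a binary-tree reduce+broadcast:

```python
import math

ALPHA = 10e-6     # assumed per-hop latency, seconds (illustrative)
BETA = 1 / 190e9  # assumed per-byte time, ~190 GB/s link (illustrative)

def ring_allreduce_time(size_bytes: float, n: int) -> float:
    # Ring: 2*(n-1) latency steps; each rank sends 2*(n-1)/n of the buffer.
    return 2 * (n - 1) * ALPHA + 2 * (n - 1) / n * size_bytes * BETA

def tree_allreduce_time(size_bytes: float, n: int) -> float:
    # Binary tree reduce + broadcast: ~2*log2(n) latency steps,
    # moving the full buffer once up and once down.
    return 2 * math.log2(n) * ALPHA + 2 * size_bytes * BETA

for size in (1 << 20, 1 << 30):  # 1 MiB and 1 GiB, 40 ranks
    r = ring_allreduce_time(size, 40)
    t = tree_allreduce_time(size, 40)
    print(f"{size:>11d} B: ring {r*1e3:.2f} ms, tree {t*1e3:.2f} ms")
```

In this model the latency term dominates at 1 MiB (Tree wins, as in the 120 us vs 355 us measurements above), while at 1 GiB the bandwidth terms are nearly equal, which is why ideally Ring and Tree busbw converge at large sizes; the measured Tree shortfall at large sizes is what the question is about.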