NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Performance Degradation in Alltoall Operation with NCCL 2.19 and 2.20 #1316

Open GeofferyGeng opened 3 months ago

GeofferyGeng commented 3 months ago

We have observed a significant performance degradation in the alltoall operation when using NCCL versions 2.19 and 2.20 compared to version 2.18.

System Configuration: Max Nodes: 8; Machine Type: 8x H800 + 4x NVSwitch, 165 GB/s bwIntra; Network Configuration: 8 NICs (bonded, 2 x 200 Gb ports each); CUDA Version: 12.2; Driver Version: 535

Problem Details:

Command:
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_NCHANNELS_PER_NET_PEER=8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null
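
For completeness, a minimal sketch of how the NCCL versions can be switched between runs (the /opt/nccl-2.xx paths are placeholders, not our real layout, and the NCCL_*/UCX_* flags from the command above are omitted for brevity):

```bash
# Placeholder install paths; each directory holds one NCCL build.
# NCCL_DEBUG=VERSION makes every rank print the library version it loaded,
# confirming the intended build was picked up on all nodes.
for NCCL_LIB in /opt/nccl-2.18/lib /opt/nccl-2.19/lib /opt/nccl-2.20/lib; do
  /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --host $hosts \
    -x LD_LIBRARY_PATH=$NCCL_LIB:$LD_LIBRARY_PATH \
    -x NCCL_DEBUG=VERSION \
    /root/nccl-tests/build/alltoall_perf \
    --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200
done
```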

Looking forward to your response.

GeofferyGeng commented 3 months ago

More information: in the sendrecv_perf test, there is no difference between 2.18, 2.19, and 2.20.

sjeaugey commented 3 months ago

It could be that in 2.18, by default we'd use 32 channels for collectives, hence 32 channels for p2p. In 2.19 we have reduced the memory footprint and SM usage to something more reasonable, but that may have impacted the alltoall performance.

But first, I'd advise unsetting NCCL_NCHANNELS_PER_NET_PEER. Setting it to 8 can have a negative effect on alltoall operations. Can you run the comparison again without that variable set?

GeofferyGeng commented 3 months ago

Thank you for your reply.

We removed NCCL_NCHANNELS_PER_NET_PEER from the command and ran it on 8 nodes. However, the bandwidth still degraded by about 2 GB/s.
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null

As you said, "In 2.19 we have reduced the memory footprint and SM usage to something more reasonable". Are there any environment variables we can set to force NCCL to use more SMs and get higher performance? We tried NCCL_MIN_P2P_NCHANNELS=16/32 to use more SMs, but it didn't help.
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_MIN_P2P_NCHANNELS=16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null
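
For reference, this is the kind of variant we tried; pinning both the lower and upper bounds (via NCCL_MAX_P2P_NCHANNELS / NCCL_MAX_NCHANNELS) is an assumption on our part, the values are illustrative, and the other -x flags from the command above are omitted for brevity:

```bash
# Illustrative only: pin both the lower and upper bounds of the channel counts,
# since raising NCCL_MIN_P2P_NCHANNELS alone did not change alltoall bandwidth.
/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --host $hosts \
  -x NCCL_MIN_NCHANNELS=16 \
  -x NCCL_MIN_P2P_NCHANNELS=32 -x NCCL_MAX_P2P_NCHANNELS=32 \
  /root/nccl-tests/build/alltoall_perf \
  --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200
```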

sjeaugey commented 3 months ago

Sorry, I missed that the GPUs were H800. The number of channels would likely be limited due to the number of NVLinks, so my theory doesn't hold (and your experiments confirmed that).

Unfortunately, I don't see much else you could play with to optimize the alltoall performance. Given that you have 2 ports per NIC, I'm wondering whether NCCL_IB_QPS_PER_CONNECTION=2 could hurt, as it has to progress too many QPs at once.

On the other hand, given that you're setting that environment variable, I'm guessing the fabric is RoCE. Given the lack of good adaptive routing on most RoCE fabrics, optimizing performance on RoCE can be tricky, and any change in the algorithm/chunk size/timing can make performance go up or down, so it goes beyond NCCL.
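
As a quick check, a sweep along these lines would show whether the QP count is a factor (the values are purely illustrative, and the other -x flags from the original command are omitted for brevity; as noted in the reply below, bonded NICs may need at least 2 QPs to drive both ports):

```bash
# Illustrative sweep over the number of QPs per IB/RoCE connection.
# QPS=1 may not drive both ports of a bonded NIC (see the reply below).
for QPS in 1 2 4; do
  /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --host $hosts \
    -x NCCL_IB_QPS_PER_CONNECTION=$QPS \
    -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
    /root/nccl-tests/build/alltoall_perf \
    --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200
done
```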

GeofferyGeng commented 3 months ago

> Sorry, I missed that the GPUs were H800. The number of channels would likely be limited due to the number of NVLinks, so my theory doesn't hold (and your experiments confirmed that).
>
> Unfortunately, I don't see much else you could play with to optimize the alltoall performance. Given that you have 2 ports per NIC, I'm wondering whether NCCL_IB_QPS_PER_CONNECTION=2 could hurt, as it has to progress too many QPs at once.
>
> On the other hand, given that you're setting that environment variable, I'm guessing the fabric is RoCE. Given the lack of good adaptive routing on most RoCE fabrics, optimizing performance on RoCE can be tricky, and any change in the algorithm/chunk size/timing can make performance go up or down, so it goes beyond NCCL.

NCCL_IB_QPS_PER_CONNECTION does hurt performance: on 2.18, the more QPs we use per connection, the lower the performance. However, we use bonded NICs, so the minimum number of QPs per connection is 2. Considering that 2.18 reaches satisfactory bandwidth with that setting, 2 seems suitable.

We tested more combinations of variables and finally found that increasing NCCL_NCHANNELS_PER_NET_PEER to 32 brings back a bit of performance; the switch port utilization reached 85% in the end.
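
For anyone hitting the same regression, this is roughly the combination we ended up with (only the NCCL flags discussed in this thread are shown; the interface/HCA settings from the original command are unchanged and omitted for brevity):

```bash
# Settings that gave us the best alltoall bandwidth on 2.19/2.20 in this setup
# (switch port utilization reached ~85%): keep 2 QPs per connection for the
# bonded NICs and raise the per-peer channel count.
/usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --host $hosts \
  -x NCCL_IB_QPS_PER_CONNECTION=2 \
  -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
  -x NCCL_MIN_NCHANNELS=16 \
  -x NCCL_NCHANNELS_PER_NET_PEER=32 \
  /root/nccl-tests/build/alltoall_perf \
  --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200
```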

If you have any other suggestions, I would be very grateful.