NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

The communication speed of ib card is not as expected #685

Open wangshengnan511799 opened 2 years ago

wangshengnan511799 commented 2 years ago

We are using A100 servers with IB cards for communication. The bandwidth of each IB card is 7 GB/s, but according to the ibdump statistics we only see about 1 GB/s.

sjeaugey commented 2 years ago

Hi, can you give more information as to the NCCL performance you are getting, and the topology of your system: GPU type (A100?), NIC type (CX6?), PCI speed (Gen4x16?), whether there is NVLink or not, etc ...

Or just the output of the NCCL perf test with NCCL_DEBUG=INFO and the node topology you can dump setting NCCL_TOPO_DUMP_FILE=system.txt (then attach system.txt here alongside the output log).
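A minimal sketch of such a run, assuming the nccl-tests build and a placeholder host file (names below are not from this thread):

```
# Hedged example: collect the NCCL log and the node topology dump while
# running the perf test. Host file and binary path are placeholders.
mpirun -np 16 --hostfile hosts \
  -x NCCL_DEBUG=INFO \
  -x NCCL_TOPO_DUMP_FILE=system.txt \
  -x PATH -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 4G -f 2
```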

Thanks!

wangshengnan511799 commented 2 years ago

Thanks for your response. We are using an A100 server with 8 GPU cards, and the server also has 8 IB network cards. We found that 1) at most four IB cards can work simultaneously; 2) the bandwidth of each IB card is 56 Gb/s (7 GB/s), but when I export only one IB card and use ibdump, I see only about 1 GB/s of communication.

wangshengnan511799 commented 2 years ago

The following are the system.txt files. system_1.txt exports only one IB card (mlx5_15), and system_2.txt exports all 8 IB cards. system_2.txt system_1.txt

wangshengnan511799 commented 2 years ago

system_2.txt exports all 8 IB cards, but the ibdump results show that only 4 of them carry data, even though all 8 cards are active. system_1.txt exports only one IB card, but the communication speed we see is slower than expected.

twoflypig commented 2 years ago

Hello, the NCCL INFO log is attached below:

nccl_info.log

sjeaugey commented 2 years ago

Sorry I wasn't precise enough. For the NCCL log and topology, I'd need you to run all_reduce_perf on all 16 GPUs. Running on a single GPU doesn't say much about what NCCL does internally. Also I'd like to see the output of all_reduce_perf -b 8 -e 4G -f 2 to see what bandwidth in GB/s we get. Given your NICs are 56Gb/s, I'd expect a bus bandwidth of ~6.5 GB/s with NCCL_ALGO=RING, and probably a bit more (10GB/s?) with the default. That would only use one NIC however. I'm not sure I understand what you did to conclude only 4 IB cards were used out of 8. Did you run using two nodes with 8 NICs each?
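A sketch of the two suggested runs (host file and binary path are placeholders; NCCL_ALGO is the standard NCCL environment variable for forcing an algorithm):

```
# Hedged sketch: default algorithm selection first.
mpirun -np 16 --hostfile hosts -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 4G -f 2

# Then forcing the ring algorithm, to compare against the ~6.5 GB/s expectation.
mpirun -np 16 --hostfile hosts -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 4G -f 2
```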

Also on system2, I only see 7 cards in the system topology: mlx5_9 to mlx5_15. It seems mlx5_8 is missing or its port is not in ACTIVE state.

twoflypig commented 2 years ago

> For the NCCL log and topology, I'd need you to run all_reduce_perf on all 16 GPUs.

Here is the all_reduce_perf output for 16 GPUs on two nodes:

nccl-perf2.log

Using the command below.

mpirun -np 16 --hostfile host16p  -x NCCL_IB_HCA=mlx5_15 -x NCCL_SOCKET_IFNAME=ib -x NCCL_DEBUG=INFO  -x PATH -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 2G -f 2 -o sum > nccl-perf2.log
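For comparison, a variant that exposes all eight HCAs to NCCL instead of only mlx5_15 would only change the NCCL_IB_HCA setting (the HCA list below is assumed from the description above):

```
# Hedged variant: let NCCL pick among all eight HCAs.
mpirun -np 16 --hostfile host16p \
  -x NCCL_IB_HCA=mlx5_8,mlx5_9,mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15 \
  -x NCCL_SOCKET_IFNAME=ib -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 2G -f 2 -o sum
```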

> Given your NICs are 56Gb/s, I'd expect a bus bandwidth of ~6.5 GB/s with NCCL_ALGO=RING

I tested with NCCL_ALGO=RING and got the same result: Avg bus bandwidth : 1.35.

> I'm not sure I understand what you did to conclude only 4 IB cards were used out of 8. Did you run using two nodes with 8 NICs each? Also on system2, I only see 7 cards in the system topology: mlx5_9 to mlx5_15. It seems mlx5_8 is missing or its port is not in ACTIVE state.

Yeah, the IB card mlx5_8 was not active last time; it is active now, please see the system.txt below. We use two GPU servers with 16 GPU cards in total to run our code. The following are two log files (reported by both of the two servers). "4 IB cards were used out of 8" means that we export all 8 IB cards (mlx5_8 to mlx5_15), but when I run "ibdump -d mlx5_8" through "ibdump -d mlx5_15", four of the cards show no communication. Specifically, only mlx5_11, mlx5_12, mlx5_9 and mlx5_15 actually carry data.
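A rough sketch of how we check which HCAs carry traffic during a run (assuming ibdump's -d/-w options for selecting the device and the output file; paths are placeholders):

```
# Capture a short window on every HCA while the job runs, then look at which
# capture files stay (nearly) empty.
for hca in mlx5_8 mlx5_9 mlx5_10 mlx5_11 mlx5_12 mlx5_13 mlx5_14 mlx5_15; do
  timeout 30 ibdump -d "$hca" -w "/tmp/${hca}.pcap" &
done
wait
ls -l /tmp/mlx5_*.pcap
```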

log_1.txt log_2.txt system.txt

sjeaugey commented 2 years ago

The system.txt was generated by a communicator which only had one GPU per node so I only see GPU 5. To get the full node topology I need to get it from a communicator running on all GPUs.

Did you get that file running the all_reduce_perf test or running PyTorch? Please get it from all_reduce_perf, not PyTorch (which creates many communicators, some with all GPUs and some with fewer).

Your all_reduce_perf bandwidth peaks at 2.6 GB/s. The average doesn't mean much when running across small and large sizes. This is still less than it should be though.

In any case, a couple of comments:

  1. It doesn't look like GPU Direct RDMA is being used. Is it set up on your nodes?

  2. Can you run again with mpirun --bind-to numa ? By default MPI binds each task to a single core which can impact the NCCL network performance negatively.

twoflypig commented 2 years ago

Thank you for your response. We checked your points and here are the facts:

  1. Here are the system files generated by nccl-perf and PyTorch: nccl_perf_system.txt pytorch_system16.txt

  2. We don't have GDRDMA; could that cause the low speed?

  3. Could the same reason (no GDRDMA) also explain why only "4 IB cards were used out of 8"?

> Can you run again with mpirun --bind-to numa ? By default MPI binds each task to a single core which can impact the NCCL network performance negatively.

I tried that and the output is the same.
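For reference, the bound variant of the earlier command would look roughly like this (same placeholders as before):

```
# Hedged sketch: same run as before, but binding each rank to a NUMA node
# instead of Open MPI's default per-core binding.
mpirun -np 16 --hostfile host16p --bind-to numa \
  -x NCCL_IB_HCA=mlx5_15 -x NCCL_SOCKET_IFNAME=ib -x NCCL_DEBUG=INFO \
  -x PATH -x LD_LIBRARY_PATH \
  ./build/all_reduce_perf -b 8 -e 2G -f 2 -o sum
```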

sjeaugey commented 2 years ago

  1. Thanks, this time I see all 8 GPUs and all 8 IB ports. It would probably have been better to spread the NICs so that there is one NIC (2 ports) per pair of GPUs, instead of having 4 GPUs managing 8 ports and 4 GPUs with no local NIC; alltoall performance would be better that way, and there could be other issues arising from that imbalance, including using only half the ports.
  2. Yes, without GDRDMA performance will likely be lower, as all traffic has to go through CPU memory, and that could well be the bottleneck if your PCI<->CPU memory path is slow, especially if your CPU is configured with 1 NUMA node per socket (NPS1), which seems to be the case here. A quick way to check whether GPU Direct RDMA is available is sketched after this list.
  3. Not sure about this. It could be either the lack of GPU Direct RDMA or the NIC/GPU imbalance.
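A hedged sketch of how to check whether GPU Direct RDMA is available on a node (module names depend on the driver generation; NCCL also reports GDRDMA usage in its NCCL_DEBUG=INFO output):

```
# Check whether a GPUDirect RDMA kernel module is loaded. Recent drivers ship
# nvidia_peermem; older setups use the out-of-tree nv_peer_mem module.
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'

# In the NCCL_DEBUG=INFO log, transport lines mention GDRDMA when GPU Direct
# RDMA is actually used (e.g. "... via NET/IB/0/GDRDMA").
grep GDRDMA nccl-perf2.log
```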