wangshengnan511799 opened this issue 2 years ago
Hi, can you give more information as to the NCCL performance you are getting, and the topology of your system: GPU type (A100?), NIC type (CX6?), PCI speed (Gen4x16?), whether there is NVLink or not, etc ...
Or just the output of the NCCL perf test with NCCL_DEBUG=INFO,
plus the node topology, which you can dump by setting NCCL_TOPO_DUMP_FILE=system.txt
(then attach system.txt here alongside the output log).
Thanks!
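For reference, a minimal single-node invocation that captures both at once could look like the line below (a sketch assuming the all_reduce_perf binary from nccl-tests under ./build, as used later in this thread; adjust the GPU count and size range to your setup):

# Dump the NCCL debug log and the detected node topology in one run
NCCL_DEBUG=INFO NCCL_TOPO_DUMP_FILE=system.txt ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8 > nccl-debug.log 2>&1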
Thanks for your response. We are using an A100 server with 8 GPUs, and the server also has 8 IB NICs. We found that 1) at most four IB cards can work simultaneously; 2) the bandwidth of each IB card is 56 Gb/s (7 GB/s), but when I export only one IB card and check with ibdump, I see only about 1 GB/s of traffic.
The following are the system.txt files: system_1.txt exports only one IB card (mlx5_15), and system_2.txt exports all 8 IB cards. system_2.txt system_1.txt
With system_2.txt (all 8 IB cards exported), the ibdump results show that only 4 IB cards carry data traffic, even though all 8 cards are active. With system_1.txt (only one IB card exported), the communication speed is still slower than expected.
Hello, the NCCL INFO log can be seen below:
Sorry I wasn't precise enough. For the NCCL log and topology, I'd need you to run all_reduce_perf on all 16 GPUs. Running on a single GPU doesn't say much about what NCCL does internally. Also I'd like to see the output of all_reduce_perf -b 8 -e 4G -f 2
to see what bandwidth in GB/s we get. Given your NICs are 56Gb/s, I'd expect a bus bandwidth of ~6.5 GB/s with NCCL_ALGO=RING, and probably a bit more (10GB/s?) with the default. That would only use one NIC however. I'm not sure I understand what you did to conclude only 4 IB cards were used out of 8. Did you run using two nodes with 8 NICs each?
Also on system2, I only see 7 cards in the system topology: mlx5_9 to mlx5_15. It seems mlx5_8 is missing or its port is not in ACTIVE state.
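For what it's worth, a quick way to confirm each port's state and rate is the standard ibstat tool (a sketch assuming the InfiniBand diagnostics package is installed and the HCA names match those used in this thread):

# Each HCA should report State: Active and Rate: 56 for an FDR link
for dev in mlx5_8 mlx5_9 mlx5_10 mlx5_11 mlx5_12 mlx5_13 mlx5_14 mlx5_15; do
    echo "== $dev =="
    ibstat "$dev" | grep -E 'State|Rate'
done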
> For the NCCL log and topology, I'd need you to run all_reduce_perf on all 16 GPUs.
Here is the all_reduce_perf result for 16 GPUs on two nodes, using the command below:
mpirun -np 16 --hostfile host16p -x NCCL_IB_HCA=mlx5_15 -x NCCL_SOCKET_IFNAME=ib -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 2G -f 2 -o sum > nccl-perf2.log
> Given your NICs are 56Gb/s, I'd expect a bus bandwidth of ~6.5 GB/s with NCCL_ALGO=RING
I tested with NCCL_ALGO=RING and got the same output: Avg bus bandwidth : 1.35
> I'm not sure I understand what you did to conclude only 4 IB cards were used out of 8. Did you run using two nodes with 8 NICs each? Also on system2, I only see 7 cards in the system topology: mlx5_9 to mlx5_15. It seems mlx5_8 is missing or its port is not in ACTIVE state.
Yeah, the IB card mlx5_8 was not active last time; now it is active, please see the system.txt below. We use two GPU servers with 16 GPUs in total to run our code. The following are two log files (one reported by each server). "4 IB cards were used out of 8" means that we export all 8 IB cards (mlx5_8 to mlx5_15), but when I run "ibdump -d mlx5_8" through "ibdump -d mlx5_15", four of the IB cards show no communication. Specifically, only mlx5_11, mlx5_12, mlx5_9 and mlx5_15 actually carry data.
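As a cross-check on the ibdump observation, the per-HCA traffic counters in sysfs can show which cards actually move data during a run (a sketch assuming single-port HCAs exposed under /sys/class/infiniband; port numbers and paths may differ on other systems):

# Snapshot the transmit counter of each HCA before and after a run;
# cards whose port_xmit_data barely changes carried no NCCL traffic.
for dev in mlx5_8 mlx5_9 mlx5_10 mlx5_11 mlx5_12 mlx5_13 mlx5_14 mlx5_15; do
    echo "$dev: $(cat /sys/class/infiniband/$dev/ports/1/counters/port_xmit_data)"
done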
The system.txt was generated by a communicator which only had one GPU per node, so I only see GPU 5. To get the full node topology, I need it from a communicator running on all GPUs.
Did you get that file by running the all_reduce_perf test or by running PyTorch? Please get it from all_reduce_perf, not PyTorch (which creates many communicators, some with all GPUs and some with fewer).
Your all_reduce_perf bandwidth peaks at 2.6 GB/s. The average doesn't mean much when running across small and large sizes. This is still less than it should be though.
In any case, a couple of comments:
1) Can you run again with mpirun --bind-to numa? By default MPI binds each task to a single core, which can impact the NCCL network performance negatively.
2) The log shows via NET/IB/0 instead of via NET/IB/0/GDRDMA. Is GPU Direct RDMA installed and functional?

Thank you for your response. We checked your points and here are the facts:
Here are the system files generated by the NCCL perf test and by PyTorch: nccl_perf_system.txt pytorch_system16.txt
We don't have GDRDMA; could that be the cause of the low speed?
Could the same reason (no GDRDMA) also explain why only "4 IB cards were used out of 8"?
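One quick way to check whether a GPUDirect RDMA kernel module is present is shown below (a sketch assuming the setup uses either the newer nvidia_peermem module or the older nv_peer_mem module; if neither is loaded, NCCL falls back to plain NET/IB without GDRDMA):

# No output here means no GPUDirect RDMA module is loaded
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'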
> Can you run again with mpirun --bind-to numa? By default MPI binds each task to a single core which can impact the NCCL network performance negatively.
I tried that and the output is the same.
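(For reference, the earlier command with the binding flag added would look like the line below; everything except --bind-to numa is taken from the command posted above, and nccl-perf3.log is just a placeholder name.)

mpirun -np 16 --hostfile host16p --bind-to numa -x NCCL_IB_HCA=mlx5_15 -x NCCL_SOCKET_IFNAME=ib -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 2G -f 2 -o sum > nccl-perf3.log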
We are using the IB cards of the A100 server for communication. The bandwidth of each IB card is 7 GB/s, but only about 1 GB/s is achieved, according to the statistics from ibdump.