My guess would be that the gpu37 node does not have nv_peer_mem loaded.
[Excellent bug report btw; thanks a lot for taking the time to gather the debug log and topology dump -- it really helped!]
Thanks very much for pointing out the issue.
Indeed nv_peer_mem was not loaded on gpu37, which was a huge oversight on our part.
We would appreciate it if you could give some further clarification regarding the log file.
1. Are the channel counts listed below an intrinsic property of NVLink3? Can we somehow 'visualize' these channels?
gpu32:22612:22693 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
2. From which part of the log file can we deduce that GPUDirect RDMA is missing?
gpu32:22612:22693 [0] NCCL INFO Channel 00 : 0[7000] -> 1[b000] via P2P/IPC/read
gpu32:22612:22693 [0] NCCL INFO Channel 00 : 0[7000] -> 8[7000] [send] via NET/IB/1/GDRDMA
gpu37:52125:52256 [0] NCCL INFO Channel 00 : 0[7000] -> 8[7000] [receive] via NET/IB/0
gpu32:22612:22761 [0] NCCL INFO Channel 00 : 0[7000] -> 12[88000] [send] via NET/IB/1/GDRDMA/Shared
gpu37:52129:52281 [4] NCCL INFO Channel 00 : 0[7000] -> 12[88000] [receive] via NET/IB/5/Shared
First, I assume that the index after IB/ corresponds to the NIC index, and that Shared indicates that a non-RDMA channel was used. In that case, the communication between GPU0 and GPU8 should have gone through a Shared channel as well, but it went through a GPUDirect RDMA channel instead.
3. Does UCX affect the result of the test? My understanding is that UCX should have no effect. There is some overlap between UCX and NCCL: the former also provides CUDA-optimized transports based on CUDA IPC and GPUDirect RDMA. Here, NCCL only requires MPI for basic communication setup, i.e. it is initialized well ahead of any UCX components.
4. How can we interpret the discrepancy between AlltoAll and Allreduce?
With nv_peer_mem correctly loaded, we achieve respectable bandwidths w.r.t. your S51111 presentation at GTC 2023.
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
...
1073741824 16777216 float none -1 51414 20.88 19.58 0 51191 20.98 19.66 N/A
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
...
1073741824 268435456 float sum -1 22490 47.74 89.52 0 22465 47.80 89.62 0
The two tables above are for AlltoAll and Allreduce, respectively. The Allreduce bus bandwidth aligns well with the NICs' aggregate bandwidth of 100 GB/s. Is there a fundamental reason why AlltoAll cannot fully saturate the NICs?
5. Are nv_peer_mem.ko and nvidia-peermem.ko used interchangeably?
Here is another point that confused us in the official GPUDirect documentation.
The former is provided by Mellanox and hosted on GitHub; as we understand it, it requires MLNX_OFED 5.1.
The latter has been provided by the NVIDIA driver since v470 and, unlike Mellanox's module, it seems to support older MLNX_OFED versions.
Please advise which one we should use, since development has stopped on Mellanox's end.
Thanks.
If, for a given node, not a single path says /GDRDMA, then it's a clear sign there is no GPU Direct RDMA. That's what made me think GPU Direct RDMA was missing.
"Then Shared indicates that a non-RDMA channel was used." No, it just means the connection is using shared buffers (for alltoall, we do not allocate one buffer per GPU pair but use a shared pool instead).
Thanks for the clarification regarding (1) and (3).
I understand now, since GDRDMA was completely missing on gpu37. Indeed, when comparing the debug outputs, Shared is only present in alltoall and not in all_reduce, as you pointed out.
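As an aside, here is a minimal sketch (my own illustration, not NCCL's internals) of the memory argument behind the shared pool: with one staging buffer per peer the footprint grows with the number of ranks, while a shared pool stays constant.

/* Illustration only -- not NCCL code. It contrasts per-peer staging buffers
 * (memory grows with the number of ranks) with a fixed shared pool (constant
 * memory, connections take turns on slots); the latter is what the "Shared"
 * tag in the log refers to. SLOT_BYTES and POOL_SLOTS are made-up numbers. */
#include <stdio.h>

#define SLOT_BYTES (4u << 20)   /* hypothetical staging slot size: 4 MiB */
#define POOL_SLOTS 8            /* hypothetical number of shared slots   */

int main(void) {
    for (int nranks = 8; nranks <= 64; nranks *= 2) {
        size_t per_peer = (size_t)(nranks - 1) * SLOT_BYTES;  /* one slot per peer       */
        size_t shared   = (size_t)POOL_SLOTS * SLOT_BYTES;    /* fixed, rank-independent */
        printf("nranks=%2d  per-peer staging=%4zu MiB  shared pool=%3zu MiB\n",
               nranks, per_peer >> 20, shared >> 20);
    }
    return 0;
}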
I'd run allreduce from 8B to 16G to really see what the peak bandwidth is; it should be able to reach 96GB/s on an IB system. 89 is a bit sub-par but it could be due to the 1G size, or to the fact you only run one size and the clocks don't have time to ramp-up.
Sorry for causing confusion. I scanned from 16B up to 1G but truncated the output to show only the bandwidth at 1G. I've rerun as suggested.
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
16 4 float sum -1 29.55 0.00 0.00 0 28.19 0.00 0.00 0
32 8 float sum -1 28.70 0.00 0.00 0 26.80 0.00 0.00 0
64 16 float sum -1 27.46 0.00 0.00 0 27.87 0.00 0.00 0
128 32 float sum -1 28.21 0.00 0.01 0 27.45 0.00 0.01 0
256 64 float sum -1 28.44 0.01 0.02 0 28.08 0.01 0.02 0
512 128 float sum -1 30.89 0.02 0.03 0 31.08 0.02 0.03 0
1024 256 float sum -1 31.40 0.03 0.06 0 31.53 0.03 0.06 0
2048 512 float sum -1 33.84 0.06 0.11 0 33.14 0.06 0.12 0
4096 1024 float sum -1 35.96 0.11 0.21 0 33.88 0.12 0.23 0
8192 2048 float sum -1 38.91 0.21 0.39 0 36.70 0.22 0.42 0
16384 4096 float sum -1 42.69 0.38 0.72 0 40.15 0.41 0.77 0
32768 8192 float sum -1 45.31 0.72 1.36 0 45.38 0.72 1.35 0
65536 16384 float sum -1 50.93 1.29 2.41 0 49.81 1.32 2.47 0
131072 32768 float sum -1 71.16 1.84 3.45 0 67.73 1.94 3.63 0
262144 65536 float sum -1 66.46 3.94 7.40 0 64.95 4.04 7.57 0
524288 131072 float sum -1 75.59 6.94 13.01 0 75.77 6.92 12.97 0
1048576 262144 float sum -1 96.34 10.88 20.41 0 96.04 10.92 20.47 0
2097152 524288 float sum -1 134.4 15.60 29.25 0 120.0 17.48 32.77 0
4194304 1048576 float sum -1 161.2 26.03 48.80 0 161.3 26.01 48.77 0
8388608 2097152 float sum -1 232.0 36.16 67.80 0 232.5 36.08 67.65 0
16777216 4194304 float sum -1 408.2 41.10 77.05 0 393.8 42.60 79.87 0
33554432 8388608 float sum -1 770.8 43.53 81.62 0 794.7 42.22 79.17 0
67108864 16777216 float sum -1 1380.8 48.60 91.13 0 1393.1 48.17 90.33 0
134217728 33554432 float sum -1 2759.5 48.64 91.20 0 2697.1 49.76 93.31 0
268435456 67108864 float sum -1 5679.0 47.27 88.63 0 5677.3 47.28 88.65 0
536870912 134217728 float sum -1 11296 47.53 89.11 0 11277 47.61 89.26 0
1073741824 268435456 float sum -1 22545 47.63 89.30 0 22555 47.61 89.26 0
2147483648 536870912 float sum -1 44954 47.77 89.57 0 44983 47.74 89.51 0
4294967296 1073741824 float sum -1 89861 47.80 89.62 0 89837 47.81 89.64 0
8589934592 2147483648 float sum -1 179715 47.80 89.62 0 179663 47.81 89.65 0
17179869184 4294967296 float sum -1 359659 47.77 89.56 0 359633 47.77 89.57 0
The peak bandwidth is indeed 91.2 GB/s, at 128 MB, with v1.1 of the GPUDirect kernel module. To reach 96 GB/s, we will try again with the latest version (v1.3).
I think you're right, nv_peer_mem was the former module which we used for a very long time but which should be replaced by nvidia-peermem. I'm not authoritative on this point however; that question should be asked to the network support team.
We will ask on the NVIDIA forum for further clarification.
Updating GPU Direct RDMA won't help performance. It either works or it doesn't. I'd first suggest making sure the firmware of the NIC is up to date. What is the NIC type? Is it CX-5 or CX-6? Depending on the NIC model and firmware level, you may need to increase MAX_ACC_OUT_READ to 44. You can check this page for how to set that, or simply reach out to network support.
Sorry for the belated follow-up.
We are using CX-6:
lspci | grep Connect
0e:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
MAX_ACC_OUT_READ has been raised to 44 per your suggestion for all PCIe adapters.
The performance is now ~93.25 GB/s, which I believe is within the margin of error.
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
...
33554432 8388608 float sum -1 765.9 43.81 82.15 0 753.2 44.55 83.53 0
67108864 16777216 float sum -1 1367.2 49.09 92.04 0 1362.3 49.26 92.36 0
134217728 33554432 float sum -1 2698.7 49.73 93.25 0 2695.6 49.79 93.36 0
268435456 67108864 float sum -1 5665.7 47.38 88.84 0 5663.7 47.40 88.87 0
536870912 134217728 float sum -1 11274 47.62 89.29 0 11271 47.63 89.31 0
1073741824 268435456 float sum -1 22490 47.74 89.52 0 22489 47.75 89.52 0
2147483648 536870912 float sum -1 44914 47.81 89.65 0 44905 47.82 89.67 0
4294967296 1073741824 float sum -1 89759 47.85 89.72 0 89764 47.85 89.71 0
8589934592 2147483648 float sum -1 179483 47.86 89.74 0 179523 47.85 89.72 0
17179869184 4294967296 float sum -1 359425 47.80 89.62 0 359124 47.84 89.70 0
We are using NCCL 2.11.4, though. According to the release notes, 2.12 brings additional performance to AlltoAll and Allreduce:
https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/
Since we are using a 2:1 GPU:NIC topology, this does not apply. Is this a correct understanding?
To add to the previous message, we have tested again with NCCL 2.13.4 from NVHPC 22.7. The bandwidths for AlltoAll and Allreduce are 21 GB/s and 93 GB/s, respectively, so we do not see a big difference from the PXN feature.
PXN will not improve peak bandwidth if peak bandwidth is already perfect. It will only improve the time for small operations (base latency) thanks to message aggregation.
It looks like everything is good now. Feel free to close the bug if so. Thanks!
Thanks for the explanation.
Before closing the issue, there is one more point that I hope you can clarify.
I am trying to derive the formal relationship between algbw and busbw for the AlltoAll case.
ncclGroupStart();
for (int r = 0; r < nranks; r++) {
    // one send and one receive per peer; grouped together, these implement AlltoAll
    ncclSend(sendbuff[r], sendcount, sendtype, r, comm, stream);
    ncclRecv(recvbuff[r], recvcount, recvtype, r, comm, stream);
}
ncclGroupEnd();
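As a quick numeric cross-check (my own sketch, not part of nccl-tests), the relationship can be verified against the 1 GB AlltoAll row reported earlier (51414 us on 16 ranks), assuming the (k-1)/k convention used by the perf tests:

/* Sketch only: reproduce the algbw/busbw pair of the 1 GB AlltoAll row above,
 * assuming busbw = algbw * (k-1)/k (the local chunk never crosses a link). */
#include <stdio.h>

int main(void) {
    const double bytes   = 1073741824.0;  /* total bytes per rank (1 GB row) */
    const double time_us = 51414.0;       /* measured time from the table    */
    const int    k       = 16;            /* ranks: 2 nodes x 8 GPUs         */

    const double algbw = bytes / (time_us * 1e3);      /* bytes / ns = GB/s       */
    const double busbw = algbw * (k - 1) / (double)k;  /* exclude the local chunk */

    printf("algbw = %.2f GB/s, busbw = %.2f GB/s\n", algbw, busbw);
    return 0;
}

Compiled and run, this prints algbw = 20.88 GB/s and busbw = 19.58 GB/s, matching the out-of-place column of the table.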
From the ncclGroup loop above, what I obtain is $(k-1)\frac{N}{B}$ ($S$ simply cancels out). However, from your previous explanation:
Given on 2 nodes, 50% of the data goes through NVLink and 50% goes through the network, then the perceived performance of alltoall will be at most 24GB/s. As you scale to larger number of nodes, the bandwidth should go down to 12 GB/s.
The correct coefficient should be $k/(k-1)$ instead.
The coefficient is k/(k-1) as NCCL perf tests do not assume any specific topology; instead they consider flat topologies.
That works great on a single node with e.g. nvswitch where all GPUs access all others through NVSwitch. But when running on e.g. 2 nodes, the reality is that 7/16 go through NVLink and are not the bottleneck, and 8/16 go through the network and are the bottleneck. So in that case, I'd look at the algbw and divide it by 2 to get the real bw on the network.
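For reference, a compact way to write that bookkeeping (my own summary, assuming the flat-topology convention of the perf tests, with $S$ the bytes per rank, $k$ the number of ranks and $t$ the measured time):

$$\mathrm{algbw} = \frac{S}{t}, \qquad \mathrm{busbw} = \frac{(k-1)\,S/k}{t} = \frac{k-1}{k}\,\mathrm{algbw} \quad\Longleftrightarrow\quad \mathrm{algbw} = \frac{k}{k-1}\,\mathrm{busbw},$$

since the local $S/k$ chunk never crosses a link. On 2 nodes only about half of each rank's traffic actually crosses the network, so the measured algbw of roughly 21 GB/s corresponds to about 10.5 GB/s of network traffic per GPU, in line with the "at most 24 GB/s" estimate quoted above.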
Thanks. I understand it clearly now.
The $\frac{n}{n-1}$ coefficient can be derived from your phenomenological explanation regarding the ratio of IB to NVLink traffic.
For reference, I have amended the derivation in my previous message.
Unless you have further comments, I would like to close this issue later. Thanks very much for your time. I've learnt a lot through this valuable discussion.
Hello,
I would appreciate it if you could share some insights into the origin of the poor internode alltoall performance on our A100 system.
System spec:
OEM: XL675d
CPU: 2 x AMD EPYC 7543
GPU: 8 x A100-SXM4 (Driver: 510.47.03)
Interconnect: 4 x HDR 200Gbps
OS: CentOS Linux release 7.9.2009 (Core)
Env: gcc/10.2.0, cuda/11.4, cudampi/openmpi-4.1.1 (with UCX 11.2)
GPUDirect RDMA: enabled
PCI Access Control Services (ACS): disabled
A100 topology
NCCL debug and topology dump debug.txt topology.txt
The tests were built with MPI=1.
all_reduce perf: 30 GB/s vs. Theoretical 100 GB/s (25GB/s per HDR)
nThread 1 nGpus 1 minBytes 16 maxBytes 4294967296 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
Using devices
Rank 0 Group 0 Pid 47689 on gpu32 device 0 [0x07] NVIDIA A100-SXM4-80GB
Rank 1 Group 0 Pid 47690 on gpu32 device 1 [0x0b] NVIDIA A100-SXM4-80GB
Rank 2 Group 0 Pid 47691 on gpu32 device 2 [0x48] NVIDIA A100-SXM4-80GB
Rank 3 Group 0 Pid 47692 on gpu32 device 3 [0x4c] NVIDIA A100-SXM4-80GB
Rank 4 Group 0 Pid 47693 on gpu32 device 4 [0x88] NVIDIA A100-SXM4-80GB
Rank 5 Group 0 Pid 47694 on gpu32 device 5 [0x8b] NVIDIA A100-SXM4-80GB
Rank 6 Group 0 Pid 47695 on gpu32 device 6 [0xc8] NVIDIA A100-SXM4-80GB
Rank 7 Group 0 Pid 47696 on gpu32 device 7 [0xcb] NVIDIA A100-SXM4-80GB
Rank 8 Group 0 Pid 40144 on gpu37 device 0 [0x07] NVIDIA A100-SXM4-80GB
Rank 9 Group 0 Pid 40145 on gpu37 device 1 [0x0b] NVIDIA A100-SXM4-80GB
Rank 10 Group 0 Pid 40146 on gpu37 device 2 [0x48] NVIDIA A100-SXM4-80GB
Rank 11 Group 0 Pid 40147 on gpu37 device 3 [0x4c] NVIDIA A100-SXM4-80GB
Rank 12 Group 0 Pid 40148 on gpu37 device 4 [0x88] NVIDIA A100-SXM4-80GB
Rank 13 Group 0 Pid 40149 on gpu37 device 5 [0x8b] NVIDIA A100-SXM4-80GB
Rank 14 Group 0 Pid 40150 on gpu37 device 6 [0xc8] NVIDIA A100-SXM4-80GB
Rank 15 Group 0 Pid 40151 on gpu37 device 7 [0xcb] NVIDIA A100-SXM4-80GB
NCCL version 2.11.4+cuda11.4
#                          out-of-place                       in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
268435456 67108864 float sum -1 15497 17.32 32.48 0 15415 17.41 32.65 0
1073741824 268435456 float sum -1 62740 17.11 32.09 0 62765 17.11 32.08 0
4294967296 1073741824 float sum -1 270855 15.86 29.73 0 271182 15.84 29.70 0
Out of bounds values : 0 OK
Avg bus bandwidth : 12.9065
#
$ mpirun \
    -np 16 \
    --hostfile hostfile_16 \
    ../build/alltoall_perf \
    -b 16 -e 4G -f 4 -g 1
nThread 1 nGpus 1 minBytes 16 maxBytes 4294967296 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
Using devices
Rank 0 Group 0 Pid 34180 on gpu32 device 0 [0x07] NVIDIA A100-SXM4-80GB
Rank 1 Group 0 Pid 34181 on gpu32 device 1 [0x0b] NVIDIA A100-SXM4-80GB
Rank 2 Group 0 Pid 34182 on gpu32 device 2 [0x48] NVIDIA A100-SXM4-80GB
Rank 3 Group 0 Pid 34183 on gpu32 device 3 [0x4c] NVIDIA A100-SXM4-80GB
Rank 4 Group 0 Pid 34184 on gpu32 device 4 [0x88] NVIDIA A100-SXM4-80GB
Rank 5 Group 0 Pid 34185 on gpu32 device 5 [0x8b] NVIDIA A100-SXM4-80GB
Rank 6 Group 0 Pid 34186 on gpu32 device 6 [0xc8] NVIDIA A100-SXM4-80GB
Rank 7 Group 0 Pid 34187 on gpu32 device 7 [0xcb] NVIDIA A100-SXM4-80GB
Rank 8 Group 0 Pid 23210 on gpu37 device 0 [0x07] NVIDIA A100-SXM4-80GB
Rank 9 Group 0 Pid 23211 on gpu37 device 1 [0x0b] NVIDIA A100-SXM4-80GB
Rank 10 Group 0 Pid 23212 on gpu37 device 2 [0x48] NVIDIA A100-SXM4-80GB
Rank 11 Group 0 Pid 23213 on gpu37 device 3 [0x4c] NVIDIA A100-SXM4-80GB
Rank 12 Group 0 Pid 23214 on gpu37 device 4 [0x88] NVIDIA A100-SXM4-80GB
Rank 13 Group 0 Pid 23215 on gpu37 device 5 [0x8b] NVIDIA A100-SXM4-80GB
Rank 14 Group 0 Pid 23216 on gpu37 device 6 [0xc8] NVIDIA A100-SXM4-80GB
Rank 15 Group 0 Pid 23217 on gpu37 device 7 [0xcb] NVIDIA A100-SXM4-80GB
NCCL version 2.11.4+cuda11.4
#                          out-of-place                       in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
268435456 4194304 float none -1 57613 4.66 4.37 0 57768 4.65 4.36 N/A
1073741824 16777216 float none -1 227992 4.71 4.42 0 228250 4.70 4.41 N/A
4294967296 67108864 float none -1 934563 4.60 4.31 0 935303 4.59 4.31 N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 2.00892
#