Open zarzen opened 4 years ago
When using more GPUs, we also use more resources on the node: more PCI lanes, more CPU cores, etc. So it can be slower to have data going in and out of the same GPU, sent and received by the same CPU, than having data go in one GPU, traverse the node through NVLink (which is nowhere near being a bottleneck here), then exit the node from another GPU, using another CPU to send through the network.
Note that if the bottleneck is the CPU, increasing NCCL_SOCKET_NTHREADS should help.
But here, using more GPUs actually results in better network bandwidth utilization, which seems counter-intuitive to me. As you said, with more GPUs we use more resources on the node, so ranks compete for resources, which should slow down data movement in and out of each GPU. Shouldn't 1 GPU on each node then perform better than 8 GPUs on each node? (It does not.)
No, using more cores or more PCI buses can help lower the load on each. For example, each GPU will either receive from PCI and send through NVLink, or receive from NVLink and send through PCI, but never receive from PCI and send through PCI, which can cause lower bandwidth.
In the precise example of TCP/IP traffic, with 2+ GPUs you will use 2x more cores for network processing compared to using a single GPU, as each GPU has its own set of network threads. This can make a significant performance difference if that's the bottleneck.
Sorry, I didn't fully get the point here.
For a ring structure like this: even though each GPU doesn't both receive from PCI and send through PCI, GPU 0 and GPU 1 still share the same PCI. Is there a big difference compared to sending and receiving through PCI on the same GPU?
For the second example, do you mean that with a single GPU on each node, by default there are only two socket threads (AWS environment) serving both send and recv, while with 2+ GPUs on each node there will be two threads for GPU 0 to send data and two threads for GPU 1 to receive data?
I'm not sure why you say they share the same PCI. On the NIC side, yes, they do; on the GPU side they don't. And sometimes (especially with 8 GPUs) GPU 0 and GPU 7 are on different CPU sockets, so they use different CPU memory banks, which are also a regular bottleneck. That can also mean splitting interrupts across different sockets. Now I'm not saying that's the reason for your performance difference, just that it's often the case on various platforms that using more GPUs means getting higher bandwidth.
For your problem, yes, running on 2+ GPUs might mean doubling the effective number of threads sending/receiving data. I'd need to look at the code again, and it might have changed since 2.4, but it could well be the case. Running top, you should be able to see that easily.
Setting NCCL_SOCKET_NTHREADS=4 or NCCL_SOCKET_NTHREADS=6 didn't seem to improve the bus bandwidth for 2 GPUs on two nodes. The following is the log for NCCL_SOCKET_NTHREADS=6:
$ /opt/amazon/openmpi/bin/mpirun -np 2 -H ip-172-31-0-112:1,ip-172-31-13-129:1 \
> -bind-to none -map-by slot \
> -x PATH=/opt/amazon/openmpi/bin:$PATH \
> -x NCCL_DEBUG=INFO \
> -x NCCL_SOCKET_NTHREADS=6 \
> -x NCCL_TREE_THRESHOLD=0 \
> -x LD_LIBRARY_PATH=/home/ubuntu/nccl/build/lib:$LD_LIBRARY_PATH \
> -mca btl ^openib \
> -mca btl_tcp_if_exclude lo,docker0 \
> /home/ubuntu/nccl-tests/build/all_reduce_perf -b 8M -e 256M -c 1 -f 2 -n 50
# nThread 1 nGpus 1 minBytes 8388608 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 50 validation: 1
#
# Using devices
# Rank 0 Pid 7846 on ip-172-31-0-112 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 1 Pid 7998 on ip-172-31-13-129 device 0 [0x00] Tesla V100-SXM2-32GB
ip-172-31-0-112:7846:7846 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:7846:7846 [0] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:7846:7846 [0] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:7846:7846 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
NCCL version 2.4.8+cuda10.1
ip-172-31-13-129:7998:7998 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:7998:7998 [0] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:7998:7998 [0] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:7998:7998 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-0-112:7846:7870 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:7846:7870 [0] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:7998:8021 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:7998:8021 [0] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-0-112:7846:7870 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance : PHB
ip-172-31-13-129:7998:8021 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance : PHB
ip-172-31-0-112:7846:7870 [0] NCCL INFO Channel 00 : 0 1
ip-172-31-0-112:7846:7870 [0] NCCL INFO Channel 01 : 0 1
ip-172-31-0-112:7846:7870 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO NCCL_SOCKET_NTHREADS set by environment to 6.
ip-172-31-0-112:7846:7870 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-13-129:7998:8021 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
ip-172-31-13-129:7998:8021 [0] NCCL INFO NCCL_SOCKET_NTHREADS set by environment to 6.
ip-172-31-13-129:7998:8021 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-13-129:7998:8021 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO Ring 01 : 1 -> 0 [receive] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-13-129:7998:8021 [0] NCCL INFO Ring 01 : 0 -> 1 [receive] via NET/Socket/0
ip-172-31-13-129:7998:8021 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-0-112:7846:7870 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
ip-172-31-13-129:7998:8021 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
ip-172-31-0-112:7846:7870 [0] NCCL INFO comm 0x7f4d70002350 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
ip-172-31-13-129:7998:8021 [0] NCCL INFO comm 0x7fb80c002350 rank 1 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
ip-172-31-0-112:7846:7846 [0] NCCL INFO Launch mode Parallel
8388608 2097152 float sum 1856.6 4.52 4.52 0e+00 1864.3 4.50 4.50 0e+00
16777216 4194304 float sum 3585.0 4.68 4.68 0e+00 3563.1 4.71 4.71 0e+00
33554432 8388608 float sum 6987.2 4.80 4.80 0e+00 6973.5 4.81 4.81 0e+00
67108864 16777216 float sum 13819 4.86 4.86 0e+00 13678 4.91 4.91 0e+00
134217728 33554432 float sum 27567 4.87 4.87 0e+00 27780 4.83 4.83 0e+00
268435456 67108864 float sum 56147 4.78 4.78 0e+00 55876 4.80 4.80 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 4.75571
#
If you are after bandwidth, shouldn't you use the AWS EFA adapter by adding -x FI_PROVIDER="efa"?
On Wed, 7 Oct 2020 at 06:14, Zhang Zhen notifications@github.com wrote:
For comparison, here is the log for 16 GPUs:
$ /opt/amazon/openmpi/bin/mpirun -np 16 -H ip-172-31-0-112:8,ip-172-31-13-129:8 \
> -bind-to none -map-by slot \
> -x PATH=/opt/amazon/openmpi/bin:$PATH \
> -x NCCL_DEBUG=INFO \
> -x NCCL_SOCKET_NTHREADS=6 \
> -x NCCL_TREE_THRESHOLD=0 \
> -x LD_LIBRARY_PATH=/home/ubuntu/nccl/build/lib:$LD_LIBRARY_PATH \
> -mca btl ^openib \
> -mca btl_tcp_if_exclude lo,docker0 \
> /home/ubuntu/nccl-tests/build/all_reduce_perf -b 8M -e 256M -c 1 -f 2 -n 50
# nThread 1 nGpus 1 minBytes 8388608 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 50 validation: 1
#
# Using devices
# Rank 0 Pid 8159 on ip-172-31-0-112 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 1 Pid 8160 on ip-172-31-0-112 device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 2 Pid 8161 on ip-172-31-0-112 device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 3 Pid 8162 on ip-172-31-0-112 device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 4 Pid 8163 on ip-172-31-0-112 device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 5 Pid 8164 on ip-172-31-0-112 device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 6 Pid 8165 on ip-172-31-0-112 device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 7 Pid 8166 on ip-172-31-0-112 device 7 [0x00] Tesla V100-SXM2-32GB
# Rank 8 Pid 8537 on ip-172-31-13-129 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 9 Pid 8538 on ip-172-31-13-129 device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 10 Pid 8539 on ip-172-31-13-129 device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 11 Pid 8540 on ip-172-31-13-129 device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 12 Pid 8541 on ip-172-31-13-129 device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 13 Pid 8542 on ip-172-31-13-129 device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 14 Pid 8543 on ip-172-31-13-129 device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 15 Pid 8544 on ip-172-31-13-129 device 7 [0x00] Tesla V100-SXM2-32GB
ip-172-31-0-112:8159:8159 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8159:8159 [0] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:8159:8159 [0] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:8159:8159 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
NCCL version 2.4.8+cuda10.1
ip-172-31-13-129:8537:8537 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8537:8537 [0] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8537:8537 [0] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:8537:8537 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8543:8543 [6] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8543:8543 [6] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8543:8543 [6] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:8543:8543 [6] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8541:8541 [4] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8541:8541 [4] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8538:8538 [1] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8541:8541 [4] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:8541:8541 [4] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8538:8538 [1] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8538:8538 [1] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:8538:8538 [1] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-0-112:8161:8161 [2] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8161:8161 [2] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8542:8542 [5] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-0-112:8161:8161 [2] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:8161:8161 [2] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
ip-172-31-13-129:8542:8542 [5] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8542:8542 [5] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:8542:8542 [5] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-0-112:8165:8165 [6] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-13-129:8544:8544 [7] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8539:8539 [2] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8540:8540 [3] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>
ip-172-31-0-112:8165:8165 [6] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:8165:8165 [6] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:8165:8165 [6] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
ip-172-31-13-129:8544:8544 [7] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8540:8540 [3] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8539:8539 [2] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:8544:8544 [7] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:8544:8544 [7] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8539:8539 [2] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:8540:8540 [3] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:8539:8539 [2] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-13-129:8540:8540 [3] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-0-112:8160:8160 [1] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8163:8163 [4] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8160:8160 [1] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:8163:8163 [4] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:8160:8160 [1] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:8163:8163 [4] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:8160:8160 [1] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8163:8163 [4] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8162:8162 [3] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8166:8166 [7] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8164:8164 [5] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8162:8162 [3] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:8166:8166 [7] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:8162:8162 [3] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:8164:8164 [5] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:8162:8162 [3] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8166:8166 [7] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:8166:8166 [7] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8164:8164 [5] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:8164:8164 [5] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
ip-172-31-0-112:8159:8218 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8159:8218 [0] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-0-112:8161:8219 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8161:8219 [2] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-0-112:8165:8220 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8165:8220 [6] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8537:8596 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:8537:8596 [0] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-0-112:8162:8221 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8162:8221 [3] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-0-112:8163:8222 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8163:8222 [4] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-0-112:8160:8223 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8160:8223 [1] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8543:8597 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8166:8224 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8166:8224 [7] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8543:8597 [6] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-0-112:8164:8225 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:8164:8225 [5] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8541:8598 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:8541:8598 [4] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8542:8600 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:8542:8600 [5] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8544:8601 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:8544:8601 [7] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8538:8599 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:8538:8599 [1] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8539:8603 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:8539:8603 [2] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8540:8602 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:8540:8602 [3] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:8544:8601 [7] NCCL INFO CUDA Dev 7[7], Socket NIC distance : PHB
ip-172-31-13-129:8543:8597 [6] NCCL INFO CUDA Dev 6[6], Socket NIC distance : PHB
ip-172-31-13-129:8542:8600 [5] NCCL INFO CUDA Dev 5[5], Socket NIC distance : PHB
ip-172-31-13-129:8541:8598 [4] NCCL INFO CUDA Dev 4[4], Socket NIC distance : PHB
ip-172-31-13-129:8540:8602 [3] NCCL INFO CUDA Dev 3[3], Socket NIC distance : PHB
ip-172-31-13-129:8539:8603 [2] NCCL INFO CUDA Dev 2[2], Socket NIC distance : PHB
ip-172-31-13-129:8537:8596 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance : PHB
ip-172-31-13-129:8538:8599 [1] NCCL INFO CUDA Dev 1[1], Socket NIC distance : PHB
ip-172-31-0-112:8160:8223 [1] NCCL INFO CUDA Dev 1[1], Socket NIC distance : PHB
ip-172-31-0-112:8164:8225 [5] NCCL INFO CUDA Dev 5[5], Socket NIC distance : PHB
ip-172-31-0-112:8166:8224 [7] NCCL INFO CUDA Dev 7[7], Socket NIC distance : PHB
ip-172-31-0-112:8162:8221 [3] NCCL INFO CUDA Dev 3[3], Socket NIC distance : PHB
ip-172-31-0-112:8161:8219 [2] NCCL INFO CUDA Dev 2[2], Socket NIC distance : PHB
ip-172-31-0-112:8159:8218 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance : PHB
ip-172-31-0-112:8163:8222 [4] NCCL INFO CUDA Dev 4[4], Socket NIC distance : PHB
ip-172-31-0-112:8165:8220 [6] NCCL INFO CUDA Dev 6[6], Socket NIC distance : PHB
ip-172-31-0-112:8159:8218 [0] NCCL INFO Channel 00 : 0 1 3 2 6 4 5 7 8 9 11 10 14 12 13 15
ip-172-31-0-112:8159:8218 [0] NCCL INFO Channel 01 : 0 1 3 2 6 4 5 7 8 9 11 10 14 12 13 15
ip-172-31-0-112:8159:8218 [0] NCCL INFO Ring 00 : 15 -> 0 [receive] via NET/Socket/0
ip-172-31-0-112:8159:8218 [0] NCCL INFO NCCL_SOCKET_NTHREADS set by environment to 6.
ip-172-31-0-112:8159:8218 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-13-129:8537:8596 [0] NCCL INFO Ring 00 : 7 -> 8 [receive] via NET/Socket/0
ip-172-31-13-129:8537:8596 [0] NCCL INFO NCCL_SOCKET_NTHREADS set by environment to 6.
ip-172-31-13-129:8537:8596 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-0-112:8159:8218 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
ip-172-31-13-129:8537:8596 [0] NCCL INFO Ring 00 : 8[0] -> 9[1] via P2P/IPC
ip-172-31-0-112:8160:8223 [1] NCCL INFO Ring 00 : 1[1] -> 3[3] via P2P/IPC
ip-172-31-0-112:8162:8221 [3] NCCL INFO Ring 00 : 3[3] -> 2[2] via P2P/IPC
ip-172-31-13-129:8542:8600 [5] NCCL INFO Ring 00 : 13[5] -> 15[7] via P2P/IPC
ip-172-31-13-129:8539:8603 [2] NCCL INFO Ring 00 : 10[2] -> 14[6] via P2P/IPC
ip-172-31-13-129:8543:8597 [6] NCCL INFO Ring 00 : 14[6] -> 12[4] via P2P/IPC
ip-172-31-0-112:8161:8219 [2] NCCL INFO Ring 00 : 2[2] -> 6[6] via P2P/IPC
ip-172-31-13-129:8541:8598 [4] NCCL INFO Ring 00 : 12[4] -> 13[5] via P2P/IPC
ip-172-31-0-112:8163:8222 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
ip-172-31-13-129:8538:8599 [1] NCCL INFO Ring 00 : 9[1] -> 11[3] via P2P/IPC
ip-172-31-0-112:8164:8225 [5] NCCL INFO Ring 00 : 5[5] -> 7[7] via P2P/IPC
ip-172-31-0-112:8165:8220 [6] NCCL INFO Ring 00 : 6[6] -> 4[4] via P2P/IPC
ip-172-31-13-129:8540:8602 [3] NCCL INFO Ring 00 : 11[3] -> 10[2] via P2P/IPC
ip-172-31-13-129:8544:8601 [7] NCCL INFO Ring 00 : 15 -> 0 [send] via NET/Socket/0
ip-172-31-0-112:8166:8224 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/Socket/0
ip-172-31-13-129:8542:8600 [5] NCCL INFO Ring 01 : 13[5] -> 15[7] via P2P/IPC
ip-172-31-13-129:8539:8603 [2] NCCL INFO Ring 01 : 10[2] -> 14[6] via P2P/IPC
ip-172-31-13-129:8543:8597 [6] NCCL INFO Ring 01 : 14[6] -> 12[4] via P2P/IPC
ip-172-31-13-129:8541:8598 [4] NCCL INFO Ring 01 : 12[4] -> 13[5] via P2P/IPC
ip-172-31-13-129:8538:8599 [1] NCCL INFO Ring 01 : 9[1] -> 11[3] via P2P/IPC
ip-172-31-13-129:8540:8602 [3] NCCL INFO Ring 01 : 11[3] -> 10[2] via P2P/IPC
ip-172-31-0-112:8160:8223 [1] NCCL INFO Ring 01 : 1[1] -> 3[3] via P2P/IPC
ip-172-31-0-112:8162:8221 [3] NCCL INFO Ring 01 : 3[3] -> 2[2] via P2P/IPC
ip-172-31-0-112:8161:8219 [2] NCCL INFO Ring 01 : 2[2] -> 6[6] via P2P/IPC
ip-172-31-0-112:8163:8222 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
ip-172-31-0-112:8164:8225 [5] NCCL INFO Ring 01 : 5[5] -> 7[7] via P2P/IPC
ip-172-31-0-112:8165:8220 [6] NCCL INFO Ring 01 : 6[6] -> 4[4] via P2P/IPC
ip-172-31-13-129:8537:8596 [0] NCCL INFO Ring 01 : 7 -> 8 [receive] via NET/Socket/0
ip-172-31-13-129:8537:8596 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-0-112:8159:8218 [0] NCCL INFO Ring 01 : 15 -> 0 [receive] via NET/Socket/0
ip-172-31-0-112:8159:8218 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-13-129:8537:8596 [0] NCCL INFO Ring 01 : 8[0] -> 9[1] via P2P/IPC
ip-172-31-0-112:8159:8218 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
ip-172-31-0-112:8162:8221 [3] NCCL INFO comm 0x7f70fc002350 rank 3 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
ip-172-31-13-129:8539:8603 [2] NCCL INFO comm 0x7f4c08002350 rank 10 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
ip-172-31-13-129:8543:8597 [6] NCCL INFO comm 0x7fc340002350 rank 14 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
ip-172-31-13-129:8541:8598 [4] NCCL INFO comm 0x7f050c002350 rank 12 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
ip-172-31-0-112:8161:8219 [2] NCCL INFO comm 0x7fa72c002350 rank 2 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
ip-172-31-0-112:8163:8222 [4] NCCL INFO comm 0x7f4820002350 rank 4 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
ip-172-31-13-129:8540:8602 [3] NCCL INFO comm 0x7f5b8c002350 rank 11 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
ip-172-31-0-112:8165:8220 [6] NCCL INFO comm 0x7fa238002350 rank 6 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
ip-172-31-0-112:8166:8224 [7] NCCL INFO Ring 01 : 7 -> 8 [send] via NET/Socket/0
ip-172-31-0-112:8164:8225 [5] NCCL INFO comm 0x7fe918002350 rank 5 nranks 16 cudaDev 5 nvmlDev 5 - Init COMPLETE
ip-172-31-13-129:8538:8599 [1] NCCL INFO comm 0x7fe140002350 rank 9 nranks 16 cudaDev 1 nvmlDev 1 - Init COMPLETE
ip-172-31-0-112:8160:8223 [1] NCCL INFO comm 0x7f6a4c002350 rank 1 nranks 16 cudaDev 1 nvmlDev 1 - Init COMPLETE
ip-172-31-13-129:8542:8600 [5] NCCL INFO comm 0x7f221c002350 rank 13 nranks 16 cudaDev 5 nvmlDev 5 - Init COMPLETE
ip-172-31-13-129:8544:8601 [7] NCCL INFO Ring 01 : 15 -> 0 [send] via NET/Socket/0
ip-172-31-13-129:8537:8596 [0] NCCL INFO comm 0x7fcbf8002350 rank 8 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
ip-172-31-0-112:8159:8218 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
ip-172-31-0-112:8166:8224 [7] NCCL INFO comm 0x7f5180002350 rank 7 nranks 16 cudaDev 7 nvmlDev 7 - Init COMPLETE
ip-172-31-0-112:8159:8218 [0] NCCL INFO comm 0x7f956c002350 rank 0 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
ip-172-31-13-129:8544:8601 [7] NCCL INFO comm 0x7fcb50002350 rank 15 nranks 16 cudaDev 7 nvmlDev 7 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
ip-172-31-0-112:8159:8159 [0] NCCL INFO Launch mode Parallel
8388608 2097152 float sum 2390.0 3.51 6.58 5e-07 2383.8 3.52 6.60 5e-07
16777216 4194304 float sum 3892.3 4.31 8.08 5e-07 3986.7 4.21 7.89 5e-07
33554432 8388608 float sum 7081.6 4.74 8.88 5e-07 7038.6 4.77 8.94 5e-07
67108864 16777216 float sum 18020 3.72 6.98 5e-07 18246 3.68 6.90 5e-07
134217728 33554432 float sum 41511 3.23 6.06 5e-07 56540 2.37 4.45 5e-07
268435456 67108864 float sum 73008 3.68 6.89 5e-07 64965 4.13 7.75 5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 7.16738
#
I am going to measure the PCI bandwidth for bidirectional and unidirectional cases and report later.
I am currently targeting TCP because it is a more general transport, but I find that for the case of two nodes with 2 GPUs, EFA does not perform well either. I will update the log later.
Hi, I tested the PCIe bandwidth, which can give 80 Gbps in the bidirectional scenario (with the scripts, tested on an AWS p3.2xlarge instance). So I don't think sending and receiving through the GPU's PCIe link is what limits the bus bandwidth to 4.7 GB/s.
Hi @sjeaugey
I just noticed that on AWS p3dn the PCIe bandwidth behaves differently compared to AWS p3.2xl instances.
If we use synchronous cudaMemcpy, both DtoH and HtoD saturate at ~50 Gbps, which seems close to what we observe.
But if we pin the memory with cudaMallocHost, then no matter whether we use cudaMemcpyAsync or cudaMemcpy, both DtoH and HtoD bandwidth are around ~90 Gbps. I checked the code, and I think NCCL uses cudaHostAlloc to get pinned memory for the network send/recv buffers.
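For context, here is a minimal sketch of the kind of pageable-vs-pinned measurement described above (not the exact script used; the buffer size, iteration count, and helper name are arbitrary choices, and error checking is omitted):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Time repeated cudaMemcpy calls and return the bandwidth in Gbit/s. */
static float copy_bw_gbps(void* dst, void* src, size_t bytes, enum cudaMemcpyKind kind) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaMemcpy(dst, src, bytes, kind);            /* warm-up copy */
  cudaEventRecord(start);
  for (int i = 0; i < 20; i++) cudaMemcpy(dst, src, bytes, kind);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);       /* elapsed time in milliseconds */
  return (20.0f * bytes * 8) / (ms * 1e6f);
}

int main(void) {
  size_t bytes = 256UL << 20;                   /* 256 MB per copy */
  void *d, *h_pinned, *h_pageable = malloc(bytes);
  cudaMalloc(&d, bytes);
  cudaMallocHost(&h_pinned, bytes);             /* pinned host memory, like cudaHostAlloc */
  printf("HtoD pageable: %.1f Gbps\n", copy_bw_gbps(d, h_pageable, bytes, cudaMemcpyHostToDevice));
  printf("DtoH pageable: %.1f Gbps\n", copy_bw_gbps(h_pageable, d, bytes, cudaMemcpyDeviceToHost));
  printf("HtoD pinned:   %.1f Gbps\n", copy_bw_gbps(d, h_pinned, bytes, cudaMemcpyHostToDevice));
  printf("DtoH pinned:   %.1f Gbps\n", copy_bw_gbps(h_pinned, d, bytes, cudaMemcpyDeviceToHost));
  return 0;
}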
Hi,
To isolate the network communication bottleneck, I modified the NCCL source code to skip network communications (by letting ncclSocketTest set *done=1; and return immediately).
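For reference, a minimal sketch of that short-circuit, assuming the 2.4-era ncclSocketTest(void* request, int* done, int* size) signature from net_socket.cc (the real function's request bookkeeping is omitted, so treat this as an illustration rather than a patch):

ncclResult_t ncclSocketTest(void* request, int* done, int* size) {
  /* Report every outstanding send/recv as complete without touching the
   * network, so the rest of the allreduce pipeline can be timed in isolation.
   * Note: this breaks data correctness and can make nccl-tests hang at the end. */
  *done = 1;
  return ncclSuccess;
}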
I got the following performance on two p3dn instances (each with 8 V100s):
| nranks | bandwidth (GB/s) |
|---|---|
| 2 | ~6.7 |
| 16 | ~10.8 |
Because I skipped the network operations, the speed for different all-reduce buffer sizes is almost the same. We can see that 16 ranks perform close to the max bandwidth over PCIe.
I suspected that nranks * loopSize is too small in the nranks==2 case, but increasing loopSize via NCCL_BUFFSIZE=32MB didn't improve the bandwidth.
@wfang Recently there has been insufficient capacity for p3dn instances in the region I can operate in, so I didn't get the bandwidth numbers for EFA. But I remember the bandwidth was around 2.3 GB/s with 2 ranks on 2 p3dn nodes. Correct me if I am wrong. Thanks!
I think there might be an inefficient pipeline issue when nranks==2.
The main code snippet inside ringAllreduceKernel (from here) is pasted below.
When nranks==2, the second block with the recvReduceSend operations and the fourth block with directRecvCopySend are skipped.
Thus only prims.send(...), prims.directRecvReduceCopySend(...), and prims.directRecv(...) are invoked, and only the second call, prims.directRecvReduceCopySend(...), does a postSend and a postRecv at the same time.
Thus, only that second function call uses the bidirectional bandwidth of PCIe.
(Based on my current understanding, the prims.send() and prims.directRecvReduceCopySend(...) operations are executed in a synchronous manner.)
So with nranks==2, 2/3 of the steps use only unidirectional bandwidth.
But if nranks>2, there is more bidirectional bandwidth usage due to prims.recvReduceSend(...) and prims.directRecvCopySend(...). Thus the larger nranks is, the better the bus utilization.
Hi @sjeaugey, could you please confirm/reject my thoughts?
/////////////// begin AllReduce steps ///////////////
ssize_t offset;
int nelem;
int slice;

// step 0: push data to next GPU
slice = ring->devUserRanks[nranks-1];
offset = chunkOffset + slice * realChunkSize;
nelem = min(realChunkSize, size-offset);
prims.send(thisInput+offset, nelem);

// k-2 steps: reduce and copy to next GPU
for (int j=2; j<nranks; ++j) {
  slice = ring->devUserRanks[nranks-j];
  offset = chunkOffset + slice * realChunkSize;
  nelem = min(realChunkSize, size-offset);
  // skipped when nranks == 2,
  // but if nranks > 2, more bidirectional operations (postSend, postRecv) are invoked
  prims.recvReduceSend(thisInput+offset, nelem);
}

// step k-1: reduce this buffer and data, which will produce the final
// result that we store in this data and push to the next GPU
slice = ring->devUserRanks[0];
offset = chunkOffset + slice * realChunkSize;
nelem = min(realChunkSize, size-offset);
prims.directRecvReduceCopySend(thisInput+offset, thisOutput+offset, offset, nelem);

// k-2 steps: copy to next GPU
for (int j=1; j<nranks-1; ++j) {
  slice = ring->devUserRanks[nranks-j];
  offset = chunkOffset + slice * realChunkSize;
  nelem = min(realChunkSize, size-offset);
  prims.directRecvCopySend(thisOutput+offset, offset, nelem);
}

// Make final copy from buffer to dest.
slice = ring->devUserRanks[1];
offset = chunkOffset + slice * realChunkSize;
nelem = min(realChunkSize, size-offset);

// Final wait/copy.
prims.directRecv(thisOutput+offset, offset, nelem);
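To make the nranks==2 schedule explicit, here is a comment-style summary derived from the snippet above (not additional NCCL code):

// nranks == 2: both for-loops run zero iterations, so each chunk executes
//   prims.send(...)                      -- send only (one PCIe direction)
//   prims.directRecvReduceCopySend(...)  -- recv + send (both directions)
//   prims.directRecv(...)                -- recv only (final local copy)
// nranks > 2 adds (nranks-2) recvReduceSend steps and (nranks-2)
// directRecvCopySend steps, each of which both receives and sends.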
Yes, that is correct, the 2 rank case has some inefficiency due to the extra copy step. It's not exactly 2/3 though, as the final copy (recv) is supposed to be local, so it should run at 10-20GB/s. And that's why I didn't think about that in your case since your bus bandwidth is much lower than that. Now I realize the bus BW difference isn't actually that large (around 75%) and it could indeed be enough to explain the difference.
Hi @sjeaugey, thanks for your confirmation!
When you mention the extra copy step, do you mean the copy step inside the directRecvReduceCopySend function?
But I think even in nranks>2 cases this function is still invoked, and more prims.directRecvCopySend calls are made. I suppose that copy latency would be well overlapped? (Say, time_of_copy_i would be hidden in time_of_copy/recv/send_{i+1}.)
BTW, I just want to confirm: does prims.send() indeed use only unidirectional bandwidth for sending?
No, I meant the last copy. It is always there, but in general its effect is negligible. We have 2*(nranks-1) normal copy steps, plus 1 last local copy step. With 16 GPUs, that last copy makes the algorithm 31 steps instead of 30, so at most it adds 1/30th of the time, which is +3%; or +1.5% considering that step is 2x faster than the other steps. With 2 GPUs, it's 3 steps instead of 2, so +50% of the time; or +25% at 2x speed.
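As a quick back-of-the-envelope check of those numbers (my own sketch, not NCCL code; the 0.5 factor encodes the assumption that the final local copy runs about 2x faster than a regular step):

#include <stdio.h>

int main(void) {
  for (int nranks = 2; nranks <= 16; nranks *= 2) {
    double regular_steps = 2.0 * (nranks - 1);  // normal send/recv steps per chunk
    double extra = 0.5;                         // final local copy at ~2x speed
    printf("nranks=%2d  extra-copy overhead ~ %4.1f%%\n",
           nranks, 100.0 * extra / regular_steps);
  }
  return 0;                                     // prints ~25% for 2 ranks, ~1.7% for 16
}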
That is interesting.
Actually, I have tried commenting out the last copy in the main logic, prims.directRecv(). With the socket speed made effectively infinite (by skipping the socket request processing), I can only achieve 7.3 GB/s bus bandwidth.
That's why I am suspecting a pipelining issue between prims.send() and prims.directRecvReduceCopySend(...).
Could you confirm whether the prims.send and prims.directRecvReduceCopySend functions actually overlap the send and receive over PCIe well? I saw there is a blocking barrier inside prims.send, but I am not so sure (I am also a newbie to CUDA programming). Thanks!
For large enough sizes, both ranks should call send() at a different offset, using the NIC in both directions, then both should call recvReduceCopySend(), again using the NIC in both directions (I'm ignoring "direct" as it is irrelevant here). And finally they should call recv(), which adds the extra step. I'm not sure how you can comment out the recv() call, as it is needed to absorb data from the FIFO. The send() and recvReduceCopySend() calls don't have to be overlapped; they are already running in both directions.
When both ranks call send(), how could rank 0 initiate the data-receiving operation on the NIC? I saw there is a FOR_RECV(postRecv) here, which checks the RECV flag, and the send() function doesn't pass the RECV flag, as here. That's why I thought the NIC is not going to receive data during the send() stage.
What I did is just comment out the last prims.directRecv(thisOutput+offset, offset, nelem); to skip the last copy.
The CPU proxy thread knows about the number of steps to perform, so it will initiate the receive even before the GPU arrives in the recv phase.
Commenting out the last copy should not work as it would not acknowledge data. It should cause a hang, at least after some time.
Yes, the program hangs at the end of nccl-tests. nccl-tests still outputs the bus bandwidth, so I didn't pay much attention to the hang.
The problem is that I'm not sure what we are actually measuring at this point.
To remove intra-node or inter-node communication, the best is to either comment out this line (for intra-node communication): https://github.com/NVIDIA/nccl/blob/master/src/collectives/device/primitives.h#L195 or force nbytes to 0 here to remove inter-node communication: https://github.com/NVIDIA/nccl/blob/master/src/collectives/device/primitives.h#L107
If you want to only remove the copy for the last recv() step, you can add an if (RECV == 1 && SEND == 0) before the call to ReduceOrCopyMulti.
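To make that last suggestion concrete, a rough sketch of what the guard could look like around the copy call in primitives.h (the surrounding code is paraphrased rather than copied, so names and arguments may differ from the actual 2.4 source):

// Inside the generic primitive, where the data movement happens:
if (RECV == 1 && SEND == 0) {
  // Pure-receive step (the final local copy): skip the copy itself for this
  // experiment, but keep the flag/step updates that follow so the proxy
  // thread's FIFO accounting still advances.
} else {
  ReduceOrCopyMulti(/* existing arguments unchanged */);
}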
We are measuring the bus bandwidth here.
The bus bandwidth we would expect is around 11 GB/s, due to the PCIe bandwidth limitation.
In the beginning, the socket communication between the two nodes was involved, and we got ~4.5 GB/s bus bandwidth for nranks==2 and ~6 GB/s for nranks==16.
To isolate the network communication bottleneck, we skipped the socket requests by letting the ncclSocketTest function return the done signal immediately, which gives us ~6.5 GB/s for nranks==2 and ~10.8 GB/s for nranks==16. That means there are other performance issues for nranks==2.
As discussed previously, I think we found one inefficiency for nranks==2: the last copy operation adds ~25% overhead, as you pointed out. But if we take ~25% as the overhead introduced by that last extra copy, the bus bandwidth should be bounded around ~8 GB/s, which is still higher than the observed ~6.5 GB/s, so there might be other components causing inefficiency in the nranks==2 case.
I know it might not be meaningful to stick with the nranks==2 case, but I feel it's the simplest case to start with for understanding NCCL and its performance.
Thanks for your suggestions! I will do some experiments with those code points.
Hi, I tested NCCL performance on two cloud servers (AWS p3dn instances, each equipped with 8 V100s and a 100 Gbps network) with nccl-tests. I found an interesting behavior of NCCL: if I launch 8 processes on each server, the bus bandwidth is better than when launching one process on each node. The outputs of nccl-tests are given in the logs above. (The NCCL version I am using is v2.4.8, with TCP sockets as the network transport, and I disabled tree allreduce by setting NCCL_TREE_THRESHOLD=0.) I am wondering why more processes result in better bus bandwidth?