NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

More ranks, better bus bandwidth? #398

Open zarzen opened 4 years ago

zarzen commented 4 years ago

Hi, I tested NCCL performance on two cloud servers (AWS p3dn instances, each equipped with 8 V100s and a 100Gbps network) with nccl-tests, and found an interesting behavior. If I launch 8 processes on each server, the bus bandwidth is better than when launching one process on each node. The outputs of nccl-tests are given below. (The NCCL version I am using is v2.4.8, with TCP sockets as the network transport, and I disabled tree allreduce by setting NCCL_TREE_THRESHOLD=0.) I am wondering why more processes result in better bus bandwidth?

# 8 processes on each node, 16 in total, the bus can sustain around 6GB/s

# nThread 1 nGpus 1 minBytes 262144 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 25 validation: 0 
#
# Using devices
#   Rank  0 Pid   7993 on ip-172-31-3-138 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid   7994 on ip-172-31-3-138 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid   7995 on ip-172-31-3-138 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid   7996 on ip-172-31-3-138 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid   7997 on ip-172-31-3-138 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid   7998 on ip-172-31-3-138 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid   7999 on ip-172-31-3-138 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid   8000 on ip-172-31-3-138 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid   7459 on ip-172-31-9-173 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid   7460 on ip-172-31-9-173 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid   7461 on ip-172-31-9-173 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid   7462 on ip-172-31-9-173 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid   7463 on ip-172-31-9-173 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid   7464 on ip-172-31-9-173 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid   7465 on ip-172-31-9-173 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid   7466 on ip-172-31-9-173 device  7 [0x00] Tesla V100-SXM2-32GB
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      262144         65536   float     sum    641.7    0.41    0.77    N/A    636.0    0.41    0.77    N/A
      524288        131072   float     sum    829.9    0.63    1.18    N/A    823.3    0.64    1.19    N/A
     1048576        262144   float     sum   1637.1    0.64    1.20    N/A   1635.9    0.64    1.20    N/A
     2097152        524288   float     sum   1228.1    1.71    3.20    N/A   1349.7    1.55    2.91    N/A
     4194304       1048576   float     sum   1633.6    2.57    4.81    N/A   1615.9    2.60    4.87    N/A
     8388608       2097152   float     sum   2529.2    3.32    6.22    N/A   2604.9    3.22    6.04    N/A
    16777216       4194304   float     sum   5108.2    3.28    6.16    N/A   4643.2    3.61    6.77    N/A
    33554432       8388608   float     sum   9092.7    3.69    6.92    N/A   9119.4    3.68    6.90    N/A
    67108864      16777216   float     sum    18320    3.66    6.87    N/A    18364    3.65    6.85    N/A
   134217728      33554432   float     sum    50593    2.65    4.97    N/A    41769    3.21    6.03    N/A
   268435456      67108864   float     sum    94181    2.85    5.34    N/A    85884    3.13    5.86    N/A
   536870912     134217728   float     sum   169820    3.16    5.93    N/A   173666    3.09    5.80    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.53219 
#
# 1 process on each node, 2 processes in total, can only saturate around 4.5GB/s

# nThread 1 nGpus 1 minBytes 262144 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 25 validation: 0 
#
# Using devices
#   Rank  0 Pid   4429 on ip-172-31-45-72 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid   4511 on ip-172-31-39-197 device  0 [0x00] Tesla V100-SXM2-32GB
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      262144         65536   float     sum    365.1    0.72    0.72    N/A    353.4    0.74    0.74    N/A
      524288        131072   float     sum    461.9    1.14    1.14    N/A    499.8    1.05    1.05    N/A
     1048576        262144   float     sum    676.5    1.55    1.55    N/A    637.4    1.64    1.64    N/A
     2097152        524288   float     sum    905.1    2.32    2.32    N/A    909.6    2.31    2.31    N/A
     4194304       1048576   float     sum   1587.5    2.64    2.64    N/A   1577.9    2.66    2.66    N/A
     8388608       2097152   float     sum   1953.7    4.29    4.29    N/A   1993.3    4.21    4.21    N/A
    16777216       4194304   float     sum   3867.6    4.34    4.34    N/A   3832.9    4.38    4.38    N/A
    33554432       8388608   float     sum   7547.1    4.45    4.45    N/A   7584.5    4.42    4.42    N/A
    67108864      16777216   float     sum    15011    4.47    4.47    N/A    14978    4.48    4.48    N/A
   134217728      33554432   float     sum    29772    4.51    4.51    N/A    29935    4.48    4.48    N/A
   268435456      67108864   float     sum    59497    4.51    4.51    N/A    59315    4.53    4.53    N/A
   536870912     134217728   float     sum   120031    4.47    4.47    N/A   121255    4.43    4.43    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.2804 
#
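
For reference, the busbw column that nccl-tests reports is derived from algbw; for all_reduce the factor is 2*(nranks-1)/nranks (as documented in the nccl-tests repository), which is why busbw equals algbw with 2 ranks but is 1.875x algbw with 16 ranks. A small sketch checking the numbers from the largest message size above:

// How nccl-tests derives the busbw column from algbw for all_reduce
// (busbw = algbw * 2*(n-1)/n, as documented in the nccl-tests repository).
#include <stdio.h>

double allreduce_busbw(double algbw_gbs, int nranks) {
  return algbw_gbs * 2.0 * (nranks - 1) / nranks;
}

int main(void) {
  // Numbers taken from the 536870912-byte rows of the two logs above.
  printf(" 2 ranks: algbw 4.47 GB/s -> busbw %.2f GB/s\n", allreduce_busbw(4.47, 2));
  printf("16 ranks: algbw 3.16 GB/s -> busbw %.2f GB/s\n", allreduce_busbw(3.16, 16));
  return 0;
}
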
sjeaugey commented 4 years ago

When using more GPUs, we also use more resources on the node: more PCI lanes, more CPU cores, etc. So it can be slower to have data go in and out through the same GPU, sent and received by the same CPU, than to have data go in through one GPU, traverse the node through NVLink (which is nowhere near being a bottleneck here), then exit the node through another GPU, using another CPU to send over the network.

Note that if the bottleneck is the CPU, increasing NCCL_SOCKET_NTHREADS should help.

zarzen commented 4 years ago

> When using more GPUs, we also use more resources on the node: more PCI lanes, more CPU cores, etc. So it can be slower to have data go in and out through the same GPU, sent and received by the same CPU, than to have data go in through one GPU, traverse the node through NVLink (which is nowhere near being a bottleneck here), then exit the node through another GPU, using another CPU to send over the network.
>
> Note that if the bottleneck is the CPU, increasing NCCL_SOCKET_NTHREADS should help.

But here, more GPUs actually results in better network bandwidth utilization, which seems counter-intuitive to me. As you said, with more GPUs we use more resources on the node, so the ranks would compete for resources and slow down data movement in and out of each GPU. So shouldn't 1 GPU on each node perform better than 8 GPUs on each node? (But it does not.)

sjeaugey commented 4 years ago

No, using more cores or more PCI buses can help lower the load on each. For example, each GPU will either receive from PCI and send through NVLink, or receive from NVLink and send through PCI, but never both receive from PCI and send through PCI, which can cause lower bandwidth.

In the precise example of TCP/IP traffic, with 2+ GPUs, you will use 2x more cores for network processing compared to using a single GPU as each GPU has its own set of network threads. This can make a significant performance difference if that's the bottleneck.

zarzen commented 4 years ago

Sorry I didn't fully get the point here.

For a ring structure like this ([image]): even though each GPU doesn't both receive from PCI and send through PCI, GPU0 and GPU1 still share the same PCIe path. Is there a big difference compared to sending and receiving through PCI on the same GPU?

For the second example, do you mean that with a single GPU on each node there are, by default, only two socket threads (in the AWS environment) serving both send and recv, while with 2+ GPUs on each node there will be two threads for GPU0 to send data and another two threads for GPU1 to receive data?

sjeaugey commented 4 years ago

I'm not sure why you say they share the same PCI. On the NIC side yes, they do. On the GPU side they don't. And sometimes (especially on 8 GPUs) GPU 0 and GPU 7 are on different CPU sockets so they use different CPU memory banks which also are a regular bottleneck. That can also mean splitting interrupts on different sockets. Now I'm not saying that's the reason for your performance difference, just that it's often the case on different platforms that using more GPUs means getting higher bandwidth.

For your problem, yes, running on 2+ GPUs might mean doubling the effective number of threads sending/receiving data. I'd need to look at the code again, and it might have changed since 2.4, but it could well be the case. Running top you should be able to see that easily.

zarzen commented 4 years ago

Setting NCCL_SOCKET_NTHREADS=4 or NCCL_SOCKET_NTHREADS=6 didn't seem to improve the bus bandwidth for 2 GPUs on two nodes. The following is the log when using NCCL_SOCKET_NTHREADS=6:

$ /opt/amazon/openmpi/bin/mpirun -np 2 -H ip-172-31-0-112:1,ip-172-31-13-129:1 \
>             -bind-to none -map-by slot \
>             -x PATH=/opt/amazon/openmpi/bin:$PATH \
>             -x NCCL_DEBUG=INFO \
>             -x NCCL_SOCKET_NTHREADS=6 \
>             -x NCCL_TREE_THRESHOLD=0 \
>             -x LD_LIBRARY_PATH=/home/ubuntu/nccl/build/lib:$LD_LIBRARY_PATH \
>             -mca btl ^openib \
>             -mca btl_tcp_if_exclude lo,docker0 \
>             /home/ubuntu/nccl-tests/build/all_reduce_perf -b 8M -e 256M -c 1 -f 2 -n 50
# nThread 1 nGpus 1 minBytes 8388608 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 50 validation: 1 
#
# Using devices
#   Rank  0 Pid   7846 on ip-172-31-0-112 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid   7998 on ip-172-31-13-129 device  0 [0x00] Tesla V100-SXM2-32GB
ip-172-31-0-112:7846:7846 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.112<0>

ip-172-31-0-112:7846:7846 [0] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-0-112:7846:7846 [0] NCCL INFO NET/IB : No device found.
ip-172-31-0-112:7846:7846 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.112<0>
NCCL version 2.4.8+cuda10.1
ip-172-31-13-129:7998:7998 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.13.129<0>

ip-172-31-13-129:7998:7998 [0] ofi_init:700 NCCL WARN NET/OFI Only EFA provider is supported
ip-172-31-13-129:7998:7998 [0] NCCL INFO NET/IB : No device found.
ip-172-31-13-129:7998:7998 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.13.129<0>
ip-172-31-0-112:7846:7870 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff
ip-172-31-0-112:7846:7870 [0] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-13-129:7998:8021 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff
ip-172-31-13-129:7998:8021 [0] NCCL INFO NCCL_TREE_THRESHOLD set by environment to 0.
ip-172-31-0-112:7846:7870 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
ip-172-31-13-129:7998:8021 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
ip-172-31-0-112:7846:7870 [0] NCCL INFO Channel 00 :    0   1
ip-172-31-0-112:7846:7870 [0] NCCL INFO Channel 01 :    0   1
ip-172-31-0-112:7846:7870 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO NCCL_SOCKET_NTHREADS set by environment to 6.
ip-172-31-0-112:7846:7870 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-13-129:7998:8021 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
ip-172-31-13-129:7998:8021 [0] NCCL INFO NCCL_SOCKET_NTHREADS set by environment to 6.
ip-172-31-13-129:7998:8021 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-13-129:7998:8021 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO Ring 01 : 1 -> 0 [receive] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-13-129:7998:8021 [0] NCCL INFO Ring 01 : 0 -> 1 [receive] via NET/Socket/0
ip-172-31-13-129:7998:8021 [0] NCCL INFO NET/Socket: Using 6 threads and 8 sockets per thread
ip-172-31-0-112:7846:7870 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
ip-172-31-13-129:7998:8021 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0
ip-172-31-0-112:7846:7870 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
ip-172-31-0-112:7846:7870 [0] NCCL INFO comm 0x7f4d70002350 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
ip-172-31-13-129:7998:8021 [0] NCCL INFO comm 0x7fb80c002350 rank 1 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
ip-172-31-0-112:7846:7846 [0] NCCL INFO Launch mode Parallel
     8388608       2097152   float     sum   1856.6    4.52    4.52  0e+00   1864.3    4.50    4.50  0e+00
    16777216       4194304   float     sum   3585.0    4.68    4.68  0e+00   3563.1    4.71    4.71  0e+00
    33554432       8388608   float     sum   6987.2    4.80    4.80  0e+00   6973.5    4.81    4.81  0e+00
    67108864      16777216   float     sum    13819    4.86    4.86  0e+00    13678    4.91    4.91  0e+00
   134217728      33554432   float     sum    27567    4.87    4.87  0e+00    27780    4.83    4.83  0e+00
   268435456      67108864   float     sum    56147    4.78    4.78  0e+00    55876    4.80    4.80  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.75571 
#
wfang commented 4 years ago

If you are after bandwidth, shouldn't you use the AWS EFA adapter by adding -x FI_PROVIDER="efa"?

On Wed, 7 Oct 2020 at 06:14, Zhang Zhen (zarzen) wrote, quoting the log above and adding a 16-GPU comparison:

For comparison, here is the log for 16 GPUs:

$ /opt/amazon/openmpi/bin/mpirun -np 16 -H ip-172-31-0-112:8,ip-172-31-13-129:8 \
>             -bind-to none -map-by slot \
>             -x PATH=/opt/amazon/openmpi/bin:$PATH \
>             -x NCCL_DEBUG=INFO \
>             -x NCCL_SOCKET_NTHREADS=6 \
>             -x NCCL_TREE_THRESHOLD=0 \
>             -x LD_LIBRARY_PATH=/home/ubuntu/nccl/build/lib:$LD_LIBRARY_PATH \
>             -mca btl ^openib \
>             -mca btl_tcp_if_exclude lo,docker0 \
>             /home/ubuntu/nccl-tests/build/all_reduce_perf -b 8M -e 256M -c 1 -f 2 -n 50
# nThread 1 nGpus 1 minBytes 8388608 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 50 validation: 1 
#
# Using devices
#   Rank  0 Pid   8159 on ip-172-31-0-112 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid   8160 on ip-172-31-0-112 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid   8161 on ip-172-31-0-112 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid   8162 on ip-172-31-0-112 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid   8163 on ip-172-31-0-112 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid   8164 on ip-172-31-0-112 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid   8165 on ip-172-31-0-112 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid   8166 on ip-172-31-0-112 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid   8537 on ip-172-31-13-129 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid   8538 on ip-172-31-13-129 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid   8539 on ip-172-31-13-129 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid   8540 on ip-172-31-13-129 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid   8541 on ip-172-31-13-129 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid   8542 on ip-172-31-13-129 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid   8543 on ip-172-31-13-129 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid   8544 on ip-172-31-13-129 device  7 [0x00] Tesla V100-SXM2-32GB
#
# (NCCL INFO init log: NET/Socket using 6 threads and 8 sockets per thread, intra-node rings via P2P/IPC, inter-node links via NET/Socket, Channel 00/01 order 0 1 3 2 6 4 5 7 8 9 11 10 14 12 13 15)
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
ip-172-31-0-112:8159:8159 [0] NCCL INFO Launch mode Parallel
     8388608       2097152   float     sum   2390.0    3.51    6.58  5e-07   2383.8    3.52    6.60  5e-07
    16777216       4194304   float     sum   3892.3    4.31    8.08  5e-07   3986.7    4.21    7.89  5e-07
    33554432       8388608   float     sum   7081.6    4.74    8.88  5e-07   7038.6    4.77    8.94  5e-07
    67108864      16777216   float     sum    18020    3.72    6.98  5e-07    18246    3.68    6.90  5e-07
   134217728      33554432   float     sum    41511    3.23    6.06  5e-07    56540    2.37    4.45  5e-07
   268435456      67108864   float     sum    73008    3.68    6.89  5e-07    64965    4.13    7.75  5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.16738 
#

I am going to measure the PCI bandwidth for bidirectional and unidirectional cases and report later.


zarzen commented 4 years ago

I am currently targeting TCP because it is a more general transport, but I find that for the two-nodes-with-2-GPUs case, EFA does not perform well either. I will update the log later.

zarzen commented 4 years ago

Hi, I tested the PCIe bandwidth, which can reach 80Gbps in the bidirectional scenario (with the scripts, tested on an AWS p3.2xlarge instance). So I don't think sending and receiving through the GPU's PCIe link is what limits the bus bandwidth down to 4.7GB/s.

zarzen commented 4 years ago

Hi @sjeaugey, I just noticed that on AWS p3dn the PCIe bandwidth differs from AWS p3.2xl instances. If we use synchronous cudaMemcpy, both DtoH and HtoD saturate at ~50Gbps, which seems close to what we observe.

But if we pin the memory with cudaMallocHost, then no matter whether we use cudaMemcpyAsync or cudaMemcpy, both DtoH and HtoD bandwidth are around ~90Gbps. I checked the code, and I think NCCL uses cudaHostAlloc to get pinned memory for the network send/recv buffers.
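
For reference, a minimal sketch of this pageable vs. pinned host-memory comparison using only standard CUDA runtime calls (this is not the thread's actual script; the 256MB buffer size and single-shot timing are arbitrary choices):

// Build with: nvcc -o pcie_bw pcie_bw.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Convert bytes copied in `ms` milliseconds to Gbit/s, to match the numbers above.
static double gbps(size_t bytes, float ms) { return bytes * 8.0 / (ms * 1e6); }

static void measure(const char* label, void* host, void* dev, size_t bytes) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float ms = 0.0f;

  cudaEventRecord(start);
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // HtoD
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms, start, stop);
  printf("%s HtoD: %6.1f Gbps\n", label, gbps(bytes, ms));

  cudaEventRecord(start);
  cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);   // DtoH
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms, start, stop);
  printf("%s DtoH: %6.1f Gbps\n", label, gbps(bytes, ms));

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}

int main(void) {
  const size_t bytes = 256UL << 20;  // 256 MB
  void *dev, *pageable, *pinned;
  cudaMalloc(&dev, bytes);
  pageable = malloc(bytes);          // regular pageable host memory
  cudaMallocHost(&pinned, bytes);    // pinned host memory, as NCCL obtains via cudaHostAlloc

  measure("pageable", pageable, dev, bytes);
  measure("pinned  ", pinned, dev, bytes);

  cudaFreeHost(pinned);
  free(pageable);
  cudaFree(dev);
  return 0;
}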

zarzen commented 4 years ago

Hi

To isolate the network communication bottleneck, I modified the NCCL source code to skip network communication (by letting ncclSocketTest set *done=1 and return immediately). I got the following performance on two p3dn instances (each with 8 V100s):

nranks    bandwidth (GB/s)
2         ~6.7
16        ~10.8

Because I skipped the network operations, the speed for different all-reduce buffer sizes is almost the same. We can see that 16 ranks perform close to the maximum bandwidth over PCIe.

I suspected that nranks * loopSize is too small in the nranks==2 case, but merely increasing loopSize by setting NCCL_BUFFSIZE=32MB didn't improve the bandwidth.
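
For reference, the skip-network modification described above amounts to something like the sketch below. It assumes the v2.4-era socket transport in src/transport/net_socket.cc (which already includes the needed headers) and the net-transport test() signature test(request, done, size); it is an experiment-only hack, not a working transport.

// Experiment-only sketch of the change described above (not a working transport).
// The proxy thread polls this function until *done is set; reporting every
// request as complete immediately removes the TCP transfer time from the measurement.
ncclResult_t ncclSocketTest(void* request, int* done, int* size) {
  (void)request;        // ignore the pending send/recv request entirely
  *done = 1;            // pretend the transfer already finished
  if (size) *size = 0;  // the size is not meaningful in this experiment
  return ncclSuccess;
}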

@wfang recently there has been insufficient capacity for p3dn instances in the region I can operate in, so I didn't get the bandwidth numbers for EFA. But I remember the bandwidth was around 2.3GB/s with 2 ranks on 2 p3dn nodes. Correct me if I am wrong. Thanks!

zarzen commented 4 years ago

I think there might be an inefficient pipeline issue when nranks==2.

The main code snippets inside ringAllreduceKernel (from here) are pasted below.

When nranks==2, the second block (the recvReduceSend loop) and the fourth block (the directRecvCopySend loop) are skipped. Thus only prims.send(...), prims.directRecvReduceCopySend(...), and prims.directRecv(...) are invoked, and only the second call, prims.directRecvReduceCopySend(...), performs postSend and postRecv at the same time. So only that second call uses the bidirectional bandwidth of PCIe (based on my current understanding, the prims.send() and prims.directRecvReduceCopySend(...) operations execute synchronously). Thus, when nranks==2, 2/3 of the steps use only a single direction of the bandwidth. But if nranks>2, there is more bidirectional bandwidth usage due to prims.recvReduceSend(...) and prims.directRecvCopySend(...), so the larger nranks is, the better the bus utilization. Hi @sjeaugey, could you please confirm or reject my thoughts?

    /////////////// begin AllReduce steps ///////////////
    ssize_t offset;
    int nelem;
    int slice;

    // step 0: push data to next GPU
    slice = ring->devUserRanks[nranks-1];
    offset = chunkOffset + slice * realChunkSize;
    nelem = min(realChunkSize, size-offset);

    prims.send(thisInput+offset, nelem);

    // k-2 steps: reduce and copy to next GPU
    for (int j=2; j<nranks; ++j) {
      slice = ring->devUserRanks[nranks-j];
      offset = chunkOffset + slice * realChunkSize;
      nelem = min(realChunkSize, size-offset);
      // skipped when nranks == 2, 
      // but if nranks > 2, more bidirectional operations (postSend, postRecv) are invoked
      prims.recvReduceSend(thisInput+offset, nelem);
    }

    // step k-1: reduce this buffer and data, which will produce the final
    // result that we store in this data and push to the next GPU
    slice = ring->devUserRanks[0];
    offset = chunkOffset + slice * realChunkSize;
    nelem = min(realChunkSize, size-offset);

    prims.directRecvReduceCopySend(thisInput+offset, thisOutput+offset, offset, nelem);

    // k-2 steps: copy to next GPU
    for (int j=1; j<nranks-1; ++j) {
      slice = ring->devUserRanks[nranks-j];
      offset = chunkOffset + slice * realChunkSize;
      nelem = min(realChunkSize, size-offset);

      prims.directRecvCopySend(thisOutput+offset, offset, nelem);
    }

    // Make final copy from buffer to dest.
    slice = ring->devUserRanks[1];
    offset = chunkOffset + slice * realChunkSize;
    nelem = min(realChunkSize, size-offset);

    // Final wait/copy.
    prims.directRecv(thisOutput+offset, offset, nelem);
sjeaugey commented 4 years ago

Yes, that is correct, the 2 rank case has some inefficiency due to the extra copy step. It's not exactly 2/3 though, as the final copy (recv) is supposed to be local, so it should run at 10-20GB/s. And that's why I didn't think about that in your case since your bus bandwidth is much lower than that. Now I realize the bus BW difference isn't actually that large (around 75%) and it could indeed be enough to explain the difference.

zarzen commented 4 years ago

Hi @sjeaugey, thanks for your confirmation! When you mention the extra copy step, do you mean the copy step inside the directRecvReduceCopySend function? I think even in nranks>2 cases this function is still invoked, and more prims.directRecvCopySend calls are made; I suppose those copy latencies would be well overlapped? (say, time_of_copy_i would be hidden in time_of_copy/recv/send_{i+1})

BTW, I just want to confirm: does prims.send() indeed use only a single direction of the bandwidth for sending?

sjeaugey commented 4 years ago

No, I meant the last copy. It is always there, but in general its effect is negligible. We have 2*(nranks-1) normal copy steps, plus 1 last local copy step. With 16 GPUs, that last copy makes the total algorithm 31 steps instead of 30, so at most it adds 1/30th of the time, which is +3%; or 1.5% considering that step is 2x faster than the other steps. With 2 GPUs, it's 3 steps instead of 2, so +50% of the time; or +25% at 2x speed.
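
A quick back-of-the-envelope check of this estimate (the 2x-faster final copy is the assumption stated above):

#include <stdio.h>

int main(void) {
  int nranks_list[] = {2, 16};
  for (int i = 0; i < 2; i++) {
    int n = nranks_list[i];
    double normal_steps = 2.0 * (n - 1);   // reduce-scatter + all-gather steps
    double extra_copy   = 0.5;             // final local copy at ~2x speed
    printf("nranks=%2d: extra-copy overhead ~ +%.0f%% of the total time\n",
           n, 100.0 * extra_copy / normal_steps);
  }
  return 0;
}
// Prints roughly +25% for nranks=2 and +2% for nranks=16, in line with the
// +25% and ~1.5% figures above.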

zarzen commented 4 years ago

That is interesting. Actually, I have tried commenting out the last copy in the main logic, prims.directRecv(); with the socket speed made effectively infinite by skipping the socket request processing, I can still only achieve 7.3GB/s bus bandwidth. That's why I suspect a pipelining issue between prims.send() and prims.directRecvReduceCopySend(...). Could you confirm whether prims.send and prims.directRecvReduceCopySend actually overlap the send and receive over PCIe well? I ask because I saw a blocking barrier inside prims.send, but I am not so sure (I am also a newbie to CUDA programming). Thanks!

sjeaugey commented 4 years ago

For large enough sizes, both ranks should call send() at a different offset, using the NIC in both directions; then both should call recvReduceCopySend(), again using the NIC in both directions (I'm ignoring "direct" as it is irrelevant here). And finally they should call recv(), which adds the extra step. I'm not sure how you can comment out the recv() call, as it is needed to absorb data from the FIFO. The send() and recvReduceCopySend() calls don't have to be overlapped; they already run in both directions.

zarzen commented 4 years ago

When both ranks call send(), how could rank 0 initiate the data-receiving operation on the NIC? I saw there is a FOR_RECV(postRecv) here, which checks the RECV flag, and the send() function doesn't set the RECV flag, as here. That's why I thought the NIC would not receive data during the send() stage.

What I did is just comment out the last prims.directRecv(thisOutput+offset, offset, nelem); to skip the last copy.

sjeaugey commented 4 years ago

The CPU proxy thread knows about the number of steps to perform, so it will initiate the receive even before the GPU arrives in the recv phase.

Commenting out the last copy should not work as it would not acknowledge data. It should cause a hang, at least after some time.

zarzen commented 4 years ago

Yes, the program does hang at the end of nccl-tests. But nccl-tests still outputs the bus bandwidth, so I didn't pay much attention to the hanging issue.

sjeaugey commented 4 years ago

The problem is that I'm not sure what we are actually measuring at this point.

To remove intra-node or inter-node communication, the best is to either:
comment out this line (for intra-node communication): https://github.com/NVIDIA/nccl/blob/master/src/collectives/device/primitives.h#L195
or force nbytes to 0 here (for inter-node communication): https://github.com/NVIDIA/nccl/blob/master/src/collectives/device/primitives.h#L107

If you want to only remove the copy for the last recv() step, you can add an if (RECV == 1 && SEND == 0) before the call to ReduceOrCopyMulti.

zarzen commented 4 years ago

We are measuring the bus bandwidth here. The bus bandwidth we would expect is around 11GB/s, due to the PCIe bandwidth limitation. Initially, with the socket communication between the two nodes involved, we got ~4.5GB/s bus bandwidth for nranks==2 and ~6GB/s for nranks==16. To isolate the network-communication bottleneck, we skipped the socket requests by letting the ncclSocketTest function return the done signal immediately, which gives us ~6.5GB/s for nranks==2 and ~10.8GB/s for nranks==16. This means there are other performance issues in the nranks==2 case. As discussed previously, we found one inefficiency when nranks==2: the last copy operation adds ~25% overhead, as you pointed out. But if we take that ~25% as the overhead introduced by the extra last copy, the bus bandwidth should still be bounded around ~8GB/s, which is larger than the ~6.5GB/s we measure, so there might be other components causing inefficiency in the nranks==2 case. I know it might not be meaningful to stick with the nranks==2 case, but I feel it's the simplest case to start from for understanding NCCL and its performance.
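
For completeness, the arithmetic behind that bound (the ~11GB/s PCIe ceiling and the ~25% extra time from the final copy are the assumptions stated above):

#include <stdio.h>

int main(void) {
  double pcie_ceiling = 11.0;               // GB/s, approximate PCIe limit quoted above
  double time_factor  = (2.0 + 0.5) / 2.0;  // 2 normal steps + a half-cost final copy
  printf("expected nranks==2 bound: ~%.1f GB/s (measured with network skipped: ~6.5 GB/s)\n",
         pcie_ceiling / time_factor);
  return 0;
}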

Thanks for your suggestions! I will do some experiments with those code points.

HeGaoYuan commented 11 months ago

watching