NVIDIA / nccl-tests

NCCL Tests

nccl-test is throwing timeout error on two nodes #179

Open manomugdha opened 9 months ago

manomugdha commented 9 months ago

I have two nodes, manolinux (10.39.43.133) and manolinux1 (10.39.42.196), each with two GPUs.

I am running the following command on manolinux:

mpirun --allow-run-as-root --np 2  -H 10.39.43.133,10.39.42.196 -x NCCL_DEBUG=INFO  -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eno1,enp1s0  ./build/all_reduce_perf -b 32 -e 64 -f 2 -g 2

manolinux: cudaDriverVersion 12020 NVIDIA-SMI 535.129.03

manolinux1: cudaDriverVersion 12020 NVIDIA-SMI 535.129.03

The following is the log of this test run:

# nThread 1 nGpus 2 minBytes 1024 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  11811 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid  11811 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid   3803 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid   3803 on manolinux1 device  1 [0x28] Quadro K620
#
# Reducing maxBytes to 339585706 due to memory limitation
manolinux:11811:11811 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:11811:11811 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:11811:11811 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:3803:3803 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:3803:3803 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:3803:3803 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:11811:11821 [1] NCCL INFO NET/IB : No device found.
manolinux:11811:11821 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:11811:11821 [1] NCCL INFO Using non-device net plugin version 0
manolinux:11811:11821 [1] NCCL INFO Using network Socket
manolinux:11811:11820 [0] NCCL INFO Using non-device net plugin version 0
manolinux:11811:11820 [0] NCCL INFO Using network Socket
manolinux1:3803:3811 [0] NCCL INFO NET/IB : No device found.
manolinux1:3803:3811 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:3803:3811 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:3803:3811 [0] NCCL INFO Using network Socket
manolinux1:3803:3812 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:3803:3812 [1] NCCL INFO Using network Socket
manolinux:11811:11821 [1] NCCL INFO comm 0x55762a757750 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbab4e59cc30d25a1 - Init START
manolinux:11811:11820 [0] NCCL INFO comm 0x55762884db40 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbab4e59cc30d25a1 - Init START
manolinux1:3803:3812 [1] NCCL INFO comm 0x55719b3317a0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbab4e59cc30d25a1 - Init START
manolinux1:3803:3811 [0] NCCL INFO comm 0x55719b08c280 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbab4e59cc30d25a1 - Init START
manolinux:11811:11821 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:11811:11821 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:11811:11820 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:11811:11820 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:11811:11820 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:11811:11820 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:3803:3811 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:3803:3812 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:3803:3811 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:3803:3812 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:11811:11820 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:11811:11820 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:11811:11820 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:11811:11820 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:11811:11821 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:11811:11821 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:3803:3811 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:3803:3811 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:3803:3811 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:3803:3811 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:3803:3812 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:3803:3812 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:11811:11820 [0] NCCL INFO Connected all rings
manolinux:11811:11820 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:11811:11820 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:11811:11820 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:11811:11820 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:3803:3812 [1] NCCL INFO Connected all rings
manolinux:11811:11821 [1] NCCL INFO Connected all rings
manolinux1:3803:3811 [0] NCCL INFO Connected all rings
manolinux:11811:11821 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:3803:3812 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:11811:11821 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:3803:3812 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:3803:3811 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:3803:3811 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:3803:3811 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:3803:3811 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:11811:11821 [1] NCCL INFO Connected all trees
manolinux:11811:11820 [0] NCCL INFO Connected all trees
manolinux:11811:11821 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:11811:11821 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:11811:11820 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:11811:11820 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:3803:3812 [1] NCCL INFO Connected all trees
manolinux1:3803:3812 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:3803:3812 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:3803:3811 [0] NCCL INFO Connected all trees
manolinux1:3803:3811 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:3803:3811 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:11811:11820 [0] NCCL INFO comm 0x55762884db40 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbab4e59cc30d25a1 - Init COMPLETE
manolinux:11811:11821 [1] NCCL INFO comm 0x55762a757750 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbab4e59cc30d25a1 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
manolinux1:3803:3811 [0] NCCL INFO comm 0x55719b08c280 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbab4e59cc30d25a1 - Init COMPLETE
manolinux1:3803:3812 [1] NCCL INFO comm 0x55719b3317a0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbab4e59cc30d25a1 - Init COMPLETE
manolinux1: Test CUDA failure common.cu:291 'the launch timed out and was terminated'
 .. manolinux1 pid 3803: Test failure common.cu:401
 .. manolinux1 pid 3803: Test failure common.cu:588
 .. manolinux1 pid 3803: Test failure all_reduce.cu:90
 .. manolinux1 pid 3803: Test failure common.cu:615
 .. manolinux1 pid 3803: Test failure common.cu:1019
 .. manolinux1 pid 3803: Test failure common.cu:844

manolinux:11811:11824 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer manolinux1.ccu.is.keysight.com<36798>
manolinux:11811:11824 [0] NCCL INFO misc/socket.cc:750 -> 6
manolinux:11811:11824 [0] NCCL INFO transport/net_socket.cc:473 -> 6
manolinux:11811:11824 [0] NCCL INFO transport/net.cc:1245 -> 6
manolinux:11811:11824 [0] NCCL INFO proxy.cc:692 -> 6
manolinux:11811:11824 [0] NCCL INFO proxy.cc:872 -> 6 [Progress Thread]

What is the reason for 'the launch timed out and was terminated'?

If I run the above command with -g 1, it runs but reports the following error:

mpirun --allow-run-as-root --np 2  -H 10.39.43.133,10.39.42.196 -x NCCL_DEBUG=INFO  -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eno1,enp1s0  ./build/all_reduce_perf -b 32 -e 64 -f 2 -g 1

logs:

# nThread 1 nGpus 1 minBytes 32 maxBytes 64 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  12171 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid   4449 on manolinux1 device  0 [0x0f] Quadro K620
manolinux:12171:12171 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1,enp1s0
manolinux:12171:12171 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:12171:12171 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:12171:12171 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:4449:4449 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:4449:4449 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1,enp1s0
manolinux1:4449:4449 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:4449:4449 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:12171:12178 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1,enp1s0
manolinux:12171:12178 [0] NCCL INFO NET/IB : No device found.
manolinux:12171:12178 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1,enp1s0
manolinux:12171:12178 [0] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:12171:12178 [0] NCCL INFO Using non-device net plugin version 0
manolinux:12171:12178 [0] NCCL INFO Using network Socket
manolinux1:4449:4455 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1,enp1s0
manolinux1:4449:4455 [0] NCCL INFO NET/IB : No device found.
manolinux1:4449:4455 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1,enp1s0
manolinux1:4449:4455 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:4449:4455 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:4449:4455 [0] NCCL INFO Using network Socket
manolinux:12171:12178 [0] NCCL INFO comm 0x55e8dd240cf0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x87070138c314c93 - Init START
manolinux1:4449:4455 [0] NCCL INFO comm 0x55836de68c10 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId f000 commId 0x87070138c314c93 - Init START
manolinux:12171:12178 [0] NCCL INFO Channel 00/02 :    0   1
manolinux:12171:12178 [0] NCCL INFO Channel 01/02 :    0   1
manolinux:12171:12178 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
manolinux:12171:12178 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:4449:4455 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
manolinux1:4449:4455 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:12171:12178 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:4449:4455 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/Socket/0
manolinux1:4449:4455 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/Socket/0
manolinux1:4449:4455 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/Socket/0
manolinux1:4449:4455 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/Socket/0
manolinux:12171:12178 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
manolinux:12171:12178 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
manolinux:12171:12178 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/0
manolinux:12171:12178 [0] NCCL INFO Connected all rings
manolinux:12171:12178 [0] NCCL INFO Connected all trees
manolinux:12171:12178 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux:12171:12178 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:4449:4455 [0] NCCL INFO Connected all rings
manolinux1:4449:4455 [0] NCCL INFO Connected all trees
manolinux1:4449:4455 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux1:4449:4455 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:12171:12178 [0] NCCL INFO comm 0x55e8dd240cf0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x87070138c314c93 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
manolinux1:4449:4455 [0] NCCL INFO comm 0x55836de68c10 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId f000 commId 0x87070138c314c93 - Init COMPLETE
          32             8     float     sum      -1    196.1    0.00    0.00      8    200.4    0.00    0.00      8
          64            16     float     sum      -1    201.6    0.00    0.00     16    199.3    0.00    0.00     16
manolinux:12171:12171 [0] NCCL INFO comm 0x55e8dd240cf0 rank 0 nranks 2 cudaDev 0 busId 1000 - Destroy COMPLETE
# Out of bounds values : 4 FAILED
# Avg bus bandwidth    : 0.000240338
#
manolinux1:4449:4449 [0] NCCL INFO comm 0x55836de68c10 rank 1 nranks 2 cudaDev 0 busId f000 - Destroy COMPLETE

It seems the run completed, but why is it reporting out-of-bounds values?

AddyLaddy commented 9 months ago

I'd start by running some GPU sanity tests like nvbandwidth: https://github.com/NVIDIA/nvbandwidth

sjeaugey commented 9 months ago

Quadro K620 GPUs are Maxwell generation. You need to recompile the NCCL perf tests adding sm_50 support, otherwise the data verification may fail. The RTX A2000 is sm_86, and the Quadro P2200 is sm_61.

So I would advise re-compiling NCCL and the NCCL perf tests from scratch, setting NVCC_GENCODE="-gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86". That way both NCCL and the NCCL perf tests would have support for all 3 types of GPUs.
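
A minimal sketch of how that NVCC_GENCODE value can be assembled from a list of compute capabilities. The helper name gencode_flags is hypothetical (it is not part of NCCL or nccl-tests), and the three capability numbers are the ones named in the comment above.

```shell
#!/bin/sh
# Hypothetical helper: build an NVCC_GENCODE string from compute capabilities.
# 50 = Quadro K620 (Maxwell), 61 = Quadro P2200, 86 = RTX A2000.
gencode_flags() {
    flags=""
    for cc in "$@"; do
        # Append one -gencode entry per capability, space-separated.
        flags="${flags:+$flags }-gencode=arch=compute_${cc},code=sm_${cc}"
    done
    printf '%s\n' "$flags"
}

NVCC_GENCODE="$(gencode_flags 50 61 86)"
echo "NVCC_GENCODE=\"$NVCC_GENCODE\""
```

The resulting string would then be passed as NVCC_GENCODE="..." on the make command line of both the NCCL build and the nccl-tests build, after a make clean in each tree.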

manomugdha commented 9 months ago

Hi @sjeaugey, thank you for your reply. I compiled both nccl and nccl-tests without specifying any arch and expected that it would include all archs. Now from the log I see it does include arch-86. I have now compiled nccl (after make clean) with the following command, and it went well.

make -j 8 src.build CUDA_HOME=/usr/ NVCC_GENCODE="-gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86"

Next I compiled nccl-tests (after make clean) with the following command:

make MPI=1 NCCL_HOME=/home/mbiswas/ai/pytorch/nccl/build MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/ NVCC_GENCODE="-gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86"

and I am seeing the following warning:

nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt (target: sm_60)

My NCCL version is 2.19.3; my current MPI version is mpirun (Open MPI) 4.1.2.

I installed MPI using the following command:

sudo apt-get install openmpi-bin openmpi-doc libopenmpi-dev

How do I get the correct MPI version for NCCL 2.19.3?

manomugdha commented 9 months ago

> I'd start by running some GPU sanity tests like nvbandwidth: https://github.com/NVIDIA/nvbandwidth

ok, will check that.

sjeaugey commented 9 months ago

Is there an error when recompiling the NCCL tests? Not sure why the warning happens and why it's mentioning sm_60 .. did you run make clean before recompiling?

manomugdha commented 9 months ago

No error, only warnings. It shows the warning for all 3 archs.

Linking  /home/mbiswas/ai/pytorch/nccl-tests/build/scatter.o > /home/mbiswas/ai/pytorch/nccl-tests/build/scatter_perf
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt (target: sm_50)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt (target: sm_61)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt (target: sm_86)

yes, i ran make clean before recompiling. it seems that mpi version is not compatible.

sjeaugey commented 9 months ago

Sorry but I don't see any error in the logs you reported. Are you unable to run the tests? You are mentioning that there is a problem with MPI but I don't see any, only that you could recompile everything just fine.

manomugdha commented 9 months ago

Hi @sjeaugey, I mentioned two issues at the beginning of this thread.

  1. Why is it reporting "# Out of bounds values : 4 FAILED" when -g is set to 1 for a cluster of 2 nodes?
  2. What is the reason for 'the launch timed out and was terminated' when -g is set to 2 for a cluster of 2 nodes?

Can you please shed some light on these?

sjeaugey commented 9 months ago

So are you saying that even after recompiling everything, you are still seeing the same exact problems (out of bound values and launch timeout) under the same conditions?

It isn't clear from your comments, for example this one:

> it seems that mpi version is not compatible.

manomugdha commented 9 months ago

Yes, even after recompiling I am seeing these two issues. During compilation of nccl-tests I am also seeing this warning (nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt). What I wanted to ask is: are these issues happening because of this warning? If so, how do I get rid of it?

sjeaugey commented 9 months ago

> what I wanted to say is that are these issues happening because of this warning?

Very likely not.

> how to get rid of this warning?

I don't know. I don't remember having seen that error myself.

> even after recompiling i am seeing these two issues

That's quite surprising. In particular, data being reported as incorrect is typical of the verification code not being compiled for sm_50, which is not done by default on CUDA 11. We've had lots of similar reports on Kepler/Maxwell architectures, which were solved by making sure those kernels were recompiled with sm_50.
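
One way to check whether a rebuilt test binary really contains sm_50 code is to list its embedded GPU ELF images. The sketch below assumes the CUDA toolkit's cuobjdump is on PATH and that each listed image name carries its sm_XX target; the helper name list_sm_targets is hypothetical. If the binary only embeds PTX for an architecture, nothing will show up for it here.

```shell
#!/bin/sh
# Hypothetical helper: read `cuobjdump --list-elf` style output on stdin
# and print each distinct sm_XX target once.
list_sm_targets() {
    grep -o 'sm_[0-9][0-9]*' | sort -u
}

# Only runs when cuobjdump and the binary are actually present.
if command -v cuobjdump >/dev/null 2>&1 && [ -f ./build/all_reduce_perf ]; then
    cuobjdump --list-elf ./build/all_reduce_perf | list_sm_targets
fi
```

For the build discussed in this thread, the expectation would be to see sm_50, sm_61 and sm_86 listed, on both nodes.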

manomugdha commented 9 months ago

this is CUDA Version: 12.2

manomugdha commented 9 months ago

The following command runs fine on each node.

mpirun -np 1 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2

But when the np value is set to 2, it waits for some time and then throws the following error:

Test CUDA failure common.cu:291 'the launch timed out and was terminated'

sjeaugey commented 9 months ago

> this is CUDA Version: 12.2

That's weird. This is from your log:

NCCL version 2.19.3+cuda11.5

Note the "cuda11.5" in the NCCL version.

> the launch timed out and was terminated

I've never seen that before, but it is a CUDA error and I don't see how that could be related to launching on multiple nodes.

> following command runs on each node fine.

Did you launch that command from each node? That would be the exact same test. To launch 2 GPUs on the second node, you'd need to run:

mpirun -np 1 -H 10.39.42.196 -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2

manomugdha commented 9 months ago

The test runs fine on each individual node with 2 GPUs.
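
For reference, a dry-run sketch of the per-node check discussed above. With -np 1, mpirun starts a single process, so listing both hosts after -H presumably still runs the test on only one of them; each host needs its own single-host invocation. The helper name per_host_cmds is hypothetical, and it only prints the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical helper: print one single-node mpirun command per host.
# Nothing is executed; copy the printed lines to run them.
per_host_cmds() {
    for host in "$@"; do
        printf 'mpirun -np 1 -H %s -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2\n' "$host"
    done
}

per_host_cmds 10.39.43.133 10.39.42.196
```

Each printed command exercises the two GPUs of one node only, so a failure in one of them would point at that node rather than at the network path between the nodes.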

manolinux:11811:11811 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5

Can a mismatch between the CUDA driver version and the CUDA version cause this timeout error? Output of nvidia-smi:

mbiswas@manolinux:~/ai/pytorch/nccl-tests$ nvidia-smi 
Tue Nov 21 18:25:56 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A2000               Off | 00000000:01:00.0 Off |                  Off |
| 30%   34C    P8              11W /  70W |     19MiB /  6138MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro P2200                   Off | 00000000:03:00.0 Off |                  N/A |
| 45%   26C    P8               4W /  75W |      6MiB /  5120MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1329      G   /usr/lib/xorg/Xorg                           10MiB |
|    0   N/A  N/A      1490      G   /usr/bin/gnome-shell                          3MiB |
|    1   N/A  N/A      1329      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
mbiswas@manolinux:~/ai/pytorch/nccl-tests$ 

When I tried to run the test across 2 nodes, I ran the command from a single node.

I see that a few people have faced this timeout issue (on GitHub), but their solutions are not working for me.

sjeaugey commented 9 months ago

> mismatch in cuda driver version and cuda version can cause this time out error?

I'm not sure, but you mentioned you were using CUDA 12.2 and yet the NCCL version which is used mentions 11.5 so maybe you're not using the right NCCL library if multiple versions are installed on your system? Or is it that you recompiled NCCL with CUDA 11.5 but your driver is 12.2?
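
The two version strings in the log can be put side by side with a small sketch like the one below. It assumes that cudaDriverVersion uses the usual CUDA integer encoding (1000 * major + 10 * minor) and that the NCCL banner has the form "NCCL version X.Y.Z+cudaA.B" as in the logs above; both helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a cudaDriverVersion integer such as 12020
# (1000 * major + 10 * minor) into "major.minor".
driver_cuda_version() {
    v=$1
    echo "$(( $v / 1000 )).$(( ($v % 1000) / 10 ))"
}

# Hypothetical helper: read log lines on stdin and print the CUDA version
# that NCCL reports it was built with.
nccl_build_cuda_version() {
    sed -n 's/^NCCL version [0-9.]*+cuda\([0-9.]*\).*$/\1/p'
}

driver="$(driver_cuda_version 12020)"
built="$(echo 'NCCL version 2.19.3+cuda11.5' | nccl_build_cuda_version)"
echo "driver supports CUDA $driver, NCCL was built with CUDA $built"
```

A driver that is newer than the toolkit used for the build is normally a supported combination, so a difference here is a hint about which library and toolkit are being picked up rather than proof of a problem.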

> test is running fine on individual node with 2 GPUs.

That's what I'm not sure about. I would expect the CUDA launch timeout error to also happen when running on a single node. It would be helpful to share the output of both single node runs: the one with the RTX A2000 + P2200 and the one with 2 K620. There may be hints about what is happening when we run on 2 + 2.

Other than that, it could be useful to try to stop Xorg on both systems, as we've seen it interact with CUDA in the past.
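
A small sketch for seeing which GPUs currently have Xorg attached, based on the "Processes" table printed by plain nvidia-smi (as in the output pasted earlier in this thread). The helper name gpus_used_by_xorg is hypothetical. As background, and as an assumption about this particular case, one common trigger for a CUDA launch timeout is the kernel run-time limit that applies to GPUs driving a display.

```shell
#!/bin/sh
# Hypothetical helper: read plain `nvidia-smi` output on stdin and print the
# index of every GPU that has an Xorg process in the "Processes" table.
gpus_used_by_xorg() {
    awk '/Xorg/ && $1 == "|" { print $2 }' | sort -u
}

# Only runs when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi | gpus_used_by_xorg
fi
```

On a systemd-based desktop, switching to a text-only target (for example with sudo systemctl isolate multi-user.target, which is an assumption about the setup here) would stop the display server so the test can be repeated without Xorg on the GPUs.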

manomugdha commented 9 months ago

> I'm not sure, but you mentioned you were using CUDA 12.2 and yet the NCCL version which is used mentions 11.5 so maybe you're not using the right NCCL library if multiple versions are installed on your system? Or is it that you recompiled NCCL with CUDA 11.5 but your driver is 12.2?

Yes, I think mine is the last case, i.e. NCCL is compiled with CUDA 11.5 and the driver is 12.2. I will update it tomorrow and let you know.

> That's what I'm not sure about. I would expect the CUDA launch timeout error to also happen when running on a single node. It would be helpful to share the output of both single node runs: the one with the RTX A2000 + P2200 and the one with 2 K620. There may be hints about what is happening when we run on 2 + 2.

logs for RTX A2000 + P2200:

# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   2522 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid   2522 on  manolinux device  1 [0x03] Quadro P2200
manolinux:2522:2522 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:2522:2522 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:2522:2522 [1] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux:2522:2533 [1] NCCL INFO NET/IB : No device found.
manolinux:2522:2533 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:2522:2533 [1] NCCL INFO Using non-device net plugin version 0
manolinux:2522:2533 [1] NCCL INFO Using network Socket
manolinux:2522:2532 [0] NCCL INFO Using non-device net plugin version 0
manolinux:2522:2532 [0] NCCL INFO Using network Socket
manolinux:2522:2533 [1] NCCL INFO comm 0x555d04cb4b70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3000 commId 0x451afe3e3c9aaa07 - Init START
manolinux:2522:2532 [0] NCCL INFO comm 0x555d04cb00e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x451afe3e3c9aaa07 - Init START

manolinux:2522:2532 [0] graph/search.cc:1024 NCCL WARN Could not find a path for pattern 4, falling back to simple order

manolinux:2522:2532 [0] graph/search.cc:1024 NCCL WARN Could not find a path for pattern 1, falling back to simple order

manolinux:2522:2533 [1] graph/search.cc:1024 NCCL WARN Could not find a path for pattern 4, falling back to simple order

manolinux:2522:2533 [1] graph/search.cc:1024 NCCL WARN Could not find a path for pattern 1, falling back to simple order
manolinux:2522:2532 [0] NCCL INFO Channel 00/02 :    0   1
manolinux:2522:2532 [0] NCCL INFO Channel 01/02 :    0   1
manolinux:2522:2533 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:2522:2533 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:2522:2532 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
manolinux:2522:2532 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:2522:2533 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:2522:2533 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:2522:2532 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:2522:2532 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:2522:2533 [1] NCCL INFO Connected all rings
manolinux:2522:2533 [1] NCCL INFO Connected all trees
manolinux:2522:2532 [0] NCCL INFO Connected all rings
manolinux:2522:2533 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux:2522:2533 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:2522:2532 [0] NCCL INFO Connected all trees
manolinux:2522:2532 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux:2522:2532 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:2522:2532 [0] NCCL INFO comm 0x555d04cb00e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x451afe3e3c9aaa07 - Init COMPLETE
manolinux:2522:2533 [1] NCCL INFO comm 0x555d04cb4b70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3000 commId 0x451afe3e3c9aaa07 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    12.01    0.00    0.00      0    12.29    0.00    0.00      0
          16             4     float     sum      -1    12.39    0.00    0.00      0    12.08    0.00    0.00      0
          32             8     float     sum      -1    12.47    0.00    0.00      0    12.52    0.00    0.00      0
          64            16     float     sum      -1    12.70    0.01    0.01      0    12.45    0.01    0.01      0
         128            32     float     sum      -1    12.77    0.01    0.01      0    31.98    0.00    0.00      0
         256            64     float     sum      -1    12.94    0.02    0.02      0    12.88    0.02    0.02      0
         512           128     float     sum      -1    30.54    0.02    0.02      0    30.22    0.02    0.02      0
        1024           256     float     sum      -1    33.85    0.03    0.03      0    33.81    0.03    0.03      0
        2048           512     float     sum      -1    34.56    0.06    0.06      0    34.44    0.06    0.06      0
        4096          1024     float     sum      -1    34.90    0.12    0.12      0    34.75    0.12    0.12      0
        8192          2048     float     sum      -1    38.11    0.21    0.21      0    37.93    0.22    0.22      0
       16384          4096     float     sum      -1    43.94    0.37    0.37      0    43.50    0.38    0.38      0
       32768          8192     float     sum      -1    50.83    0.64    0.64      0    51.17    0.64    0.64      0
       65536         16384     float     sum      -1    69.18    0.95    0.95      0    69.15    0.95    0.95      0
      131072         32768     float     sum      -1    104.6    1.25    1.25      0    103.9    1.26    1.26      0
      262144         65536     float     sum      -1    175.2    1.50    1.50      0    174.6    1.50    1.50      0
      524288        131072     float     sum      -1    319.0    1.64    1.64      0    317.1    1.65    1.65      0
     1048576        262144     float     sum      -1    593.7    1.77    1.77      0    589.8    1.78    1.78      0
manolinux:2522:2522 [1] NCCL INFO comm 0x555d04cb00e0 rank 0 nranks 2 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux:2522:2522 [1] NCCL INFO comm 0x555d04cb4b70 rank 1 nranks 2 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.478737
#

logs for K620:

# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  11738 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  1 Group  0 Pid  11738 on manolinux1 device  1 [0x28] Quadro K620
manolinux1:11738:11738 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:11738:11738 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux1:11738:11738 [1] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:11738:11749 [1] NCCL INFO NET/IB : No device found.
manolinux1:11738:11749 [1] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:11738:11749 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:11738:11749 [1] NCCL INFO Using network Socket
manolinux1:11738:11748 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:11738:11748 [0] NCCL INFO Using network Socket
manolinux1:11738:11749 [1] NCCL INFO comm 0x5591693c2090 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 28000 commId 0xb63cd1802bc84a6 - Init START
manolinux1:11738:11748 [0] NCCL INFO comm 0x5591693bd8e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId f000 commId 0xb63cd1802bc84a6 - Init START
manolinux1:11738:11748 [0] NCCL INFO Channel 00/02 :    0   1
manolinux1:11738:11748 [0] NCCL INFO Channel 01/02 :    0   1
manolinux1:11738:11749 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux1:11738:11748 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
manolinux1:11738:11748 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:11738:11749 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:11738:11748 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:11738:11748 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:11738:11749 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:11738:11749 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:11738:11748 [0] NCCL INFO Connected all rings
manolinux1:11738:11748 [0] NCCL INFO Connected all trees
manolinux1:11738:11749 [1] NCCL INFO Connected all rings
manolinux1:11738:11749 [1] NCCL INFO Connected all trees
manolinux1:11738:11749 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux1:11738:11749 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:11738:11748 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux1:11738:11748 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:11738:11749 [1] NCCL INFO comm 0x5591693c2090 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 28000 commId 0xb63cd1802bc84a6 - Init COMPLETE
manolinux1:11738:11748 [0] NCCL INFO comm 0x5591693bd8e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId f000 commId 0xb63cd1802bc84a6 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    18.15    0.00    0.00      0    13.67    0.00    0.00      0
          16             4     float     sum      -1    14.31    0.00    0.00      0    13.57    0.00    0.00      0
          32             8     float     sum      -1    14.32    0.00    0.00      0    13.73    0.00    0.00      0
          64            16     float     sum      -1    14.30    0.00    0.00      0    13.57    0.00    0.00      0
         128            32     float     sum      -1    14.14    0.01    0.01      0    13.65    0.01    0.01      0
         256            64     float     sum      -1    14.28    0.02    0.02      0    13.66    0.02    0.02      0
         512           128     float     sum      -1    15.46    0.03    0.03      0    13.91    0.04    0.04      0
        1024           256     float     sum      -1    16.11    0.06    0.06      0    19.46    0.05    0.05      0
        2048           512     float     sum      -1    22.26    0.09    0.09      0    21.69    0.09    0.09      0
        4096          1024     float     sum      -1    23.21    0.18    0.18      0    22.84    0.18    0.18      0
        8192          2048     float     sum      -1    33.73    0.24    0.24      0    29.53    0.28    0.28      0
       16384          4096     float     sum      -1    41.23    0.40    0.40      0    39.02    0.42    0.42      0
       32768          8192     float     sum      -1    51.25    0.64    0.64      0    51.03    0.64    0.64      0
       65536         16384     float     sum      -1    68.99    0.95    0.95      0    71.94    0.91    0.91      0
      131072         32768     float     sum      -1    103.1    1.27    1.27      0    102.5    1.28    1.28      0
      262144         65536     float     sum      -1    165.8    1.58    1.58      0    165.7    1.58    1.58      0
      524288        131072     float     sum      -1    303.9    1.73    1.73      0    300.0    1.75    1.75      0
     1048576        262144     float     sum      -1    572.9    1.83    1.83      0    581.4    1.80    1.80      0
manolinux1:11738:11738 [1] NCCL INFO comm 0x5591693bd8e0 rank 0 nranks 2 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux1:11738:11738 [1] NCCL INFO comm 0x5591693c2090 rank 1 nranks 2 cudaDev 1 busId 28000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.502758
#
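
For reference, the `algbw` and `busbw` columns can be re-derived from the size and time columns. The sketch below assumes the all-reduce convention documented for nccl-tests (algbw = size / time, busbw = algbw × 2(n−1)/n, where n is the number of ranks). The function name is only illustrative. It is checked against the last out-of-place row of the K620 run above (1048576 B in 572.9 us on 2 ranks).

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for one all_reduce_perf row.

    Assumes the nccl-tests all-reduce convention:
      algbw = size / time
      busbw = algbw * 2 * (nranks - 1) / nranks
    """
    seconds = time_us * 1e-6
    algbw = size_bytes / seconds / 1e9  # bytes/s -> GB/s
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Last out-of-place row of the single-node K620 run (2 ranks).
algbw, busbw = allreduce_bandwidth(1048576, 572.9, nranks=2)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
# prints "algbw=1.83 GB/s busbw=1.83 GB/s"
```

With 2 ranks the factor 2(n−1)/n equals 1, which is why the two columns are identical in the single-node runs.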

> Other than that, it could be useful to try to stop Xorg on both systems, as we've seen it interact with CUDA in the past.

I stopped Xorg on both machines. Now it is stuck with the following logs:

# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   2572 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid   2572 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  11844 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  11844 on manolinux1 device  1 [0x28] Quadro K620
manolinux:2572:2572 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:2572:2572 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:2572:2572 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:11844:11844 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:11844:11844 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:11844:11844 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:2572:2582 [1] NCCL INFO NET/IB : No device found.
manolinux:2572:2582 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:2572:2582 [1] NCCL INFO Using non-device net plugin version 0
manolinux:2572:2582 [1] NCCL INFO Using network Socket
manolinux:2572:2581 [0] NCCL INFO Using non-device net plugin version 0
manolinux:2572:2581 [0] NCCL INFO Using network Socket
manolinux1:11844:11852 [0] NCCL INFO NET/IB : No device found.
manolinux1:11844:11852 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:11844:11852 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:11844:11852 [0] NCCL INFO Using network Socket
manolinux1:11844:11853 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:11844:11853 [1] NCCL INFO Using network Socket
manolinux:2572:2582 [1] NCCL INFO comm 0x5643b1ab6760 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xcd29dd667b81b5b4 - Init START
manolinux:2572:2581 [0] NCCL INFO comm 0x5643afbacb50 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xcd29dd667b81b5b4 - Init START
manolinux1:11844:11853 [1] NCCL INFO comm 0x564b59a03620 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xcd29dd667b81b5b4 - Init START
manolinux1:11844:11852 [0] NCCL INFO comm 0x564b5975e100 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xcd29dd667b81b5b4 - Init START
manolinux1:11844:11853 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:11844:11853 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:2572:2582 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:2572:2582 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:2572:2581 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:2572:2581 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:2572:2581 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:2572:2581 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:11844:11852 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:11844:11852 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:2572:2582 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:2572:2582 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:11844:11852 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:11844:11853 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:2572:2581 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:11844:11853 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Connected all rings
manolinux:2572:2582 [1] NCCL INFO Connected all rings
manolinux1:11844:11853 [1] NCCL INFO Connected all rings
manolinux1:11844:11852 [0] NCCL INFO Connected all rings
manolinux:2572:2582 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:11844:11853 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:2572:2582 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:11844:11853 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:11844:11852 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:2572:2582 [1] NCCL INFO Connected all trees
manolinux:2572:2581 [0] NCCL INFO Connected all trees
manolinux:2572:2582 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:2572:2582 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:2572:2581 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:2572:2581 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:11844:11853 [1] NCCL INFO Connected all trees
manolinux1:11844:11853 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:11844:11852 [0] NCCL INFO Connected all trees
manolinux1:11844:11853 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:11844:11852 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:11844:11852 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:2572:2581 [0] NCCL INFO comm 0x5643afbacb50 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xcd29dd667b81b5b4 - Init COMPLETE
manolinux:2572:2582 [1] NCCL INFO comm 0x5643b1ab6760 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xcd29dd667b81b5b4 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
manolinux1:11844:11853 [1] NCCL INFO comm 0x564b59a03620 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xcd29dd667b81b5b4 - Init COMPLETE
manolinux1:11844:11852 [0] NCCL INFO comm 0x564b5975e100 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xcd29dd667b81b5b4 - Init COMPLETE
           8             2     float     sum      -1    178.8    0.00    0.00      4    209.0    0.00    0.00      4
          16             4     float     sum      -1    209.1    0.00    0.00      8    207.6    0.00    0.00      6
          32             8     float     sum      -1    210.3    0.00    0.00     16    212.1    0.00    0.00     12
          64            16     float     sum      -1    211.8    0.00    0.00     30    210.6    0.00    0.00     24
         128            32     float     sum      -1    212.3    0.00    0.00     54    212.9    0.00    0.00     52
         256            64     float     sum      -1    189.8    0.00    0.00    114    196.0    0.00    0.00    118
         512           128     float     sum      -1    209.6    0.00    0.00    234    212.6    0.00    0.00    230
        1024           256     float     sum      -1    232.5    0.00    0.01    436    227.2    0.00    0.01    452
        2048           512     float     sum      -1    294.7    0.01    0.01    864    303.3    0.01    0.01    906
        4096          1024     float     sum      -1    349.4    0.01    0.02   1818    311.4    0.01    0.02   1768
        8192          2048     float     sum      -1    373.0    0.02    0.03   3616    372.4    0.02    0.03   3612

sjeaugey commented 9 months ago

Thanks for bearing with me. There is progress.

Could you try the following 4 combinations: NCCL_ALGO=RING / NCCL_ALGO=TREE × NCCL_PROTO=SIMPLE / NCCL_PROTO=LL
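
The four combinations could be generated as below. This is only a sketch: the base `mpirun` command is copied from the original post, the `-b 8 -e 1M` sizes are an assumption inferred from the `minBytes 8 maxBytes 1048576` header in the later logs, and `combo_commands` is a hypothetical helper name. It prints the command lines rather than running them.

```python
import itertools
import shlex

# Base command from the original post; hosts and interfaces are specific
# to this two-node setup.
BASE = [
    "mpirun", "--allow-run-as-root", "--np", "2",
    "-H", "10.39.43.133,10.39.42.196",
    "-x", "NCCL_DEBUG=INFO",
    "-x", "LD_LIBRARY_PATH",
    "-x", "NCCL_SOCKET_IFNAME=eno1,enp1s0",
]
# Assumed size range (8 B to 1 MiB), inferred from the log headers.
TEST = ["./build/all_reduce_perf", "-b", "8", "-e", "1M", "-f", "2", "-g", "2"]


def combo_commands(algos=("RING", "TREE"), protos=("SIMPLE", "LL")):
    """Return one shell command line per NCCL_ALGO x NCCL_PROTO pair."""
    commands = []
    for algo, proto in itertools.product(algos, protos):
        env = ["-x", f"NCCL_ALGO={algo}", "-x", f"NCCL_PROTO={proto}"]
        commands.append(shlex.join(BASE + env + TEST))
    return commands


for command in combo_commands():
    print(command)
```

Each printed line can be pasted into a shell on the launching node. Keeping `NCCL_DEBUG=INFO` makes NCCL log lines such as `NCCL_ALGO set by environment to RING`, which confirm that the setting was picked up.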

manomugdha commented 9 months ago

Logs for algo=ring. It does not get stuck, but it fails:

# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3283 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid   3283 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  12203 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  12203 on manolinux1 device  1 [0x28] Quadro K620
manolinux:3283:3283 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3283:3283 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3283:3283 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12203:12203 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12203:12203 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12203:12203 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3283:3293 [1] NCCL INFO NET/IB : No device found.
manolinux:3283:3293 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3283:3293 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3283:3293 [1] NCCL INFO Using network Socket
manolinux:3283:3292 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3283:3292 [0] NCCL INFO Using network Socket
manolinux1:12203:12211 [0] NCCL INFO NET/IB : No device found.
manolinux1:12203:12211 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12203:12211 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12203:12211 [0] NCCL INFO Using network Socket
manolinux1:12203:12212 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12203:12212 [1] NCCL INFO Using network Socket
manolinux:3283:3293 [1] NCCL INFO comm 0x55be817d35b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x4eab9cb0e8af46b7 - Init START
manolinux:3283:3292 [0] NCCL INFO comm 0x55be7f8c99a0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4eab9cb0e8af46b7 - Init START
manolinux1:12203:12212 [1] NCCL INFO comm 0x55f023bb4640 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x4eab9cb0e8af46b7 - Init START
manolinux1:12203:12211 [0] NCCL INFO comm 0x55f02390f120 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x4eab9cb0e8af46b7 - Init START
manolinux:3283:3293 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3283:3293 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3283:3292 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:3283:3292 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:3283:3292 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3283:3292 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12203:12212 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12203:12212 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12203:12211 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12203:12211 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3283:3293 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3283:3293 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:12203:12212 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12203:12212 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12203:12211 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:3283:3292 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3283:3292 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3283:3293 [1] NCCL INFO Connected all rings
manolinux:3283:3293 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3283:3293 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3283:3292 [0] NCCL INFO Connected all rings
manolinux1:12203:12211 [0] NCCL INFO Connected all rings
manolinux1:12203:12212 [1] NCCL INFO Connected all rings
manolinux1:12203:12212 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12203:12212 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:3283:3292 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:3283:3293 [1] NCCL INFO Connected all trees
manolinux:3283:3292 [0] NCCL INFO Connected all trees
manolinux:3283:3293 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:3283:3292 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:3283:3293 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3283:3293 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3283:3292 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3283:3292 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12203:12212 [1] NCCL INFO Connected all trees
manolinux1:12203:12211 [0] NCCL INFO Connected all trees
manolinux1:12203:12211 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:12203:12212 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:12203:12212 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12203:12212 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12203:12211 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12203:12211 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3283:3292 [0] NCCL INFO comm 0x55be7f8c99a0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4eab9cb0e8af46b7 - Init COMPLETE
manolinux:3283:3293 [1] NCCL INFO comm 0x55be817d35b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x4eab9cb0e8af46b7 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
manolinux1:12203:12212 [1] NCCL INFO comm 0x55f023bb4640 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x4eab9cb0e8af46b7 - Init COMPLETE
manolinux1:12203:12211 [0] NCCL INFO comm 0x55f02390f120 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x4eab9cb0e8af46b7 - Init COMPLETE
           8             2     float     sum      -1    214.8    0.00    0.00      4    232.3    0.00    0.00      4
          16             4     float     sum      -1    222.7    0.00    0.00      8    227.4    0.00    0.00      6
          32             8     float     sum      -1    222.6    0.00    0.00     16    225.2    0.00    0.00     12
          64            16     float     sum      -1    226.2    0.00    0.00     30    230.4    0.00    0.00     24
         128            32     float     sum      -1    246.3    0.00    0.00     54    262.5    0.00    0.00     52
         256            64     float     sum      -1    265.7    0.00    0.00    114    227.0    0.00    0.00    118
         512           128     float     sum      -1    272.8    0.00    0.00    234    265.8    0.00    0.00    230
        1024           256     float     sum      -1    345.4    0.00    0.00    436    271.6    0.00    0.01    452
        2048           512     float     sum      -1    329.0    0.01    0.01    864    407.2    0.01    0.01    906
        4096          1024     float     sum      -1    454.7    0.01    0.01   1818    403.1    0.01    0.02   1768
        8192          2048     float     sum      -1    419.3    0.02    0.03   3616    413.1    0.02    0.03   3612
       16384          4096     float     sum      -1    463.6    0.04    0.05   7232    457.8    0.04    0.05   7206
       32768          8192     float     sum      -1    686.1    0.05    0.07  14402    693.6    0.05    0.07  14318
       65536         16384     float     sum      -1   1221.9    0.05    0.08  28600   1233.9    0.05    0.08  28676
      131072         32768     float     sum      -1   2280.2    0.06    0.09  57300   2330.5    0.06    0.08  57328
      262144         65536     float     sum      -1   4332.0    0.06    0.09  114442   4368.4    0.06    0.09  114582
      524288        131072     float     sum      -1   8176.5    0.06    0.10  229370   8187.9    0.06    0.10  229484
     1048576        262144     float     sum      -1    16256    0.06    0.10  458762    16242    0.06    0.10  458174
manolinux:3283:3283 [1] NCCL INFO comm 0x55be7f8c99a0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:12203:12203 [1] NCCL INFO comm 0x55f02390f120 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:3283:3283 [1] NCCL INFO comm 0x55be817d35b0 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth    : 0.0353681
#
manolinux1:12203:12203 [1] NCCL INFO comm 0x55f023bb4640 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
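
To compare runs without reading every row, the `#wrong` columns can be pulled out of a saved log. This is a minimal sketch, assuming the 13-column row layout shown in the table header above and that validation is enabled so `#wrong` is numeric. `parse_row` and `failing_measurements` are hypothetical helper names, not part of nccl-tests.

```python
def parse_row(line):
    """Parse one all_reduce_perf result row; return None for any other line."""
    fields = line.split()
    if len(fields) != 13 or not fields[0].isdigit():
        return None
    try:
        return {
            "size": int(fields[0]),
            "count": int(fields[1]),
            "oop_time_us": float(fields[5]),
            "oop_wrong": int(fields[8]),
            "ip_time_us": float(fields[9]),
            "ip_wrong": int(fields[12]),
        }
    except ValueError:
        # e.g. a non-numeric #wrong column when validation is disabled
        return None


def failing_measurements(lines):
    """Count (row, mode) measurements that report at least one wrong element."""
    rows = [row for row in map(parse_row, lines) if row]
    return sum((row["oop_wrong"] > 0) + (row["ip_wrong"] > 0) for row in rows)


# Two result rows and two non-result lines taken from the ring log above.
sample = [
    "           8             2     float     sum      -1    214.8    0.00    0.00      4    232.3    0.00    0.00      4",
    "     1048576        262144     float     sum      -1    16256    0.06    0.10  458762    16242    0.06    0.10  458174",
    "manolinux:3283:3283 [1] NCCL INFO comm 0x55be7f8c99a0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE",
    "# Out of bounds values : 36 FAILED",
]
print(failing_measurements(sample))  # prints 4
```

Applied to all 18 rows of the ring log, the count would be 18 × 2 = 36, the same number as in the `Out of bounds values : 36 FAILED` summary. That the summary is computed this way is an inference from these numbers, not something confirmed in this thread.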

Logs for algo=tree. It does not get stuck, but it fails:

# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3319 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid   3319 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  12260 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  12260 on manolinux1 device  1 [0x28] Quadro K620
manolinux:3319:3319 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3319:3319 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3319:3319 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12260:12260 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12260:12260 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12260:12260 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3319:3329 [1] NCCL INFO NET/IB : No device found.
manolinux:3319:3329 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3319:3329 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3319:3329 [1] NCCL INFO Using network Socket
manolinux:3319:3328 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3319:3328 [0] NCCL INFO Using network Socket
manolinux1:12260:12268 [0] NCCL INFO NET/IB : No device found.
manolinux1:12260:12268 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12260:12268 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12260:12268 [0] NCCL INFO Using network Socket
manolinux1:12260:12269 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12260:12269 [1] NCCL INFO Using network Socket
manolinux:3319:3328 [0] NCCL INFO comm 0x5647459c65c0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x540a9c130dcc7f4e - Init START
manolinux:3319:3329 [1] NCCL INFO comm 0x5647478d01d0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x540a9c130dcc7f4e - Init START
manolinux1:12260:12268 [0] NCCL INFO comm 0x5604044d60d0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x540a9c130dcc7f4e - Init START
manolinux1:12260:12269 [1] NCCL INFO comm 0x56040477b5f0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x540a9c130dcc7f4e - Init START
manolinux:3319:3329 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3319:3329 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3319:3328 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:3319:3328 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:3319:3328 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3319:3328 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12260:12269 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12260:12269 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12260:12268 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12260:12268 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3319:3328 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3319:3328 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3319:3329 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3319:3329 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12260:12269 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12260:12269 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12260:12268 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:3319:3329 [1] NCCL INFO Connected all rings
manolinux:3319:3329 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3319:3329 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3319:3328 [0] NCCL INFO Connected all rings
manolinux1:12260:12268 [0] NCCL INFO Connected all rings
manolinux1:12260:12269 [1] NCCL INFO Connected all rings
manolinux1:12260:12269 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12260:12269 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12260:12268 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Connected all trees
manolinux:3319:3328 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:3319:3328 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3319:3328 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12260:12269 [1] NCCL INFO Connected all trees
manolinux1:12260:12269 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:12260:12268 [0] NCCL INFO Connected all trees
manolinux1:12260:12269 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12260:12269 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12260:12268 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:12260:12268 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12260:12268 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3319:3329 [1] NCCL INFO Connected all trees
manolinux:3319:3329 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:3319:3329 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3319:3329 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3319:3329 [1] NCCL INFO comm 0x5647478d01d0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x540a9c130dcc7f4e - Init COMPLETE
manolinux:3319:3328 [0] NCCL INFO comm 0x5647459c65c0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x540a9c130dcc7f4e - Init COMPLETE
#
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
manolinux1:12260:12268 [0] NCCL INFO comm 0x5604044d60d0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x540a9c130dcc7f4e - Init COMPLETE
manolinux1:12260:12269 [1] NCCL INFO comm 0x56040477b5f0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x540a9c130dcc7f4e - Init COMPLETE
           8             2     float     sum      -1    205.0    0.00    0.00      4    209.6    0.00    0.00      4
          16             4     float     sum      -1    208.4    0.00    0.00      8    207.6    0.00    0.00      6
          32             8     float     sum      -1    205.6    0.00    0.00     16    213.7    0.00    0.00     12
          64            16     float     sum      -1    212.3    0.00    0.00     30    201.6    0.00    0.00     24
         128            32     float     sum      -1    191.6    0.00    0.00     54    195.9    0.00    0.00     52
         256            64     float     sum      -1    215.2    0.00    0.00    114    214.2    0.00    0.00    118
         512           128     float     sum      -1    214.9    0.00    0.00    234    214.0    0.00    0.00    230
        1024           256     float     sum      -1    225.6    0.00    0.01    436    232.4    0.00    0.01    452
        2048           512     float     sum      -1    291.4    0.01    0.01    864    300.9    0.01    0.01    906
        4096          1024     float     sum      -1    319.9    0.01    0.02   1818    320.4    0.01    0.02   1768
        8192          2048     float     sum      -1    374.4    0.02    0.03   3616    365.0    0.02    0.03   3612
       16384          4096     float     sum      -1    560.4    0.03    0.04   7232    565.8    0.03    0.04   7206
       32768          8192     float     sum      -1    997.0    0.03    0.05  14402   1009.9    0.03    0.05  14318
       65536         16384     float     sum      -1   1022.6    0.06    0.10  28600   1052.1    0.06    0.09  28676
      131072         32768     float     sum      -1   1547.4    0.08    0.13  57300   1532.8    0.09    0.13  57328
      262144         65536     float     sum      -1   2802.7    0.09    0.14  114442   2804.8    0.09    0.14  114582
      524288        131072     float     sum      -1   5502.9    0.10    0.14  229370   5474.3    0.10    0.14  229484
     1048576        262144     float     sum      -1    11057    0.09    0.14  458762    11214    0.09    0.14  458174
manolinux:3319:3319 [1] NCCL INFO comm 0x5647459c65c0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:12260:12260 [1] NCCL INFO comm 0x5604044d60d0 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:3319:3319 [1] NCCL INFO comm 0x5647478d01d0 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth    : 0.0453675
#
manolinux1:12260:12260 [1] NCCL INFO comm 0x56040477b5f0 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
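
As a sanity check on the table above: for all-reduce, nccl-tests reports algbw as size divided by time and busbw as algbw scaled by 2(n-1)/n, where n is the number of ranks (4 here). The sketch below recomputes both for the last row; the function name is hypothetical and is not part of nccl-tests.

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement."""
    # Algorithm bandwidth: bytes moved per second, expressed in GB/s (1e9 bytes).
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # Bus bandwidth for all-reduce applies the 2(n-1)/n correction factor.
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Last out-of-place row of the table: 1048576 bytes in 11057 us across 4 ranks.
algbw, busbw = allreduce_bandwidth(1048576, 11057, 4)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The result agrees with the reported 0.09 and 0.14 GB/s, so the timing columns look internally consistent; the failure is in the #wrong column, not in the bandwidth arithmetic.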

Logs for proto=simple. It gets stuck (hangs):

# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3363 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid   3363 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  12319 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  12319 on manolinux1 device  1 [0x28] Quadro K620
manolinux:3363:3363 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3363:3363 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3363:3363 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12319:12319 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12319:12319 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12319:12319 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3363:3373 [1] NCCL INFO NET/IB : No device found.
manolinux:3363:3373 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3363:3373 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3363:3373 [1] NCCL INFO Using network Socket
manolinux:3363:3372 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3363:3372 [0] NCCL INFO Using network Socket
manolinux1:12319:12327 [0] NCCL INFO NET/IB : No device found.
manolinux1:12319:12327 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12319:12327 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12319:12327 [0] NCCL INFO Using network Socket
manolinux1:12319:12328 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12319:12328 [1] NCCL INFO Using network Socket
manolinux:3363:3373 [1] NCCL INFO comm 0x55e47b8eb7d0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x5d9cf78374ef3dbb - Init START
manolinux:3363:3372 [0] NCCL INFO comm 0x55e4799e1bb0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x5d9cf78374ef3dbb - Init START
manolinux1:12319:12328 [1] NCCL INFO comm 0x55811657f2a0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x5d9cf78374ef3dbb - Init START
manolinux1:12319:12327 [0] NCCL INFO comm 0x5581162d9d80 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x5d9cf78374ef3dbb - Init START
manolinux:3363:3373 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3363:3373 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3363:3372 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:3363:3372 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:3363:3372 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3363:3372 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12319:12328 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12319:12328 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12319:12327 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12319:12327 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3363:3372 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3363:3372 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3363:3373 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:12319:12328 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12319:12328 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:3363:3373 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12319:12327 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:3363:3373 [1] NCCL INFO Connected all rings
manolinux:3363:3373 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3363:3373 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3363:3372 [0] NCCL INFO Connected all rings
manolinux1:12319:12328 [1] NCCL INFO Connected all rings
manolinux1:12319:12328 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12319:12327 [0] NCCL INFO Connected all rings
manolinux1:12319:12328 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:3363:3372 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12319:12328 [1] NCCL INFO Connected all trees
manolinux1:12319:12328 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:12319:12327 [0] NCCL INFO Connected all trees
manolinux1:12319:12328 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12319:12327 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:12319:12328 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12319:12327 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12319:12327 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3363:3372 [0] NCCL INFO Connected all trees
manolinux:3363:3373 [1] NCCL INFO Connected all trees
manolinux:3363:3372 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:3363:3373 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:3363:3373 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3363:3373 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3363:3372 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3363:3372 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12319:12327 [0] NCCL INFO comm 0x5581162d9d80 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x5d9cf78374ef3dbb - Init COMPLETE
manolinux1:12319:12328 [1] NCCL INFO comm 0x55811657f2a0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x5d9cf78374ef3dbb - Init COMPLETE
manolinux:3363:3372 [0] NCCL INFO comm 0x55e4799e1bb0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x5d9cf78374ef3dbb - Init COMPLETE
manolinux:3363:3373 [1] NCCL INFO comm 0x55e47b8eb7d0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x5d9cf78374ef3dbb - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    333.8    0.00    0.00      4    342.7    0.00    0.00      4
          16             4     float     sum      -1    365.2    0.00    0.00      8    420.7    0.00    0.00      6
          32             8     float     sum      -1    420.3    0.00    0.00     16    302.1    0.00    0.00     12
          64            16     float     sum      -1    297.3    0.00    0.00     30    298.8    0.00    0.00     24
         128            32     float     sum      -1    301.5    0.00    0.00     54    307.1    0.00    0.00     52
         256            64     float     sum      -1    295.8    0.00    0.00    114    314.2    0.00    0.00    118

Logs for proto=LL. It does not hang, but it fails:

# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3390 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid   3390 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  12379 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  12379 on manolinux1 device  1 [0x28] Quadro K620
manolinux:3390:3390 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3390:3390 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3390:3390 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12379:12379 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12379:12379 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12379:12379 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3390:3400 [1] NCCL INFO NET/IB : No device found.
manolinux:3390:3400 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3390:3400 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3390:3400 [1] NCCL INFO Using network Socket
manolinux:3390:3399 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3390:3399 [0] NCCL INFO Using network Socket
manolinux1:12379:12387 [0] NCCL INFO NET/IB : No device found.
manolinux1:12379:12387 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12379:12387 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12379:12387 [0] NCCL INFO Using network Socket
manolinux1:12379:12388 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12379:12388 [1] NCCL INFO Using network Socket
manolinux:3390:3400 [1] NCCL INFO comm 0x55c8e8ebe540 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x4c2ae6bbb2819870 - Init START
manolinux:3390:3399 [0] NCCL INFO comm 0x55c8e6fb4930 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4c2ae6bbb2819870 - Init START
manolinux1:12379:12388 [1] NCCL INFO comm 0x558cd42a7650 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x4c2ae6bbb2819870 - Init START
manolinux1:12379:12387 [0] NCCL INFO comm 0x558cd4002130 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x4c2ae6bbb2819870 - Init START
manolinux:3390:3400 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3390:3400 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3390:3399 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:3390:3399 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:3390:3399 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3390:3399 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12379:12388 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12379:12388 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12379:12387 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12379:12387 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3390:3400 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3390:3400 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3390:3399 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:12379:12388 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12379:12387 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12379:12388 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Connected all rings
manolinux:3390:3399 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:12379:12388 [1] NCCL INFO Connected all rings
manolinux:3390:3400 [1] NCCL INFO Connected all rings
manolinux:3390:3400 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:12379:12388 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12379:12387 [0] NCCL INFO Connected all rings
manolinux:3390:3400 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:12379:12387 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12379:12388 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12379:12387 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:12379:12388 [1] NCCL INFO Connected all trees
manolinux1:12379:12387 [0] NCCL INFO Connected all trees
manolinux1:12379:12388 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:12379:12387 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:12379:12387 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12379:12387 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12379:12388 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12379:12388 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3390:3399 [0] NCCL INFO Connected all trees
manolinux:3390:3399 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:3390:3400 [1] NCCL INFO Connected all trees
manolinux:3390:3400 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:3390:3400 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3390:3400 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3390:3399 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3390:3399 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12379:12387 [0] NCCL INFO comm 0x558cd4002130 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x4c2ae6bbb2819870 - Init COMPLETE
manolinux1:12379:12388 [1] NCCL INFO comm 0x558cd42a7650 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x4c2ae6bbb2819870 - Init COMPLETE
manolinux:3390:3400 [1] NCCL INFO comm 0x55c8e8ebe540 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x4c2ae6bbb2819870 - Init COMPLETE
manolinux:3390:3399 [0] NCCL INFO comm 0x55c8e6fb4930 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4c2ae6bbb2819870 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    212.3    0.00    0.00      4    212.1    0.00    0.00      4
          16             4     float     sum      -1    206.9    0.00    0.00      8    214.2    0.00    0.00      6
          32             8     float     sum      -1    213.0    0.00    0.00     16    209.9    0.00    0.00     12
          64            16     float     sum      -1    212.5    0.00    0.00     30    211.8    0.00    0.00     24
         128            32     float     sum      -1    201.7    0.00    0.00     54    195.8    0.00    0.00     52
         256            64     float     sum      -1    189.4    0.00    0.00    114    197.8    0.00    0.00    118
         512           128     float     sum      -1    209.0    0.00    0.00    234    209.9    0.00    0.00    230
        1024           256     float     sum      -1    222.1    0.00    0.01    436    224.6    0.00    0.01    452
        2048           512     float     sum      -1    291.8    0.01    0.01    864    285.6    0.01    0.01    906
        4096          1024     float     sum      -1    362.3    0.01    0.02   1818    365.1    0.01    0.02   1768
        8192          2048     float     sum      -1    371.6    0.02    0.03   3616    363.1    0.02    0.03   3612
       16384          4096     float     sum      -1    707.5    0.02    0.03   7232    685.3    0.02    0.04   7206
       32768          8192     float     sum      -1   1016.1    0.03    0.05  14402   1028.8    0.03    0.05  14318
       65536         16384     float     sum      -1   2208.9    0.03    0.04  28600   2216.8    0.03    0.04  28676
      131072         32768     float     sum      -1   4245.8    0.03    0.05  57300   4235.7    0.03    0.05  57328
      262144         65536     float     sum      -1   8390.5    0.03    0.05  114442   8389.3    0.03    0.05  114582
      524288        131072     float     sum      -1    16733    0.03    0.05  229370    16727    0.03    0.05  229484
     1048576        262144     float     sum      -1    33434    0.03    0.05  458762    33450    0.03    0.05  458174
manolinux:3390:3390 [1] NCCL INFO comm 0x55c8e6fb4930 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:12379:12379 [1] NCCL INFO comm 0x558cd4002130 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:3390:3390 [1] NCCL INFO comm 0x55c8e8ebe540 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth    : 0.0216889
#
manolinux1:12379:12379 [1] NCCL INFO comm 0x558cd42a7650 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[55462,1],1]
  Exit code:    1
--------------------------------------------------------------------------

Logs for algo=tree and proto=LL. It fails:

# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3417 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid   3417 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  12436 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  12436 on manolinux1 device  1 [0x28] Quadro K620
manolinux:3417:3417 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3417:3417 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3417:3417 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12436:12436 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12436:12436 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12436:12436 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3417:3427 [1] NCCL INFO NET/IB : No device found.
manolinux:3417:3427 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3417:3427 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3417:3426 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3417:3426 [0] NCCL INFO Using network Socket
manolinux:3417:3427 [1] NCCL INFO Using network Socket
manolinux1:12436:12444 [0] NCCL INFO NET/IB : No device found.
manolinux1:12436:12444 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12436:12444 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12436:12444 [0] NCCL INFO Using network Socket
manolinux1:12436:12445 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12436:12445 [1] NCCL INFO Using network Socket
manolinux:3417:3427 [1] NCCL INFO comm 0x557f78b5b7c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xf687b4268f07ea8f - Init START
manolinux:3417:3426 [0] NCCL INFO comm 0x557f76c51ba0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xf687b4268f07ea8f - Init START
manolinux1:12436:12444 [0] NCCL INFO comm 0x5623c4487dc0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xf687b4268f07ea8f - Init START
manolinux1:12436:12445 [1] NCCL INFO comm 0x5623c472d2e0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xf687b4268f07ea8f - Init START
manolinux:3417:3426 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:3417:3426 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:3417:3426 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3417:3426 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12436:12445 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12436:12445 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3417:3427 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3417:3427 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12436:12444 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12436:12444 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3417:3427 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3417:3427 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3417:3426 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:12436:12444 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12436:12445 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12436:12445 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12436:12444 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:3417:3426 [0] NCCL INFO Connected all rings
manolinux:3417:3426 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3417:3427 [1] NCCL INFO Connected all rings
manolinux1:12436:12444 [0] NCCL INFO Connected all rings
manolinux1:12436:12445 [1] NCCL INFO Connected all rings
manolinux:3417:3427 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3417:3427 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:12436:12445 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12436:12445 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:3417:3426 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Connected all trees
manolinux:3417:3427 [1] NCCL INFO Connected all trees
manolinux:3417:3427 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:3417:3426 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:3417:3427 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:3417:3426 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:3417:3427 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3417:3427 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3417:3426 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3417:3426 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12436:12444 [0] NCCL INFO Connected all trees
manolinux1:12436:12444 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:12436:12444 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:12436:12444 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12436:12444 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12436:12445 [1] NCCL INFO Connected all trees
manolinux1:12436:12445 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:12436:12445 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:12436:12445 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12436:12445 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3417:3426 [0] NCCL INFO comm 0x557f76c51ba0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xf687b4268f07ea8f - Init COMPLETE
manolinux:3417:3427 [1] NCCL INFO comm 0x557f78b5b7c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xf687b4268f07ea8f - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
manolinux1:12436:12444 [0] NCCL INFO comm 0x5623c4487dc0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xf687b4268f07ea8f - Init COMPLETE
manolinux1:12436:12445 [1] NCCL INFO comm 0x5623c472d2e0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xf687b4268f07ea8f - Init COMPLETE
           8             2     float     sum      -1    155.1    0.00    0.00      4    151.4    0.00    0.00      4
          16             4     float     sum      -1    156.3    0.00    0.00      8    155.7    0.00    0.00      6
          32             8     float     sum      -1    156.2    0.00    0.00     16    201.1    0.00    0.00     12
          64            16     float     sum      -1    210.2    0.00    0.00     30    210.7    0.00    0.00     24
         128            32     float     sum      -1    213.8    0.00    0.00     54    214.4    0.00    0.00     52
         256            64     float     sum      -1    216.0    0.00    0.00    114    210.2    0.00    0.00    118
         512           128     float     sum      -1    215.7    0.00    0.00    234    216.3    0.00    0.00    230
        1024           256     float     sum      -1    229.2    0.00    0.01    436    227.4    0.00    0.01    452
        2048           512     float     sum      -1    294.9    0.01    0.01    864    303.7    0.01    0.01    906
        4096          1024     float     sum      -1    358.7    0.01    0.02   1818    353.3    0.01    0.02   1768
        8192          2048     float     sum      -1    463.9    0.02    0.03   3616    392.4    0.02    0.03   3612
       16384          4096     float     sum      -1    689.8    0.02    0.04   7232    705.3    0.02    0.03   7206
       32768          8192     float     sum      -1   1017.6    0.03    0.05  14402   1029.3    0.03    0.05  14318
       65536         16384     float     sum      -1   1749.4    0.04    0.06  28600   1735.8    0.04    0.06  28676
      131072         32768     float     sum      -1   3062.2    0.04    0.06  57300   3044.3    0.04    0.06  57328
      262144         65536     float     sum      -1   5726.5    0.05    0.07  114442   5598.7    0.05    0.07  114582
      524288        131072     float     sum      -1    11182    0.05    0.07  229370    11175    0.05    0.07  229484
     1048576        262144     float     sum      -1    22194    0.05    0.07  458762    22207    0.05    0.07  458174
manolinux:3417:3417 [1] NCCL INFO comm 0x557f76c51ba0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:12436:12436 [1] NCCL INFO comm 0x5623c4487dc0 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:3417:3417 [1] NCCL INFO comm 0x557f78b5b7c0 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth    : 0.02695
#
manolinux1:12436:12436 [1] NCCL INFO comm 0x5623c472d2e0 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[55517,1],1]
  Exit code:    1
--------------------------------------------------------------------------
manomugdha commented 9 months ago

Logs for algo=ring, proto=simple. It fails:

(env-torch) mbiswas@manolinux:~/ai/pytorch/nccl-tests$ mpirun -np 2 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_ALGO=RING -x NCCL_PROTO=SIMPLE ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  21894 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid  21894 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  29571 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  29571 on manolinux1 device  1 [0x28] Quadro K620
manolinux:21894:21894 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:21894:21894 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21894:21894 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:29571:29571 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:29571:29571 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:29571:29571 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21894:21904 [1] NCCL INFO NET/IB : No device found.
manolinux:21894:21904 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:21894:21904 [1] NCCL INFO Using non-device net plugin version 0
manolinux:21894:21904 [1] NCCL INFO Using network Socket
manolinux:21894:21903 [0] NCCL INFO Using non-device net plugin version 0
manolinux:21894:21903 [0] NCCL INFO Using network Socket
manolinux1:29571:29579 [0] NCCL INFO NET/IB : No device found.
manolinux1:29571:29579 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:29571:29579 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:29571:29579 [0] NCCL INFO Using network Socket
manolinux1:29571:29580 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:29571:29580 [1] NCCL INFO Using network Socket
manolinux:21894:21904 [1] NCCL INFO comm 0x5573da6246f0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbca260874ad1ab29 - Init START
manolinux:21894:21903 [0] NCCL INFO comm 0x5573d871aae0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbca260874ad1ab29 - Init START
manolinux1:29571:29580 [1] NCCL INFO comm 0x561e2496f640 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbca260874ad1ab29 - Init START
manolinux1:29571:29579 [0] NCCL INFO comm 0x561e246ca120 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbca260874ad1ab29 - Init START
manolinux:21894:21904 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:21894:21904 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:21894:21903 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:21894:21903 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:21894:21903 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:21894:21903 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:29571:29580 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:29571:29580 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:29571:29579 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:29571:29579 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:21894:21903 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21894:21903 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21894:21904 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29571:29580 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29571:29580 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:29571:29579 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:21894:21904 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Connected all rings
manolinux1:29571:29580 [1] NCCL INFO Connected all rings
manolinux1:29571:29580 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:29571:29580 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:21894:21904 [1] NCCL INFO Connected all rings
manolinux1:29571:29579 [0] NCCL INFO Connected all rings
manolinux:21894:21904 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:21894:21904 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:29571:29579 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:21894:21904 [1] NCCL INFO Connected all trees
manolinux:21894:21903 [0] NCCL INFO Connected all trees
manolinux:21894:21903 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:21894:21904 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:21894:21903 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:21894:21904 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:21894:21904 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21894:21904 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21894:21903 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21894:21903 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29571:29579 [0] NCCL INFO Connected all trees
manolinux1:29571:29579 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:29571:29579 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:29571:29579 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29571:29579 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29571:29580 [1] NCCL INFO Connected all trees
manolinux1:29571:29580 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:29571:29580 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:29571:29580 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29571:29580 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21894:21904 [1] NCCL INFO comm 0x5573da6246f0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbca260874ad1ab29 - Init COMPLETE
manolinux:21894:21903 [0] NCCL INFO comm 0x5573d871aae0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbca260874ad1ab29 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
manolinux1:29571:29579 [0] NCCL INFO comm 0x561e246ca120 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbca260874ad1ab29 - Init COMPLETE
manolinux1:29571:29580 [1] NCCL INFO comm 0x561e2496f640 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbca260874ad1ab29 - Init COMPLETE
           8             2     float     sum      -1    338.4    0.00    0.00      4    351.7    0.00    0.00      4
          16             4     float     sum      -1    388.1    0.00    0.00      8    300.7    0.00    0.00      6
          32             8     float     sum      -1    436.8    0.00    0.00     16    418.1    0.00    0.00     12
          64            16     float     sum      -1    424.9    0.00    0.00     30    438.5    0.00    0.00     24
         128            32     float     sum      -1    308.2    0.00    0.00     54    290.9    0.00    0.00     52
         256            64     float     sum      -1    313.6    0.00    0.00    114    311.8    0.00    0.00    118
         512           128     float     sum      -1    415.2    0.00    0.00    234    454.8    0.00    0.00    230
        1024           256     float     sum      -1    395.3    0.00    0.00    436    394.8    0.00    0.00    452
        2048           512     float     sum      -1    371.4    0.01    0.01    864    337.9    0.01    0.01    906
        4096          1024     float     sum      -1    378.2    0.01    0.02   1818    367.6    0.01    0.02   1768
        8192          2048     float     sum      -1    502.3    0.02    0.02   3616    422.5    0.02    0.03   3612
       16384          4096     float     sum      -1    514.7    0.03    0.05   7232    516.3    0.03    0.05   7206
       32768          8192     float     sum      -1    703.5    0.05    0.07  14402    695.7    0.05    0.07  14318
       65536         16384     float     sum      -1   1237.7    0.05    0.08  28600   1245.7    0.05    0.08  28676
      131072         32768     float     sum      -1   2304.4    0.06    0.09  57300   2293.5    0.06    0.09  57328
      262144         65536     float     sum      -1   4364.4    0.06    0.09  114442   4345.7    0.06    0.09  114582
      524288        131072     float     sum      -1   8183.1    0.06    0.10  229370   8194.0    0.06    0.10  229484
     1048576        262144     float     sum      -1    16249    0.06    0.10  458762    16266    0.06    0.10  458174
manolinux:21894:21894 [1] NCCL INFO comm 0x5573d871aae0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:29571:29571 [1] NCCL INFO comm 0x561e246ca120 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:21894:21894 [1] NCCL INFO comm 0x5573da6246f0 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth    : 0.0347557
#
manolinux1:29571:29571 [1] NCCL INFO comm 0x561e2496f640 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33006,1],0]
  Exit code:    1
--------------------------------------------------------------------------
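
As a sanity check on the bandwidth columns above, here is a minimal sketch of how `algbw` and `busbw` relate for all-reduce, assuming the usual nccl-tests convention that bus bandwidth is algorithm bandwidth scaled by 2(n-1)/n for n ranks. The function name `allreduce_bandwidth` is hypothetical; the input numbers are taken from the 1048576-byte out-of-place row of the ring/simple run (16249 us, 4 ranks).

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    Assumes the nccl-tests convention: algbw = size / time, and
    busbw = algbw * 2 * (nranks - 1) / nranks for all-reduce.
    """
    seconds = time_us * 1e-6
    algbw = size_bytes / seconds / 1e9
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Values from the 1048576-byte out-of-place row of the ring/simple log.
algbw, busbw = allreduce_bandwidth(1048576, 16249, 4)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

With these inputs the result rounds to the 0.06 and 0.10 GB/s shown in the table, so the timing columns look internally consistent; the failure is in the `#wrong` columns, not in the bandwidth arithmetic.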

Logs for algo=ring, proto=ll. It fails:

(env-torch) mbiswas@manolinux:~/ai/pytorch/nccl-tests$ mpirun -np 2 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_ALGO=RING -x NCCL_PROTO=LL ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  21921 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid  21921 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  29629 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  29629 on manolinux1 device  1 [0x28] Quadro K620
manolinux:21921:21921 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:21921:21921 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21921:21921 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:29629:29629 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:29629:29629 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:29629:29629 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21921:21931 [1] NCCL INFO NET/IB : No device found.
manolinux:21921:21931 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:21921:21931 [1] NCCL INFO Using non-device net plugin version 0
manolinux:21921:21931 [1] NCCL INFO Using network Socket
manolinux:21921:21930 [0] NCCL INFO Using non-device net plugin version 0
manolinux:21921:21930 [0] NCCL INFO Using network Socket
manolinux1:29629:29637 [0] NCCL INFO NET/IB : No device found.
manolinux1:29629:29637 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:29629:29637 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:29629:29637 [0] NCCL INFO Using network Socket
manolinux1:29629:29638 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:29629:29638 [1] NCCL INFO Using network Socket
manolinux:21921:21931 [1] NCCL INFO comm 0x563dd0303830 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbd06a7ca770fd61 - Init START
manolinux:21921:21930 [0] NCCL INFO comm 0x563dce3f9c20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbd06a7ca770fd61 - Init START
manolinux1:29629:29637 [0] NCCL INFO comm 0x5583fa007f00 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbd06a7ca770fd61 - Init START
manolinux1:29629:29638 [1] NCCL INFO comm 0x5583fa2ad420 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbd06a7ca770fd61 - Init START
manolinux:21921:21931 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:21921:21931 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:21921:21930 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:21921:21930 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:21921:21930 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:21921:21930 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:29629:29638 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:29629:29638 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:29629:29637 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:29629:29637 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:21921:21931 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21921:21931 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21921:21930 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:29629:29637 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29629:29638 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29629:29638 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:29629:29637 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:21921:21931 [1] NCCL INFO Connected all rings
manolinux:21921:21931 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:21921:21931 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:21921:21930 [0] NCCL INFO Connected all rings
manolinux1:29629:29637 [0] NCCL INFO Connected all rings
manolinux1:29629:29638 [1] NCCL INFO Connected all rings
manolinux1:29629:29638 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:29629:29638 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:29629:29637 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Connected all trees
manolinux:21921:21931 [1] NCCL INFO Connected all trees
manolinux:21921:21930 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:21921:21931 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:21921:21931 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:21921:21931 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21921:21931 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21921:21930 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:21921:21930 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21921:21930 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29629:29638 [1] NCCL INFO Connected all trees
manolinux1:29629:29638 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:29629:29638 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:29629:29637 [0] NCCL INFO Connected all trees
manolinux1:29629:29638 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29629:29637 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:29629:29638 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29629:29637 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:29629:29637 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29629:29637 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29629:29637 [0] NCCL INFO comm 0x5583fa007f00 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbd06a7ca770fd61 - Init COMPLETE
manolinux1:29629:29638 [1] NCCL INFO comm 0x5583fa2ad420 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbd06a7ca770fd61 - Init COMPLETE
manolinux:21921:21930 [0] NCCL INFO comm 0x563dce3f9c20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbd06a7ca770fd61 - Init COMPLETE
manolinux:21921:21931 [1] NCCL INFO comm 0x563dd0303830 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbd06a7ca770fd61 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    239.8    0.00    0.00      4    223.1    0.00    0.00      4
          16             4     float     sum      -1    226.9    0.00    0.00      8    216.4    0.00    0.00      6
          32             8     float     sum      -1    224.8    0.00    0.00     16    219.4    0.00    0.00     12
          64            16     float     sum      -1    236.0    0.00    0.00     30    223.9    0.00    0.00     24
         128            32     float     sum      -1    238.8    0.00    0.00     54    242.4    0.00    0.00     52
         256            64     float     sum      -1    216.2    0.00    0.00    114    266.0    0.00    0.00    118
         512           128     float     sum      -1    320.7    0.00    0.00    234    308.2    0.00    0.00    230
        1024           256     float     sum      -1    327.3    0.00    0.00    436    343.3    0.00    0.00    452
        2048           512     float     sum      -1    414.6    0.00    0.01    864    427.0    0.00    0.01    906
        4096          1024     float     sum      -1    460.9    0.01    0.01   1818    466.3    0.01    0.01   1768
        8192          2048     float     sum      -1    467.1    0.02    0.03   3616    407.9    0.02    0.03   3612
       16384          4096     float     sum      -1    694.7    0.02    0.04   7232    713.8    0.02    0.03   7206
       32768          8192     float     sum      -1   1162.6    0.03    0.04  14402   1177.2    0.03    0.04  14318
       65536         16384     float     sum      -1   2199.8    0.03    0.04  28600   2203.3    0.03    0.04  28676
      131072         32768     float     sum      -1   4237.8    0.03    0.05  57300   4227.0    0.03    0.05  57328
      262144         65536     float     sum      -1   8375.4    0.03    0.05  114442   8388.5    0.03    0.05  114582
      524288        131072     float     sum      -1    16720    0.03    0.05  229370    16712    0.03    0.05  229484
     1048576        262144     float     sum      -1    33421    0.03    0.05  458762    33404    0.03    0.05  458174
manolinux:21921:21921 [1] NCCL INFO comm 0x563dce3f9c20 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:29629:29629 [1] NCCL INFO comm 0x5583fa007f00 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:21921:21921 [1] NCCL INFO comm 0x563dd0303830 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth    : 0.0204477
#
manolinux1:29629:29629 [1] NCCL INFO comm 0x5583fa2ad420 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32773,1],0]
  Exit code:    1
--------------------------------------------------------------------------
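
To compare runs without reading every row, the result tables can be tallied with a small script. This is only a sketch: it assumes each data row has 13 whitespace-separated fields with the two `#wrong` counts in positions 9 and 13, as in the tables above, and the helper name `count_failed_measurements` is hypothetical. Each run here has 18 sizes and two placements (out-of-place, in-place), all with a nonzero `#wrong`, which would be consistent with the reported "Out of bounds values : 36" if that figure counts failed measurements; that reading is an assumption.

```python
def count_failed_measurements(lines):
    """Count (size, placement) measurements whose #wrong column is nonzero.

    Assumes nccl-tests data rows have 13 whitespace-separated fields:
    size count type redop root, then time/algbw/busbw/#wrong twice.
    Header lines and interleaved NCCL INFO lines are skipped.
    """
    failed = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 13 or not fields[0].isdigit():
            continue  # not a data row
        for idx in (8, 12):  # out-of-place #wrong, in-place #wrong
            if int(fields[idx]) != 0:
                failed += 1
    return failed


# Two rows and one interleaved INFO line copied from the ring/ll log.
sample = [
    "#       size         count      type   redop    root     time   algbw   busbw #wrong",
    "           8             2     float     sum      -1    239.8    0.00    0.00      4    223.1    0.00    0.00      4",
    "manolinux:21921:21921 [1] NCCL INFO comm 0x563dce3f9c20 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE",
    "     1048576        262144     float     sum      -1    33421    0.03    0.05  458762    33404    0.03    0.05  458174",
]
print(count_failed_measurements(sample))
```

Feeding the full output of one run through the same function (for example by reading a saved log file line by line) gives a per-run count that can be compared across algo/proto combinations.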

Logs for algo=tree, proto=simple. It fails:

(env-torch) mbiswas@manolinux:~/ai/pytorch/nccl-tests$ mpirun -np 2 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_ALGO=TREE -x NCCL_PROTO=SIMPLE ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  21946 on  manolinux device  0 [0x01] NVIDIA RTX A2000
#  Rank  1 Group  0 Pid  21946 on  manolinux device  1 [0x03] Quadro P2200
#  Rank  2 Group  0 Pid  29687 on manolinux1 device  0 [0x0f] Quadro K620
#  Rank  3 Group  0 Pid  29687 on manolinux1 device  1 [0x28] Quadro K620
manolinux:21946:21946 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:21946:21946 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21946:21946 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:29687:29687 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:29687:29687 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:29687:29687 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21946:21955 [0] NCCL INFO NET/IB : No device found.
manolinux:21946:21955 [0] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:21946:21955 [0] NCCL INFO Using non-device net plugin version 0
manolinux:21946:21955 [0] NCCL INFO Using network Socket
manolinux:21946:21956 [1] NCCL INFO Using non-device net plugin version 0
manolinux:21946:21956 [1] NCCL INFO Using network Socket
manolinux1:29687:29695 [0] NCCL INFO NET/IB : No device found.
manolinux1:29687:29695 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:29687:29695 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:29687:29695 [0] NCCL INFO Using network Socket
manolinux1:29687:29696 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:29687:29696 [1] NCCL INFO Using network Socket
manolinux:21946:21956 [1] NCCL INFO comm 0x56276b8faa70 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x63e8b854898fb452 - Init START
manolinux:21946:21955 [0] NCCL INFO comm 0x5627699f0e60 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x63e8b854898fb452 - Init START
manolinux1:29687:29696 [1] NCCL INFO comm 0x560e27b0d810 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x63e8b854898fb452 - Init START
manolinux1:29687:29695 [0] NCCL INFO comm 0x560e278682f0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x63e8b854898fb452 - Init START
manolinux:21946:21956 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:21946:21956 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:21946:21955 [0] NCCL INFO Channel 00/02 :    0   1   2   3
manolinux:21946:21955 [0] NCCL INFO Channel 01/02 :    0   1   2   3
manolinux:21946:21955 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:21946:21955 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:29687:29696 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:29687:29696 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:29687:29695 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:29687:29695 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:21946:21956 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21946:21956 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:29687:29696 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:29687:29696 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21946:21955 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21946:21955 [0] NCCL INFO Connected all rings
manolinux:21946:21955 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:29687:29696 [1] NCCL INFO Connected all rings
manolinux:21946:21956 [1] NCCL INFO Connected all rings
manolinux1:29687:29695 [0] NCCL INFO Connected all rings
manolinux1:29687:29696 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:21946:21956 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:29687:29696 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:21946:21956 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:29687:29695 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Connected all trees
manolinux:21946:21956 [1] NCCL INFO Connected all trees
manolinux:21946:21955 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:21946:21956 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:21946:21956 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:21946:21956 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21946:21956 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21946:21955 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:21946:21955 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21946:21955 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29687:29696 [1] NCCL INFO Connected all trees
manolinux1:29687:29696 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:29687:29695 [0] NCCL INFO Connected all trees
manolinux1:29687:29696 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:29687:29695 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:29687:29695 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:29687:29696 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29687:29696 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29687:29695 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29687:29695 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21946:21956 [1] NCCL INFO comm 0x56276b8faa70 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x63e8b854898fb452 - Init COMPLETE
manolinux:21946:21955 [0] NCCL INFO comm 0x5627699f0e60 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x63e8b854898fb452 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
manolinux1:29687:29695 [0] NCCL INFO comm 0x560e278682f0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x63e8b854898fb452 - Init COMPLETE
manolinux1:29687:29696 [1] NCCL INFO comm 0x560e27b0d810 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x63e8b854898fb452 - Init COMPLETE
           8             2     float     sum      -1    233.9    0.00    0.00      4    225.7    0.00    0.00      4
          16             4     float     sum      -1    220.6    0.00    0.00      8    208.5    0.00    0.00      6
          32             8     float     sum      -1    203.1    0.00    0.00     16    226.6    0.00    0.00     12
          64            16     float     sum      -1    227.3    0.00    0.00     30    233.8    0.00    0.00     24
         128            32     float     sum      -1    212.2    0.00    0.00     54    226.9    0.00    0.00     52
         256            64     float     sum      -1    224.4    0.00    0.00    114    237.3    0.00    0.00    118
         512           128     float     sum      -1    236.2    0.00    0.00    234    237.1    0.00    0.00    230
        1024           256     float     sum      -1    237.6    0.00    0.01    436    239.8    0.00    0.01    452
        2048           512     float     sum      -1    241.9    0.01    0.01    864    248.0    0.01    0.01    906
        4096          1024     float     sum      -1    351.0    0.01    0.02   1818    343.7    0.01    0.02   1768
        8192          2048     float     sum      -1    366.8    0.02    0.03   3616    373.8    0.02    0.03   3612
       16384          4096     float     sum      -1    677.9    0.02    0.04   7232    691.0    0.02    0.04   7206
       32768          8192     float     sum      -1   1004.4    0.03    0.05  14402   1016.8    0.03    0.05  14318
       65536         16384     float     sum      -1   1017.3    0.06    0.10  28600   1043.9    0.06    0.09  28676
      131072         32768     float     sum      -1   1523.9    0.09    0.13  57300   1526.0    0.09    0.13  57328
      262144         65536     float     sum      -1   2816.9    0.09    0.14  114442   2801.1    0.09    0.14  114582
      524288        131072     float     sum      -1   5690.3    0.09    0.14  229370   5948.8    0.09    0.13  229484
     1048576        262144     float     sum      -1    11035    0.10    0.14  458762    11255    0.09    0.14  458174
manolinux:21946:21946 [1] NCCL INFO comm 0x5627699f0e60 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:29687:29687 [1] NCCL INFO comm 0x560e278682f0 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:21946:21946 [1] NCCL INFO comm 0x56276b8faa70 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth    : 0.0445358
#
manolinux1:29687:29687 [1] NCCL INFO comm 0x560e27b0d810 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32802,1],1]
  Exit code:    1
--------------------------------------------------------------------------
manomugdha commented 9 months ago

I changed the CUDA version to 12.2 but I still see the same problem as above. The test runs but fails.
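To make the failure concrete: every row in the table above reports a non-zero `#wrong` count for both the out-of-place and the in-place run. The following is a minimal sketch, not part of nccl-tests, that counts such entries in a pasted log. The helper name `count_failed_entries` is hypothetical, and it assumes the 13-column row layout shown in the output above.

```python
def count_failed_entries(lines):
    """Count out-of-place/in-place results whose #wrong column is non-zero.

    Assumes each result row has 13 whitespace-separated fields:
    size, count, type, redop, root, then (time, algbw, busbw, #wrong) twice.
    """
    failed = 0
    for line in lines:
        fields = line.split()
        # Skip NCCL INFO lines, comment lines, and anything that is not a result row.
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        # Field 8 is the out-of-place #wrong, field 12 the in-place #wrong.
        for wrong in (fields[8], fields[12]):
            # The column may be non-numeric when validation is disabled; ignore those.
            if wrong.isdigit() and int(wrong) > 0:
                failed += 1
    return failed


# Two rows copied from the log above.
rows = [
    "           8             2     float     sum      -1    233.9    0.00    0.00      4    225.7    0.00    0.00      4",
    "          16             4     float     sum      -1    220.6    0.00    0.00      8    208.5    0.00    0.00      6",
]
print(count_failed_entries(rows))
```

Applied to all 18 result rows of the log, this gives 18 × 2 = 36 entries, which is consistent with the `Out of bounds values : 36 FAILED` summary line.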

sjeaugey commented 9 months ago

Ok. This looks badly broken. Unfortunately, this is a weird combination of old GPUs, so we can't really justify spending time debugging this more than I already did.

Maybe you'd be luckier with an older version of NCCL, like 2.8 or even 2.4.

manomugdha commented 9 months ago

If I have the following GPUs on both machines, will that work? Or which GPUs do you recommend? NVIDIA RTX A2000, Quadro P2200

sjeaugey commented 9 months ago

I don't really know. We don't have systems with any of those GPUs to try. In general we'd advise using a single type of GPU; RTX/Quadro cards are not our main focus, as they're not aimed at multi-GPU training.

manomugdha commented 9 months ago

OK, I will try to use a single type of GPU and will let you know.

manomugdha commented 9 months ago

I replaced the K620 cards with Quadro P2200 cards and nccl-tests now runs fine. Can you please point me to documentation for a better understanding of the test results?
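As a rough aid for reading the result table: the nccl-tests performance notes (`doc/PERFORMANCE.md` in the repository) define `algbw` as size divided by time, and for all-reduce `busbw` as `algbw` multiplied by 2(n-1)/n, where n is the number of ranks. The sketch below reproduces the last row of the failed run above (1048576 bytes, 11035 us, 4 ranks); the function name `allreduce_bandwidth` is only illustrative.

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    algbw is bytes per second divided by 1e9.
    busbw applies the all-reduce correction factor 2 * (n - 1) / n.
    """
    seconds = time_us * 1e-6
    algbw = size_bytes / seconds / 1e9
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Last out-of-place row of the table above: size 1048576 B, time 11035 us, 4 ranks.
algbw, busbw = allreduce_bandwidth(1048576, 11035, 4)
print(f"algbw={algbw:.2f} GB/s  busbw={busbw:.2f} GB/s")
```

This prints `algbw=0.10 GB/s  busbw=0.14 GB/s`, matching the values in that row. The `#wrong` column is the number of elements that failed validation, so a correct run should show 0 there.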