Open manomugdha opened 9 months ago
I'd start by running some GPU sanity tests like nvbandwidth: https://github.com/NVIDIA/nvbandwidth
Quadro K620 GPUs are Maxwell generation. You need to recompile the NCCL perf tests with sm_50 support added, otherwise the data verification may fail. The RTX A2000 is sm_86, and the Quadro P2200 is sm_61.
So I would advise recompiling NCCL and the NCCL perf tests from scratch, setting NVCC_GENCODE="-gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86". That way both NCCL and the NCCL perf tests would have support for all 3 types of GPUs.
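As a minimal sketch, the NVCC_GENCODE value can be assembled from the three compute capabilities named above (50, 61, 86). The helper name gencode_flags is hypothetical and not part of the NCCL build system; it only prints the string that would be passed to make.

```shell
#!/bin/sh
# Hypothetical helper: print one -gencode flag per compute capability argument.
gencode_flags() {
    flags=""
    for cc in "$@"; do
        # Append with a separating space only when flags is already non-empty.
        flags="${flags:+$flags }-gencode=arch=compute_${cc},code=sm_${cc}"
    done
    printf '%s\n' "$flags"
}

# Maxwell (K620), Pascal (P2200) and Ampere (RTX A2000), as listed above.
NVCC_GENCODE=$(gencode_flags 50 61 86)
echo "NVCC_GENCODE=\"$NVCC_GENCODE\""
```

The same value would then be given to both builds (NCCL and the NCCL perf tests) after a make clean, so that both contain kernels for every GPU in the job.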
Hi @sjeaugey, thank you for your reply. I compiled both nccl and nccl-tests without specifying any arch and expected that it would include all archs. Now from the log I see it does include arch 86. I have now compiled nccl (after make clean) with the following command, and it went well.
make -j 8 src.build CUDA_HOME=/usr/ NVCC_GENCODE="-gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86"
next i compiled nccl-test (after make clean) with the following command
make MPI=1 NCCL_HOME=/home/mbiswas/ai/pytorch/nccl/build MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/ NVCC_GENCODE="-gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_86,code=sm_86"
and seeing following warning:
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt (target: sm_60)
My NCCL version is 2.19.3. My current MPI version is mpirun (Open MPI) 4.1.2.
I installed MPI using the following command:
sudo apt-get install openmpi-bin openmpi-doc libopenmpi-dev
How do I get the correct MPI version for NCCL 2.19.3?
I'd start by running some GPU sanity tests like nvbandwidth: https://github.com/NVIDIA/nvbandwidth
ok, will check that.
Is there an error when recompiling the NCCL tests? Not sure why the warning happens and why it's mentioning sm_60 .. did you run make clean before recompiling?
No error, only warnings. It shows the warning for all 3 archs.
Linking /home/mbiswas/ai/pytorch/nccl-tests/build/scatter.o > /home/mbiswas/ai/pytorch/nccl-tests/build/scatter_perf
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt (target: sm_50)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt (target: sm_61)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt (target: sm_86)
Yes, I ran make clean before recompiling. It seems that the MPI version is not compatible.
Sorry but I don't see any error in the logs you reported. Are you unable to run the tests? You are mentioning that there is a problem with MPI but I don't see any, only that you could recompile everything just fine.
Hi @sjeaugey, I mentioned two issues at the beginning of this thread.
Can you please shed some light on these?
So are you saying that even after recompiling everything, you are still seeing the same exact problems (out of bound values and launch timeout) under the same conditions?
It isn't clear from your comments, for example this one:
it seems that mpi version is not compatible.
Yes, even after recompiling I am seeing these two issues. During compilation of nccl-tests I am also seeing this warning (nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt). What I wanted to ask is: are these issues happening because of this warning? If so, how do I get rid of it?
what I wanted to say is that are these issues happening because of this warning?
Very likely not.
how to get rid of this warning?
I don't know. I don't remember having seen that error myself.
even after recompiling i am seeing these two issues
That's quite surprising, in particular the data being reported as incorrect is typical of the verification code not being recompiled for sm_50, which is not the case by default on CUDA 11, so we've had lots of similar reports on Kepler/Maxwell architectures which were solved by making sure we'd recompile those kernels with sm_50.
this is CUDA Version: 12.2
following command runs on each node fine.
mpirun -np 1 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2
But when np is set to 2, it waits for some time and then throws the following error:
Test CUDA failure common.cu:291 'the launch timed out and was terminated'
this is CUDA Version: 12.2
That's weird. This is from your log:
NCCL version 2.19.3+cuda11.5
Note the "cuda11.5" in the NCCL version.
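A small sketch of how the two version numbers can be pulled out of an NCCL_DEBUG=INFO log for comparison. The function names nccl_build_cuda and driver_cuda are hypothetical; the sample lines are the ones quoted in this thread. It only surfaces the two numbers and does not decide whether the mismatch is the cause of any failure.

```shell
#!/bin/sh
# Extract the CUDA toolkit version NCCL was built with from a line such as
# "NCCL version 2.19.3+cuda11.5".
nccl_build_cuda() {
    sed -n 's/.*NCCL version [0-9.]*+cuda\([0-9.]*\).*/\1/p' | head -n 1
}

# Convert the integer from "cudaDriverVersion 12020" to major.minor.
# CUDA encodes versions as 1000*major + 10*minor.
driver_cuda() {
    v=$(sed -n 's/.*cudaDriverVersion \([0-9][0-9]*\).*/\1/p' | head -n 1)
    [ -n "$v" ] && echo "$(( v / 1000 )).$(( (v % 1000) / 10 ))"
}

# Sample lines copied from the log in this thread.
log='manolinux:11811:11811 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5'

built=$(printf '%s\n' "$log" | nccl_build_cuda)
driver=$(printf '%s\n' "$log" | driver_cuda)
echo "NCCL built with CUDA $built, driver supports up to CUDA $driver"
```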
the launch timed out and was terminated
I've never seen that before, but it is a CUDA error and I don't see how that could be related to launching on multiple nodes.
following command runs on each node fine.
Did you launch that command from each node? That would be the exact same test. To launch 2 GPUs on the second node, you'd need to run:
mpirun -np 1 -H 10.39.42.196 -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2
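To make the difference between the single-node and two-node launches explicit, here is a sketch that prints (but does not run) the corresponding mpirun lines. The helper mpirun_cmd is hypothetical; it assumes one MPI process per listed host, each driving 2 GPUs through -g 2, as in the commands above.

```shell
#!/bin/sh
# Hosts taken from the thread; the test arguments mirror the command above.
HOSTS="10.39.43.133 10.39.42.196"
TEST_ARGS="./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2"

# Hypothetical helper: print an mpirun line with one process per listed host.
mpirun_cmd() {
    n=$#
    hostlist=$(echo "$*" | tr ' ' ',')
    echo "mpirun -np $n -H $hostlist -x LD_LIBRARY_PATH $TEST_ARGS"
}

# One single-node run per host, then one run spanning all hosts.
for h in $HOSTS; do
    mpirun_cmd "$h"
done
mpirun_cmd $HOSTS
```

With -np 1 and two hosts listed, only one process is started, so the second host would not take part; the last printed line uses -np 2 so that each host gets a process.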
test is running fine on individual node with 2 GPUs.
manolinux:11811:11811 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
Can a mismatch between the CUDA driver version and the CUDA version cause this timeout error? Output of nvidia-smi:
mbiswas@manolinux:~/ai/pytorch/nccl-tests$ nvidia-smi
Tue Nov 21 18:25:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A2000 Off | 00000000:01:00.0 Off | Off |
| 30% 34C P8 11W / 70W | 19MiB / 6138MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Quadro P2200 Off | 00000000:03:00.0 Off | N/A |
| 45% 26C P8 4W / 75W | 6MiB / 5120MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1329 G /usr/lib/xorg/Xorg 10MiB |
| 0 N/A N/A 1490 G /usr/bin/gnome-shell 3MiB |
| 1 N/A N/A 1329 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
mbiswas@manolinux:~/ai/pytorch/nccl-tests$
When I tried to run the test across 2 nodes, I launched the command from a single node.
I see that a few people have faced this timeout issue (on GitHub), but their solutions are not working for me.
Can a mismatch between the CUDA driver version and the CUDA version cause this timeout error?
I'm not sure, but you mentioned you were using CUDA 12.2 and yet the NCCL version which is used mentions 11.5 so maybe you're not using the right NCCL library if multiple versions are installed on your system? Or is it that you recompiled NCCL with CUDA 11.5 but your driver is 12.2?
test is running fine on individual node with 2 GPUs.
That's what I'm not sure about. I would expect the CUDA launch timeout error to also happen when running on a single node. It would be helpful to share the output of both single node runs: the one with the RTX A2000 + P2200 and the one with 2 K620. There may be hints about what is happening when we run on 2 + 2.
Other than that, it could be useful to try to stop Xorg on both systems, as we've seen it interact with CUDA in the past.
I'm not sure, but you mentioned you were using CUDA 12.2 and yet the NCCL version which is used mentions 11.5 so maybe you're not using the right NCCL library if multiple versions are installed on your system? Or is it that you recompiled NCCL with CUDA 11.5 but your driver is 12.2?
Yes, I think mine is the last case, i.e. NCCL is compiled with CUDA 11.5 and the driver is 12.2. I will fix it tomorrow and update you.
That's what I'm not sure about. I would expect the CUDA launch timeout error to also happen when running on a single node. It would be helpful to share the output of both single node runs: the one with the RTX A2000 + P2200 and the one with 2 K620. There may be hints about what is happening when we run on 2 + 2.
logs for RTX A2000 + P2200:
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 2522 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 2522 on manolinux device 1 [0x03] Quadro P2200
manolinux:2522:2522 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:2522:2522 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:2522:2522 [1] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux:2522:2533 [1] NCCL INFO NET/IB : No device found.
manolinux:2522:2533 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:2522:2533 [1] NCCL INFO Using non-device net plugin version 0
manolinux:2522:2533 [1] NCCL INFO Using network Socket
manolinux:2522:2532 [0] NCCL INFO Using non-device net plugin version 0
manolinux:2522:2532 [0] NCCL INFO Using network Socket
manolinux:2522:2533 [1] NCCL INFO comm 0x555d04cb4b70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3000 commId 0x451afe3e3c9aaa07 - Init START
manolinux:2522:2532 [0] NCCL INFO comm 0x555d04cb00e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x451afe3e3c9aaa07 - Init START
manolinux:2522:2532 [0] graph/search.cc:1024 NCCL WARN Could not find a path for pattern 4, falling back to simple order
manolinux:2522:2532 [0] graph/search.cc:1024 NCCL WARN Could not find a path for pattern 1, falling back to simple order
manolinux:2522:2533 [1] graph/search.cc:1024 NCCL WARN Could not find a path for pattern 4, falling back to simple order
manolinux:2522:2533 [1] graph/search.cc:1024 NCCL WARN Could not find a path for pattern 1, falling back to simple order
manolinux:2522:2532 [0] NCCL INFO Channel 00/02 : 0 1
manolinux:2522:2532 [0] NCCL INFO Channel 01/02 : 0 1
manolinux:2522:2533 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:2522:2533 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:2522:2532 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
manolinux:2522:2532 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:2522:2533 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:2522:2533 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:2522:2532 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:2522:2532 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:2522:2533 [1] NCCL INFO Connected all rings
manolinux:2522:2533 [1] NCCL INFO Connected all trees
manolinux:2522:2532 [0] NCCL INFO Connected all rings
manolinux:2522:2533 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux:2522:2533 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:2522:2532 [0] NCCL INFO Connected all trees
manolinux:2522:2532 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux:2522:2532 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:2522:2532 [0] NCCL INFO comm 0x555d04cb00e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x451afe3e3c9aaa07 - Init COMPLETE
manolinux:2522:2533 [1] NCCL INFO comm 0x555d04cb4b70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3000 commId 0x451afe3e3c9aaa07 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 12.01 0.00 0.00 0 12.29 0.00 0.00 0
16 4 float sum -1 12.39 0.00 0.00 0 12.08 0.00 0.00 0
32 8 float sum -1 12.47 0.00 0.00 0 12.52 0.00 0.00 0
64 16 float sum -1 12.70 0.01 0.01 0 12.45 0.01 0.01 0
128 32 float sum -1 12.77 0.01 0.01 0 31.98 0.00 0.00 0
256 64 float sum -1 12.94 0.02 0.02 0 12.88 0.02 0.02 0
512 128 float sum -1 30.54 0.02 0.02 0 30.22 0.02 0.02 0
1024 256 float sum -1 33.85 0.03 0.03 0 33.81 0.03 0.03 0
2048 512 float sum -1 34.56 0.06 0.06 0 34.44 0.06 0.06 0
4096 1024 float sum -1 34.90 0.12 0.12 0 34.75 0.12 0.12 0
8192 2048 float sum -1 38.11 0.21 0.21 0 37.93 0.22 0.22 0
16384 4096 float sum -1 43.94 0.37 0.37 0 43.50 0.38 0.38 0
32768 8192 float sum -1 50.83 0.64 0.64 0 51.17 0.64 0.64 0
65536 16384 float sum -1 69.18 0.95 0.95 0 69.15 0.95 0.95 0
131072 32768 float sum -1 104.6 1.25 1.25 0 103.9 1.26 1.26 0
262144 65536 float sum -1 175.2 1.50 1.50 0 174.6 1.50 1.50 0
524288 131072 float sum -1 319.0 1.64 1.64 0 317.1 1.65 1.65 0
1048576 262144 float sum -1 593.7 1.77 1.77 0 589.8 1.78 1.78 0
manolinux:2522:2522 [1] NCCL INFO comm 0x555d04cb00e0 rank 0 nranks 2 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux:2522:2522 [1] NCCL INFO comm 0x555d04cb4b70 rank 1 nranks 2 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.478737
#
logs for K620:
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 11738 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 1 Group 0 Pid 11738 on manolinux1 device 1 [0x28] Quadro K620
manolinux1:11738:11738 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:11738:11738 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux1:11738:11738 [1] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:11738:11749 [1] NCCL INFO NET/IB : No device found.
manolinux1:11738:11749 [1] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:11738:11749 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:11738:11749 [1] NCCL INFO Using network Socket
manolinux1:11738:11748 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:11738:11748 [0] NCCL INFO Using network Socket
manolinux1:11738:11749 [1] NCCL INFO comm 0x5591693c2090 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 28000 commId 0xb63cd1802bc84a6 - Init START
manolinux1:11738:11748 [0] NCCL INFO comm 0x5591693bd8e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId f000 commId 0xb63cd1802bc84a6 - Init START
manolinux1:11738:11748 [0] NCCL INFO Channel 00/02 : 0 1
manolinux1:11738:11748 [0] NCCL INFO Channel 01/02 : 0 1
manolinux1:11738:11749 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux1:11738:11748 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
manolinux1:11738:11748 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:11738:11749 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:11738:11748 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:11738:11748 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:11738:11749 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:11738:11749 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:11738:11748 [0] NCCL INFO Connected all rings
manolinux1:11738:11748 [0] NCCL INFO Connected all trees
manolinux1:11738:11749 [1] NCCL INFO Connected all rings
manolinux1:11738:11749 [1] NCCL INFO Connected all trees
manolinux1:11738:11749 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux1:11738:11749 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:11738:11748 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
manolinux1:11738:11748 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:11738:11749 [1] NCCL INFO comm 0x5591693c2090 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 28000 commId 0xb63cd1802bc84a6 - Init COMPLETE
manolinux1:11738:11748 [0] NCCL INFO comm 0x5591693bd8e0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId f000 commId 0xb63cd1802bc84a6 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 18.15 0.00 0.00 0 13.67 0.00 0.00 0
16 4 float sum -1 14.31 0.00 0.00 0 13.57 0.00 0.00 0
32 8 float sum -1 14.32 0.00 0.00 0 13.73 0.00 0.00 0
64 16 float sum -1 14.30 0.00 0.00 0 13.57 0.00 0.00 0
128 32 float sum -1 14.14 0.01 0.01 0 13.65 0.01 0.01 0
256 64 float sum -1 14.28 0.02 0.02 0 13.66 0.02 0.02 0
512 128 float sum -1 15.46 0.03 0.03 0 13.91 0.04 0.04 0
1024 256 float sum -1 16.11 0.06 0.06 0 19.46 0.05 0.05 0
2048 512 float sum -1 22.26 0.09 0.09 0 21.69 0.09 0.09 0
4096 1024 float sum -1 23.21 0.18 0.18 0 22.84 0.18 0.18 0
8192 2048 float sum -1 33.73 0.24 0.24 0 29.53 0.28 0.28 0
16384 4096 float sum -1 41.23 0.40 0.40 0 39.02 0.42 0.42 0
32768 8192 float sum -1 51.25 0.64 0.64 0 51.03 0.64 0.64 0
65536 16384 float sum -1 68.99 0.95 0.95 0 71.94 0.91 0.91 0
131072 32768 float sum -1 103.1 1.27 1.27 0 102.5 1.28 1.28 0
262144 65536 float sum -1 165.8 1.58 1.58 0 165.7 1.58 1.58 0
524288 131072 float sum -1 303.9 1.73 1.73 0 300.0 1.75 1.75 0
1048576 262144 float sum -1 572.9 1.83 1.83 0 581.4 1.80 1.80 0
manolinux1:11738:11738 [1] NCCL INFO comm 0x5591693bd8e0 rank 0 nranks 2 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux1:11738:11738 [1] NCCL INFO comm 0x5591693c2090 rank 1 nranks 2 cudaDev 1 busId 28000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.502758
#
Other than that, it could be useful to try to stop Xorg on both systems, as we've seen it interact with CUDA in the past.
I stopped Xorg on both machines. Now it is stuck with the following logs:
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 2572 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 2572 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 11844 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 11844 on manolinux1 device 1 [0x28] Quadro K620
manolinux:2572:2572 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:2572:2572 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:2572:2572 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:11844:11844 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:11844:11844 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:11844:11844 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:2572:2582 [1] NCCL INFO NET/IB : No device found.
manolinux:2572:2582 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:2572:2582 [1] NCCL INFO Using non-device net plugin version 0
manolinux:2572:2582 [1] NCCL INFO Using network Socket
manolinux:2572:2581 [0] NCCL INFO Using non-device net plugin version 0
manolinux:2572:2581 [0] NCCL INFO Using network Socket
manolinux1:11844:11852 [0] NCCL INFO NET/IB : No device found.
manolinux1:11844:11852 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:11844:11852 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:11844:11852 [0] NCCL INFO Using network Socket
manolinux1:11844:11853 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:11844:11853 [1] NCCL INFO Using network Socket
manolinux:2572:2582 [1] NCCL INFO comm 0x5643b1ab6760 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xcd29dd667b81b5b4 - Init START
manolinux:2572:2581 [0] NCCL INFO comm 0x5643afbacb50 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xcd29dd667b81b5b4 - Init START
manolinux1:11844:11853 [1] NCCL INFO comm 0x564b59a03620 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xcd29dd667b81b5b4 - Init START
manolinux1:11844:11852 [0] NCCL INFO comm 0x564b5975e100 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xcd29dd667b81b5b4 - Init START
manolinux1:11844:11853 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:11844:11853 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:2572:2582 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:2572:2582 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:2572:2581 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:2572:2581 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:2572:2581 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:2572:2581 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:11844:11852 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:11844:11852 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:2572:2582 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:2572:2582 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:11844:11852 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:11844:11853 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:2572:2581 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:11844:11853 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Connected all rings
manolinux:2572:2582 [1] NCCL INFO Connected all rings
manolinux1:11844:11853 [1] NCCL INFO Connected all rings
manolinux1:11844:11852 [0] NCCL INFO Connected all rings
manolinux:2572:2582 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:11844:11853 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:2572:2582 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:11844:11853 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:11844:11852 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:11844:11852 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:2572:2581 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:2572:2582 [1] NCCL INFO Connected all trees
manolinux:2572:2581 [0] NCCL INFO Connected all trees
manolinux:2572:2582 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:2572:2582 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:2572:2581 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:2572:2581 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:11844:11853 [1] NCCL INFO Connected all trees
manolinux1:11844:11853 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:11844:11852 [0] NCCL INFO Connected all trees
manolinux1:11844:11853 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:11844:11852 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:11844:11852 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:2572:2581 [0] NCCL INFO comm 0x5643afbacb50 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xcd29dd667b81b5b4 - Init COMPLETE
manolinux:2572:2582 [1] NCCL INFO comm 0x5643b1ab6760 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xcd29dd667b81b5b4 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
manolinux1:11844:11853 [1] NCCL INFO comm 0x564b59a03620 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xcd29dd667b81b5b4 - Init COMPLETE
manolinux1:11844:11852 [0] NCCL INFO comm 0x564b5975e100 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xcd29dd667b81b5b4 - Init COMPLETE
8 2 float sum -1 178.8 0.00 0.00 4 209.0 0.00 0.00 4
16 4 float sum -1 209.1 0.00 0.00 8 207.6 0.00 0.00 6
32 8 float sum -1 210.3 0.00 0.00 16 212.1 0.00 0.00 12
64 16 float sum -1 211.8 0.00 0.00 30 210.6 0.00 0.00 24
128 32 float sum -1 212.3 0.00 0.00 54 212.9 0.00 0.00 52
256 64 float sum -1 189.8 0.00 0.00 114 196.0 0.00 0.00 118
512 128 float sum -1 209.6 0.00 0.00 234 212.6 0.00 0.00 230
1024 256 float sum -1 232.5 0.00 0.01 436 227.2 0.00 0.01 452
2048 512 float sum -1 294.7 0.01 0.01 864 303.3 0.01 0.01 906
4096 1024 float sum -1 349.4 0.01 0.02 1818 311.4 0.01 0.02 1768
8192 2048 float sum -1 373.0 0.02 0.03 3616 372.4 0.02 0.03 3612
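In the 2 + 2 run above, every result row reports a non-zero #wrong count, while the single-node runs report 0 throughout. A sketch for picking such rows out of a long log, assuming the 13-column row layout shown in these logs (the function name wrong_rows is hypothetical):

```shell
#!/bin/sh
# Print rows of nccl-tests output whose out-of-place (field 9) or in-place
# (field 13) #wrong count is non-zero. Log and comment lines are skipped
# because they do not have 13 fields starting with a plain integer.
wrong_rows() {
    awk 'NF == 13 && $1 ~ /^[0-9]+$/ && ($9 != 0 || $13 != 0) {
             printf "size=%s wrong_out=%s wrong_in=%s\n", $1, $9, $13
         }'
}

# Two sample rows copied from the logs in this thread: one clean, one failing.
printf '%s\n' \
  '8 2 float sum -1 12.01 0.00 0.00 0 12.29 0.00 0.00 0' \
  '8 2 float sum -1 178.8 0.00 0.00 4 209.0 0.00 0.00 4' | wrong_rows
```

The output of a test run can be piped into wrong_rows directly; an empty result means no row reported corrupted data.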
Thanks for bearing with me. There is progress.
Could you try the following 4 combinations:
NCCL_ALGO=RING / NCCL_ALGO=TREE × NCCL_PROTO=SIMPLE / NCCL_PROTO=LL
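The four combinations can be enumerated with a short loop. This is only a sketch: the function name combos is hypothetical, the mpirun lines are printed rather than executed, and the host list and test arguments are the ones used earlier in the thread. Passing a variable as -x NAME=value is Open MPI's way of exporting it to all ranks.

```shell
#!/bin/sh
# Print the four NCCL_ALGO x NCCL_PROTO settings, one pair per line.
combos() {
    for algo in RING TREE; do
        for proto in SIMPLE LL; do
            echo "NCCL_ALGO=$algo NCCL_PROTO=$proto"
        done
    done
}

# Print one mpirun line per combination; nothing is launched here.
combos | while read -r algo proto; do
    echo "mpirun -np 2 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH -x $algo -x $proto ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2"
done
```

Running the combinations one at a time and noting which ones report wrong values or hang narrows the problem down to a specific algorithm or protocol.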
Logs for NCCL_ALGO=RING. It does not get stuck, but it fails:
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3283 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 3283 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 12203 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 12203 on manolinux1 device 1 [0x28] Quadro K620
manolinux:3283:3283 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3283:3283 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3283:3283 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12203:12203 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12203:12203 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12203:12203 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3283:3293 [1] NCCL INFO NET/IB : No device found.
manolinux:3283:3293 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3283:3293 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3283:3293 [1] NCCL INFO Using network Socket
manolinux:3283:3292 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3283:3292 [0] NCCL INFO Using network Socket
manolinux1:12203:12211 [0] NCCL INFO NET/IB : No device found.
manolinux1:12203:12211 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12203:12211 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12203:12211 [0] NCCL INFO Using network Socket
manolinux1:12203:12212 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12203:12212 [1] NCCL INFO Using network Socket
manolinux:3283:3293 [1] NCCL INFO comm 0x55be817d35b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x4eab9cb0e8af46b7 - Init START
manolinux:3283:3292 [0] NCCL INFO comm 0x55be7f8c99a0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4eab9cb0e8af46b7 - Init START
manolinux1:12203:12212 [1] NCCL INFO comm 0x55f023bb4640 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x4eab9cb0e8af46b7 - Init START
manolinux1:12203:12211 [0] NCCL INFO comm 0x55f02390f120 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x4eab9cb0e8af46b7 - Init START
manolinux:3283:3293 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3283:3293 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3283:3292 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:3283:3292 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:3283:3292 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3283:3292 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12203:12212 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12203:12212 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12203:12211 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12203:12211 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3283:3293 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3283:3293 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:12203:12212 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12203:12212 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12203:12211 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:3283:3292 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3283:3292 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3283:3293 [1] NCCL INFO Connected all rings
manolinux:3283:3293 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3283:3293 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3283:3292 [0] NCCL INFO Connected all rings
manolinux1:12203:12211 [0] NCCL INFO Connected all rings
manolinux1:12203:12212 [1] NCCL INFO Connected all rings
manolinux1:12203:12212 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12203:12212 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:3283:3292 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3283:3292 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12203:12211 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:3283:3293 [1] NCCL INFO Connected all trees
manolinux:3283:3292 [0] NCCL INFO Connected all trees
manolinux:3283:3293 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:3283:3292 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:3283:3293 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3283:3293 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3283:3292 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3283:3292 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12203:12212 [1] NCCL INFO Connected all trees
manolinux1:12203:12211 [0] NCCL INFO Connected all trees
manolinux1:12203:12211 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:12203:12212 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:12203:12212 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12203:12212 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12203:12211 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12203:12211 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3283:3292 [0] NCCL INFO comm 0x55be7f8c99a0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4eab9cb0e8af46b7 - Init COMPLETE
manolinux:3283:3293 [1] NCCL INFO comm 0x55be817d35b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x4eab9cb0e8af46b7 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
manolinux1:12203:12212 [1] NCCL INFO comm 0x55f023bb4640 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x4eab9cb0e8af46b7 - Init COMPLETE
manolinux1:12203:12211 [0] NCCL INFO comm 0x55f02390f120 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x4eab9cb0e8af46b7 - Init COMPLETE
8 2 float sum -1 214.8 0.00 0.00 4 232.3 0.00 0.00 4
16 4 float sum -1 222.7 0.00 0.00 8 227.4 0.00 0.00 6
32 8 float sum -1 222.6 0.00 0.00 16 225.2 0.00 0.00 12
64 16 float sum -1 226.2 0.00 0.00 30 230.4 0.00 0.00 24
128 32 float sum -1 246.3 0.00 0.00 54 262.5 0.00 0.00 52
256 64 float sum -1 265.7 0.00 0.00 114 227.0 0.00 0.00 118
512 128 float sum -1 272.8 0.00 0.00 234 265.8 0.00 0.00 230
1024 256 float sum -1 345.4 0.00 0.00 436 271.6 0.00 0.01 452
2048 512 float sum -1 329.0 0.01 0.01 864 407.2 0.01 0.01 906
4096 1024 float sum -1 454.7 0.01 0.01 1818 403.1 0.01 0.02 1768
8192 2048 float sum -1 419.3 0.02 0.03 3616 413.1 0.02 0.03 3612
16384 4096 float sum -1 463.6 0.04 0.05 7232 457.8 0.04 0.05 7206
32768 8192 float sum -1 686.1 0.05 0.07 14402 693.6 0.05 0.07 14318
65536 16384 float sum -1 1221.9 0.05 0.08 28600 1233.9 0.05 0.08 28676
131072 32768 float sum -1 2280.2 0.06 0.09 57300 2330.5 0.06 0.08 57328
262144 65536 float sum -1 4332.0 0.06 0.09 114442 4368.4 0.06 0.09 114582
524288 131072 float sum -1 8176.5 0.06 0.10 229370 8187.9 0.06 0.10 229484
1048576 262144 float sum -1 16256 0.06 0.10 458762 16242 0.06 0.10 458174
manolinux:3283:3283 [1] NCCL INFO comm 0x55be7f8c99a0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:12203:12203 [1] NCCL INFO comm 0x55f02390f120 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:3283:3283 [1] NCCL INFO comm 0x55be817d35b0 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth : 0.0353681
#
manolinux1:12203:12203 [1] NCCL INFO comm 0x55f023bb4640 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
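Side note for anyone comparing these runs: the `#wrong` columns are easy to lose among the interleaved NCCL INFO lines. Here is a minimal sketch that tallies them from a saved log. The function name `count_wrong` and the rule used to spot a data row (13 whitespace-separated fields, the first two being integers) are assumptions based on the output format shown above, not part of nccl-tests.

```shell
#!/usr/bin/env bash
# count_wrong: read nccl-tests output on stdin and report how many
# (size, mode) results carry a non-zero "#wrong" count.
# Assumption: a data row has exactly 13 fields and starts with two integers
# (size in bytes, element count); everything else is header or NCCL INFO.
count_wrong() {
  awk '
    NF == 13 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
      rows++
      if ($9  + 0 > 0) bad++   # out-of-place #wrong
      if ($13 + 0 > 0) bad++   # in-place #wrong
    }
    END { printf "rows=%d failing_results=%d\n", rows, bad }
  '
}
```

Saving the ring output above to a file (say `ring.log`, a hypothetical name) and running `count_wrong < ring.log` should print `rows=18 failing_results=36`. That agrees with the `Out of bounds values : 36` summary line: every message size fails validation, both out-of-place and in-place.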
Logs for algo=tree. It does not hang, but validation fails.
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3319 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 3319 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 12260 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 12260 on manolinux1 device 1 [0x28] Quadro K620
manolinux:3319:3319 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3319:3319 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3319:3319 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12260:12260 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12260:12260 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12260:12260 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3319:3329 [1] NCCL INFO NET/IB : No device found.
manolinux:3319:3329 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3319:3329 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3319:3329 [1] NCCL INFO Using network Socket
manolinux:3319:3328 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3319:3328 [0] NCCL INFO Using network Socket
manolinux1:12260:12268 [0] NCCL INFO NET/IB : No device found.
manolinux1:12260:12268 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12260:12268 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12260:12268 [0] NCCL INFO Using network Socket
manolinux1:12260:12269 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12260:12269 [1] NCCL INFO Using network Socket
manolinux:3319:3328 [0] NCCL INFO comm 0x5647459c65c0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x540a9c130dcc7f4e - Init START
manolinux:3319:3329 [1] NCCL INFO comm 0x5647478d01d0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x540a9c130dcc7f4e - Init START
manolinux1:12260:12268 [0] NCCL INFO comm 0x5604044d60d0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x540a9c130dcc7f4e - Init START
manolinux1:12260:12269 [1] NCCL INFO comm 0x56040477b5f0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x540a9c130dcc7f4e - Init START
manolinux:3319:3329 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3319:3329 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3319:3328 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:3319:3328 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:3319:3328 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3319:3328 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12260:12269 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12260:12269 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12260:12268 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12260:12268 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3319:3328 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3319:3328 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3319:3329 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3319:3329 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12260:12269 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12260:12269 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12260:12268 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:3319:3329 [1] NCCL INFO Connected all rings
manolinux:3319:3329 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3319:3329 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3319:3328 [0] NCCL INFO Connected all rings
manolinux1:12260:12268 [0] NCCL INFO Connected all rings
manolinux1:12260:12269 [1] NCCL INFO Connected all rings
manolinux1:12260:12269 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12260:12269 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12260:12268 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12260:12268 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3319:3328 [0] NCCL INFO Connected all trees
manolinux:3319:3328 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:3319:3328 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3319:3328 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12260:12269 [1] NCCL INFO Connected all trees
manolinux1:12260:12269 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:12260:12268 [0] NCCL INFO Connected all trees
manolinux1:12260:12269 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12260:12269 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12260:12268 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:12260:12268 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12260:12268 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3319:3329 [1] NCCL INFO Connected all trees
manolinux:3319:3329 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:3319:3329 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3319:3329 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3319:3329 [1] NCCL INFO comm 0x5647478d01d0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x540a9c130dcc7f4e - Init COMPLETE
manolinux:3319:3328 [0] NCCL INFO comm 0x5647459c65c0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x540a9c130dcc7f4e - Init COMPLETE
#
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
manolinux1:12260:12268 [0] NCCL INFO comm 0x5604044d60d0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x540a9c130dcc7f4e - Init COMPLETE
manolinux1:12260:12269 [1] NCCL INFO comm 0x56040477b5f0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x540a9c130dcc7f4e - Init COMPLETE
8 2 float sum -1 205.0 0.00 0.00 4 209.6 0.00 0.00 4
16 4 float sum -1 208.4 0.00 0.00 8 207.6 0.00 0.00 6
32 8 float sum -1 205.6 0.00 0.00 16 213.7 0.00 0.00 12
64 16 float sum -1 212.3 0.00 0.00 30 201.6 0.00 0.00 24
128 32 float sum -1 191.6 0.00 0.00 54 195.9 0.00 0.00 52
256 64 float sum -1 215.2 0.00 0.00 114 214.2 0.00 0.00 118
512 128 float sum -1 214.9 0.00 0.00 234 214.0 0.00 0.00 230
1024 256 float sum -1 225.6 0.00 0.01 436 232.4 0.00 0.01 452
2048 512 float sum -1 291.4 0.01 0.01 864 300.9 0.01 0.01 906
4096 1024 float sum -1 319.9 0.01 0.02 1818 320.4 0.01 0.02 1768
8192 2048 float sum -1 374.4 0.02 0.03 3616 365.0 0.02 0.03 3612
16384 4096 float sum -1 560.4 0.03 0.04 7232 565.8 0.03 0.04 7206
32768 8192 float sum -1 997.0 0.03 0.05 14402 1009.9 0.03 0.05 14318
65536 16384 float sum -1 1022.6 0.06 0.10 28600 1052.1 0.06 0.09 28676
131072 32768 float sum -1 1547.4 0.08 0.13 57300 1532.8 0.09 0.13 57328
262144 65536 float sum -1 2802.7 0.09 0.14 114442 2804.8 0.09 0.14 114582
524288 131072 float sum -1 5502.9 0.10 0.14 229370 5474.3 0.10 0.14 229484
1048576 262144 float sum -1 11057 0.09 0.14 458762 11214 0.09 0.14 458174
manolinux:3319:3319 [1] NCCL INFO comm 0x5647459c65c0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:12260:12260 [1] NCCL INFO comm 0x5604044d60d0 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:3319:3319 [1] NCCL INFO comm 0x5647478d01d0 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth : 0.0453675
#
manolinux1:12260:12260 [1] NCCL INFO comm 0x56040477b5f0 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
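For reference, the `algbw` and `busbw` columns can be recomputed from the size and time columns. The sketch below assumes this is the all-reduce test (the logs do not name the binary, but the busbw/algbw ratio of about 1.5 with 4 ranks points that way) and uses the all-reduce correction factor 2*(n-1)/n described in the nccl-tests performance notes. The helper name `allreduce_bw` is made up for illustration.

```shell
#!/usr/bin/env bash
# allreduce_bw SIZE_BYTES TIME_US NRANKS
# Prints algbw and busbw in GB/s (1 GB = 1e9 bytes).
# algbw = bytes / seconds; busbw = algbw * 2*(n-1)/n (all-reduce only).
allreduce_bw() {
  awk -v size="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = size / (us * 1e-6) / 1e9
    busbw = algbw * 2 * (n - 1) / n
    printf "algbw=%.2f busbw=%.2f\n", algbw, busbw
  }'
}
```

With the last out-of-place row of the tree run above, `allreduce_bw 1048576 11057 4` gives `algbw=0.09 busbw=0.14`, the same values the test printed. So the timing columns are self-consistent even though the data validation fails.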
Logs for proto=simple. It hangs.
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3363 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 3363 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 12319 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 12319 on manolinux1 device 1 [0x28] Quadro K620
manolinux:3363:3363 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3363:3363 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3363:3363 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12319:12319 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12319:12319 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12319:12319 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3363:3373 [1] NCCL INFO NET/IB : No device found.
manolinux:3363:3373 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3363:3373 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3363:3373 [1] NCCL INFO Using network Socket
manolinux:3363:3372 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3363:3372 [0] NCCL INFO Using network Socket
manolinux1:12319:12327 [0] NCCL INFO NET/IB : No device found.
manolinux1:12319:12327 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12319:12327 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12319:12327 [0] NCCL INFO Using network Socket
manolinux1:12319:12328 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12319:12328 [1] NCCL INFO Using network Socket
manolinux:3363:3373 [1] NCCL INFO comm 0x55e47b8eb7d0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x5d9cf78374ef3dbb - Init START
manolinux:3363:3372 [0] NCCL INFO comm 0x55e4799e1bb0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x5d9cf78374ef3dbb - Init START
manolinux1:12319:12328 [1] NCCL INFO comm 0x55811657f2a0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x5d9cf78374ef3dbb - Init START
manolinux1:12319:12327 [0] NCCL INFO comm 0x5581162d9d80 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x5d9cf78374ef3dbb - Init START
manolinux:3363:3373 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3363:3373 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3363:3372 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:3363:3372 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:3363:3372 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3363:3372 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12319:12328 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12319:12328 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12319:12327 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12319:12327 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3363:3372 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3363:3372 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3363:3373 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:12319:12328 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12319:12328 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:3363:3373 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12319:12327 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:3363:3373 [1] NCCL INFO Connected all rings
manolinux:3363:3373 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3363:3373 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3363:3372 [0] NCCL INFO Connected all rings
manolinux1:12319:12328 [1] NCCL INFO Connected all rings
manolinux1:12319:12328 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12319:12327 [0] NCCL INFO Connected all rings
manolinux1:12319:12328 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:3363:3372 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3363:3372 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12319:12327 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12319:12328 [1] NCCL INFO Connected all trees
manolinux1:12319:12328 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:12319:12327 [0] NCCL INFO Connected all trees
manolinux1:12319:12328 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12319:12327 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:12319:12328 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12319:12327 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12319:12327 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3363:3372 [0] NCCL INFO Connected all trees
manolinux:3363:3373 [1] NCCL INFO Connected all trees
manolinux:3363:3372 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:3363:3373 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:3363:3373 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3363:3373 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3363:3372 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3363:3372 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12319:12327 [0] NCCL INFO comm 0x5581162d9d80 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x5d9cf78374ef3dbb - Init COMPLETE
manolinux1:12319:12328 [1] NCCL INFO comm 0x55811657f2a0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x5d9cf78374ef3dbb - Init COMPLETE
manolinux:3363:3372 [0] NCCL INFO comm 0x55e4799e1bb0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x5d9cf78374ef3dbb - Init COMPLETE
manolinux:3363:3373 [1] NCCL INFO comm 0x55e47b8eb7d0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x5d9cf78374ef3dbb - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 333.8 0.00 0.00 4 342.7 0.00 0.00 4
16 4 float sum -1 365.2 0.00 0.00 8 420.7 0.00 0.00 6
32 8 float sum -1 420.3 0.00 0.00 16 302.1 0.00 0.00 12
64 16 float sum -1 297.3 0.00 0.00 30 298.8 0.00 0.00 24
128 32 float sum -1 301.5 0.00 0.00 54 307.1 0.00 0.00 52
256 64 float sum -1 295.8 0.00 0.00 114 314.2 0.00 0.00 118
Logs for proto=LL. It does not hang, but it fails.
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3390 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 3390 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 12379 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 12379 on manolinux1 device 1 [0x28] Quadro K620
manolinux:3390:3390 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3390:3390 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3390:3390 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12379:12379 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12379:12379 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12379:12379 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3390:3400 [1] NCCL INFO NET/IB : No device found.
manolinux:3390:3400 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3390:3400 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3390:3400 [1] NCCL INFO Using network Socket
manolinux:3390:3399 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3390:3399 [0] NCCL INFO Using network Socket
manolinux1:12379:12387 [0] NCCL INFO NET/IB : No device found.
manolinux1:12379:12387 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12379:12387 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12379:12387 [0] NCCL INFO Using network Socket
manolinux1:12379:12388 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12379:12388 [1] NCCL INFO Using network Socket
manolinux:3390:3400 [1] NCCL INFO comm 0x55c8e8ebe540 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x4c2ae6bbb2819870 - Init START
manolinux:3390:3399 [0] NCCL INFO comm 0x55c8e6fb4930 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4c2ae6bbb2819870 - Init START
manolinux1:12379:12388 [1] NCCL INFO comm 0x558cd42a7650 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x4c2ae6bbb2819870 - Init START
manolinux1:12379:12387 [0] NCCL INFO comm 0x558cd4002130 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x4c2ae6bbb2819870 - Init START
manolinux:3390:3400 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3390:3400 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3390:3399 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:3390:3399 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:3390:3399 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3390:3399 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12379:12388 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12379:12388 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12379:12387 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12379:12387 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3390:3400 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3390:3400 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3390:3399 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:12379:12388 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12379:12387 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12379:12388 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Connected all rings
manolinux:3390:3399 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:12379:12388 [1] NCCL INFO Connected all rings
manolinux:3390:3400 [1] NCCL INFO Connected all rings
manolinux:3390:3400 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:12379:12388 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12379:12387 [0] NCCL INFO Connected all rings
manolinux:3390:3400 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:12379:12387 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12379:12387 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12379:12388 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12379:12387 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3390:3399 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:12379:12388 [1] NCCL INFO Connected all trees
manolinux1:12379:12387 [0] NCCL INFO Connected all trees
manolinux1:12379:12388 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:12379:12387 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:12379:12387 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12379:12387 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12379:12388 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12379:12388 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3390:3399 [0] NCCL INFO Connected all trees
manolinux:3390:3399 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:3390:3400 [1] NCCL INFO Connected all trees
manolinux:3390:3400 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:3390:3400 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3390:3400 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3390:3399 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3390:3399 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12379:12387 [0] NCCL INFO comm 0x558cd4002130 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x4c2ae6bbb2819870 - Init COMPLETE
manolinux1:12379:12388 [1] NCCL INFO comm 0x558cd42a7650 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x4c2ae6bbb2819870 - Init COMPLETE
manolinux:3390:3400 [1] NCCL INFO comm 0x55c8e8ebe540 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x4c2ae6bbb2819870 - Init COMPLETE
manolinux:3390:3399 [0] NCCL INFO comm 0x55c8e6fb4930 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x4c2ae6bbb2819870 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 212.3 0.00 0.00 4 212.1 0.00 0.00 4
16 4 float sum -1 206.9 0.00 0.00 8 214.2 0.00 0.00 6
32 8 float sum -1 213.0 0.00 0.00 16 209.9 0.00 0.00 12
64 16 float sum -1 212.5 0.00 0.00 30 211.8 0.00 0.00 24
128 32 float sum -1 201.7 0.00 0.00 54 195.8 0.00 0.00 52
256 64 float sum -1 189.4 0.00 0.00 114 197.8 0.00 0.00 118
512 128 float sum -1 209.0 0.00 0.00 234 209.9 0.00 0.00 230
1024 256 float sum -1 222.1 0.00 0.01 436 224.6 0.00 0.01 452
2048 512 float sum -1 291.8 0.01 0.01 864 285.6 0.01 0.01 906
4096 1024 float sum -1 362.3 0.01 0.02 1818 365.1 0.01 0.02 1768
8192 2048 float sum -1 371.6 0.02 0.03 3616 363.1 0.02 0.03 3612
16384 4096 float sum -1 707.5 0.02 0.03 7232 685.3 0.02 0.04 7206
32768 8192 float sum -1 1016.1 0.03 0.05 14402 1028.8 0.03 0.05 14318
65536 16384 float sum -1 2208.9 0.03 0.04 28600 2216.8 0.03 0.04 28676
131072 32768 float sum -1 4245.8 0.03 0.05 57300 4235.7 0.03 0.05 57328
262144 65536 float sum -1 8390.5 0.03 0.05 114442 8389.3 0.03 0.05 114582
524288 131072 float sum -1 16733 0.03 0.05 229370 16727 0.03 0.05 229484
1048576 262144 float sum -1 33434 0.03 0.05 458762 33450 0.03 0.05 458174
manolinux:3390:3390 [1] NCCL INFO comm 0x55c8e6fb4930 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:12379:12379 [1] NCCL INFO comm 0x558cd4002130 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:3390:3390 [1] NCCL INFO comm 0x55c8e8ebe540 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth : 0.0216889
#
manolinux1:12379:12379 [1] NCCL INFO comm 0x558cd42a7650 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55462,1],1]
Exit code: 1
--------------------------------------------------------------------------
Logs for algo=tree and proto=LL. It fails.
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3417 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 3417 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 12436 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 12436 on manolinux1 device 1 [0x28] Quadro K620
manolinux:3417:3417 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:3417:3417 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3417:3417 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:12436:12436 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:12436:12436 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:12436:12436 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:3417:3427 [1] NCCL INFO NET/IB : No device found.
manolinux:3417:3427 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:3417:3427 [1] NCCL INFO Using non-device net plugin version 0
manolinux:3417:3426 [0] NCCL INFO Using non-device net plugin version 0
manolinux:3417:3426 [0] NCCL INFO Using network Socket
manolinux:3417:3427 [1] NCCL INFO Using network Socket
manolinux1:12436:12444 [0] NCCL INFO NET/IB : No device found.
manolinux1:12436:12444 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:12436:12444 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:12436:12444 [0] NCCL INFO Using network Socket
manolinux1:12436:12445 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:12436:12445 [1] NCCL INFO Using network Socket
manolinux:3417:3427 [1] NCCL INFO comm 0x557f78b5b7c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xf687b4268f07ea8f - Init START
manolinux:3417:3426 [0] NCCL INFO comm 0x557f76c51ba0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xf687b4268f07ea8f - Init START
manolinux1:12436:12444 [0] NCCL INFO comm 0x5623c4487dc0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xf687b4268f07ea8f - Init START
manolinux1:12436:12445 [1] NCCL INFO comm 0x5623c472d2e0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xf687b4268f07ea8f - Init START
manolinux:3417:3426 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:3417:3426 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:3417:3426 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:3417:3426 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:12436:12445 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:12436:12445 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:3417:3427 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:3417:3427 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:12436:12444 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:12436:12444 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:3417:3427 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3417:3427 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:3417:3426 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:12436:12444 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12436:12445 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12436:12445 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:12436:12444 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:3417:3426 [0] NCCL INFO Connected all rings
manolinux:3417:3426 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3417:3427 [1] NCCL INFO Connected all rings
manolinux1:12436:12444 [0] NCCL INFO Connected all rings
manolinux1:12436:12445 [1] NCCL INFO Connected all rings
manolinux:3417:3427 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:3417:3427 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:12436:12445 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:12436:12445 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:3417:3426 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:12436:12444 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:3417:3426 [0] NCCL INFO Connected all trees
manolinux:3417:3427 [1] NCCL INFO Connected all trees
manolinux:3417:3427 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:3417:3426 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:3417:3427 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:3417:3426 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:3417:3427 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3417:3427 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3417:3426 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:3417:3426 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12436:12444 [0] NCCL INFO Connected all trees
manolinux1:12436:12444 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:12436:12444 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:12436:12444 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12436:12444 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:12436:12445 [1] NCCL INFO Connected all trees
manolinux1:12436:12445 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:12436:12445 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:12436:12445 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:12436:12445 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:3417:3426 [0] NCCL INFO comm 0x557f76c51ba0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xf687b4268f07ea8f - Init COMPLETE
manolinux:3417:3427 [1] NCCL INFO comm 0x557f78b5b7c0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xf687b4268f07ea8f - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
manolinux1:12436:12444 [0] NCCL INFO comm 0x5623c4487dc0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xf687b4268f07ea8f - Init COMPLETE
manolinux1:12436:12445 [1] NCCL INFO comm 0x5623c472d2e0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xf687b4268f07ea8f - Init COMPLETE
8 2 float sum -1 155.1 0.00 0.00 4 151.4 0.00 0.00 4
16 4 float sum -1 156.3 0.00 0.00 8 155.7 0.00 0.00 6
32 8 float sum -1 156.2 0.00 0.00 16 201.1 0.00 0.00 12
64 16 float sum -1 210.2 0.00 0.00 30 210.7 0.00 0.00 24
128 32 float sum -1 213.8 0.00 0.00 54 214.4 0.00 0.00 52
256 64 float sum -1 216.0 0.00 0.00 114 210.2 0.00 0.00 118
512 128 float sum -1 215.7 0.00 0.00 234 216.3 0.00 0.00 230
1024 256 float sum -1 229.2 0.00 0.01 436 227.4 0.00 0.01 452
2048 512 float sum -1 294.9 0.01 0.01 864 303.7 0.01 0.01 906
4096 1024 float sum -1 358.7 0.01 0.02 1818 353.3 0.01 0.02 1768
8192 2048 float sum -1 463.9 0.02 0.03 3616 392.4 0.02 0.03 3612
16384 4096 float sum -1 689.8 0.02 0.04 7232 705.3 0.02 0.03 7206
32768 8192 float sum -1 1017.6 0.03 0.05 14402 1029.3 0.03 0.05 14318
65536 16384 float sum -1 1749.4 0.04 0.06 28600 1735.8 0.04 0.06 28676
131072 32768 float sum -1 3062.2 0.04 0.06 57300 3044.3 0.04 0.06 57328
262144 65536 float sum -1 5726.5 0.05 0.07 114442 5598.7 0.05 0.07 114582
524288 131072 float sum -1 11182 0.05 0.07 229370 11175 0.05 0.07 229484
1048576 262144 float sum -1 22194 0.05 0.07 458762 22207 0.05 0.07 458174
manolinux:3417:3417 [1] NCCL INFO comm 0x557f76c51ba0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:12436:12436 [1] NCCL INFO comm 0x5623c4487dc0 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:3417:3417 [1] NCCL INFO comm 0x557f78b5b7c0 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth : 0.02695
#
manolinux1:12436:12436 [1] NCCL INFO comm 0x5623c472d2e0 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55517,1],1]
Exit code: 1
--------------------------------------------------------------------------
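Since the result rows are interleaved with NCCL INFO lines, it can help to pull out just the `#wrong` columns when comparing runs. The following is a minimal sketch, not part of nccl-tests: the `ROW` pattern and the `parse_rows` helper are hypothetical names, and the column layout is assumed from the header printed above (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and again for in-place). The sample text is copied from the run above.

```python
import re

# Assumed layout of one all_reduce_perf result row (taken from the header in
# the log): size, count, type, redop, root, then time/algbw/busbw/#wrong for
# out-of-place, then the same four columns for in-place.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\w+)\s+(\w+)\s+(-?\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s*$"
)

def parse_rows(log_text):
    """Return one dict per result row; every other log line is skipped."""
    rows = []
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "size": int(m.group(1)),
                "count": int(m.group(2)),
                "wrong_oop": int(m.group(9)),   # out-of-place #wrong
                "wrong_ip": int(m.group(13)),   # in-place #wrong
            })
    return rows

# Three rows and one INFO line copied from the tree/LL run above.
sample = """\
8 2 float sum -1 155.1 0.00 0.00 4 151.4 0.00 0.00 4
16 4 float sum -1 156.3 0.00 0.00 8 155.7 0.00 0.00 6
manolinux:3417:3417 [1] NCCL INFO comm 0x557f76c51ba0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
1048576 262144 float sum -1 22194 0.05 0.07 458762 22207 0.05 0.07 458174
"""

rows = parse_rows(sample)
for r in rows:
    print(r["size"], r["count"], r["wrong_oop"], r["wrong_ip"])
```

Running the same helper over each pasted log would make it easy to check whether the `#wrong` counts differ between algo/proto combinations.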
Logs for algo=ring, proto=simple. It fails:
(env-torch) mbiswas@manolinux:~/ai/pytorch/nccl-tests$ mpirun -np 2 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_ALGO=RING -x NCCL_PROTO=SIMPLE ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 21894 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 21894 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 29571 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 29571 on manolinux1 device 1 [0x28] Quadro K620
manolinux:21894:21894 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:21894:21894 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21894:21894 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:29571:29571 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:29571:29571 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:29571:29571 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21894:21904 [1] NCCL INFO NET/IB : No device found.
manolinux:21894:21904 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:21894:21904 [1] NCCL INFO Using non-device net plugin version 0
manolinux:21894:21904 [1] NCCL INFO Using network Socket
manolinux:21894:21903 [0] NCCL INFO Using non-device net plugin version 0
manolinux:21894:21903 [0] NCCL INFO Using network Socket
manolinux1:29571:29579 [0] NCCL INFO NET/IB : No device found.
manolinux1:29571:29579 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:29571:29579 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:29571:29579 [0] NCCL INFO Using network Socket
manolinux1:29571:29580 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:29571:29580 [1] NCCL INFO Using network Socket
manolinux:21894:21904 [1] NCCL INFO comm 0x5573da6246f0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbca260874ad1ab29 - Init START
manolinux:21894:21903 [0] NCCL INFO comm 0x5573d871aae0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbca260874ad1ab29 - Init START
manolinux1:29571:29580 [1] NCCL INFO comm 0x561e2496f640 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbca260874ad1ab29 - Init START
manolinux1:29571:29579 [0] NCCL INFO comm 0x561e246ca120 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbca260874ad1ab29 - Init START
manolinux:21894:21904 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:21894:21904 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:21894:21903 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:21894:21903 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:21894:21903 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:21894:21903 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:29571:29580 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:29571:29580 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:29571:29579 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:29571:29579 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:21894:21903 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21894:21903 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21894:21904 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29571:29580 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29571:29580 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:29571:29579 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:21894:21904 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Connected all rings
manolinux1:29571:29580 [1] NCCL INFO Connected all rings
manolinux1:29571:29580 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:29571:29580 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:21894:21904 [1] NCCL INFO Connected all rings
manolinux1:29571:29579 [0] NCCL INFO Connected all rings
manolinux:21894:21904 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:21894:21904 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:29571:29579 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:29571:29579 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:21894:21903 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:21894:21904 [1] NCCL INFO Connected all trees
manolinux:21894:21903 [0] NCCL INFO Connected all trees
manolinux:21894:21903 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:21894:21904 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:21894:21903 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:21894:21904 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:21894:21904 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21894:21904 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21894:21903 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21894:21903 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29571:29579 [0] NCCL INFO Connected all trees
manolinux1:29571:29579 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:29571:29579 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:29571:29579 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29571:29579 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29571:29580 [1] NCCL INFO Connected all trees
manolinux1:29571:29580 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:29571:29580 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:29571:29580 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29571:29580 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21894:21904 [1] NCCL INFO comm 0x5573da6246f0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbca260874ad1ab29 - Init COMPLETE
manolinux:21894:21903 [0] NCCL INFO comm 0x5573d871aae0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbca260874ad1ab29 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
manolinux1:29571:29579 [0] NCCL INFO comm 0x561e246ca120 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbca260874ad1ab29 - Init COMPLETE
manolinux1:29571:29580 [1] NCCL INFO comm 0x561e2496f640 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbca260874ad1ab29 - Init COMPLETE
8 2 float sum -1 338.4 0.00 0.00 4 351.7 0.00 0.00 4
16 4 float sum -1 388.1 0.00 0.00 8 300.7 0.00 0.00 6
32 8 float sum -1 436.8 0.00 0.00 16 418.1 0.00 0.00 12
64 16 float sum -1 424.9 0.00 0.00 30 438.5 0.00 0.00 24
128 32 float sum -1 308.2 0.00 0.00 54 290.9 0.00 0.00 52
256 64 float sum -1 313.6 0.00 0.00 114 311.8 0.00 0.00 118
512 128 float sum -1 415.2 0.00 0.00 234 454.8 0.00 0.00 230
1024 256 float sum -1 395.3 0.00 0.00 436 394.8 0.00 0.00 452
2048 512 float sum -1 371.4 0.01 0.01 864 337.9 0.01 0.01 906
4096 1024 float sum -1 378.2 0.01 0.02 1818 367.6 0.01 0.02 1768
8192 2048 float sum -1 502.3 0.02 0.02 3616 422.5 0.02 0.03 3612
16384 4096 float sum -1 514.7 0.03 0.05 7232 516.3 0.03 0.05 7206
32768 8192 float sum -1 703.5 0.05 0.07 14402 695.7 0.05 0.07 14318
65536 16384 float sum -1 1237.7 0.05 0.08 28600 1245.7 0.05 0.08 28676
131072 32768 float sum -1 2304.4 0.06 0.09 57300 2293.5 0.06 0.09 57328
262144 65536 float sum -1 4364.4 0.06 0.09 114442 4345.7 0.06 0.09 114582
524288 131072 float sum -1 8183.1 0.06 0.10 229370 8194.0 0.06 0.10 229484
1048576 262144 float sum -1 16249 0.06 0.10 458762 16266 0.06 0.10 458174
manolinux:21894:21894 [1] NCCL INFO comm 0x5573d871aae0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:29571:29571 [1] NCCL INFO comm 0x561e246ca120 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:21894:21894 [1] NCCL INFO comm 0x5573da6246f0 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth : 0.0347557
#
manolinux1:29571:29571 [1] NCCL INFO comm 0x561e2496f640 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[33006,1],0]
Exit code: 1
--------------------------------------------------------------------------
Logs for algo=ring, proto=LL. It fails:
(env-torch) mbiswas@manolinux:~/ai/pytorch/nccl-tests$ mpirun -np 2 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_ALGO=RING -x NCCL_PROTO=LL ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 21921 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 21921 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 29629 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 29629 on manolinux1 device 1 [0x28] Quadro K620
manolinux:21921:21921 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:21921:21921 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21921:21921 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:29629:29629 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:29629:29629 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:29629:29629 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21921:21931 [1] NCCL INFO NET/IB : No device found.
manolinux:21921:21931 [1] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:21921:21931 [1] NCCL INFO Using non-device net plugin version 0
manolinux:21921:21931 [1] NCCL INFO Using network Socket
manolinux:21921:21930 [0] NCCL INFO Using non-device net plugin version 0
manolinux:21921:21930 [0] NCCL INFO Using network Socket
manolinux1:29629:29637 [0] NCCL INFO NET/IB : No device found.
manolinux1:29629:29637 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:29629:29637 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:29629:29637 [0] NCCL INFO Using network Socket
manolinux1:29629:29638 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:29629:29638 [1] NCCL INFO Using network Socket
manolinux:21921:21931 [1] NCCL INFO comm 0x563dd0303830 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbd06a7ca770fd61 - Init START
manolinux:21921:21930 [0] NCCL INFO comm 0x563dce3f9c20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbd06a7ca770fd61 - Init START
manolinux1:29629:29637 [0] NCCL INFO comm 0x5583fa007f00 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbd06a7ca770fd61 - Init START
manolinux1:29629:29638 [1] NCCL INFO comm 0x5583fa2ad420 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbd06a7ca770fd61 - Init START
manolinux:21921:21931 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:21921:21931 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:21921:21930 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:21921:21930 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:21921:21930 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:21921:21930 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:29629:29638 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:29629:29638 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:29629:29637 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:29629:29637 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:21921:21931 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21921:21931 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21921:21930 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux1:29629:29637 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29629:29638 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29629:29638 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:29629:29637 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux:21921:21931 [1] NCCL INFO Connected all rings
manolinux:21921:21931 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:21921:21931 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux:21921:21930 [0] NCCL INFO Connected all rings
manolinux1:29629:29637 [0] NCCL INFO Connected all rings
manolinux1:29629:29638 [1] NCCL INFO Connected all rings
manolinux1:29629:29638 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:29629:29638 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux1:29629:29637 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:29629:29637 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:21921:21930 [0] NCCL INFO Connected all trees
manolinux:21921:21931 [1] NCCL INFO Connected all trees
manolinux:21921:21930 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:21921:21931 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux:21921:21931 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:21921:21931 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21921:21931 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21921:21930 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux:21921:21930 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21921:21930 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29629:29638 [1] NCCL INFO Connected all trees
manolinux1:29629:29638 [1] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:29629:29638 [1] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:29629:29637 [0] NCCL INFO Connected all trees
manolinux1:29629:29638 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29629:29637 [0] NCCL INFO NCCL_PROTO set by environment to LL
manolinux1:29629:29638 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29629:29637 [0] NCCL INFO NCCL_ALGO set by environment to RING
manolinux1:29629:29637 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29629:29637 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29629:29637 [0] NCCL INFO comm 0x5583fa007f00 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0xbd06a7ca770fd61 - Init COMPLETE
manolinux1:29629:29638 [1] NCCL INFO comm 0x5583fa2ad420 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0xbd06a7ca770fd61 - Init COMPLETE
manolinux:21921:21930 [0] NCCL INFO comm 0x563dce3f9c20 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbd06a7ca770fd61 - Init COMPLETE
manolinux:21921:21931 [1] NCCL INFO comm 0x563dd0303830 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0xbd06a7ca770fd61 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 239.8 0.00 0.00 4 223.1 0.00 0.00 4
16 4 float sum -1 226.9 0.00 0.00 8 216.4 0.00 0.00 6
32 8 float sum -1 224.8 0.00 0.00 16 219.4 0.00 0.00 12
64 16 float sum -1 236.0 0.00 0.00 30 223.9 0.00 0.00 24
128 32 float sum -1 238.8 0.00 0.00 54 242.4 0.00 0.00 52
256 64 float sum -1 216.2 0.00 0.00 114 266.0 0.00 0.00 118
512 128 float sum -1 320.7 0.00 0.00 234 308.2 0.00 0.00 230
1024 256 float sum -1 327.3 0.00 0.00 436 343.3 0.00 0.00 452
2048 512 float sum -1 414.6 0.00 0.01 864 427.0 0.00 0.01 906
4096 1024 float sum -1 460.9 0.01 0.01 1818 466.3 0.01 0.01 1768
8192 2048 float sum -1 467.1 0.02 0.03 3616 407.9 0.02 0.03 3612
16384 4096 float sum -1 694.7 0.02 0.04 7232 713.8 0.02 0.03 7206
32768 8192 float sum -1 1162.6 0.03 0.04 14402 1177.2 0.03 0.04 14318
65536 16384 float sum -1 2199.8 0.03 0.04 28600 2203.3 0.03 0.04 28676
131072 32768 float sum -1 4237.8 0.03 0.05 57300 4227.0 0.03 0.05 57328
262144 65536 float sum -1 8375.4 0.03 0.05 114442 8388.5 0.03 0.05 114582
524288 131072 float sum -1 16720 0.03 0.05 229370 16712 0.03 0.05 229484
1048576 262144 float sum -1 33421 0.03 0.05 458762 33404 0.03 0.05 458174
manolinux:21921:21921 [1] NCCL INFO comm 0x563dce3f9c20 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:29629:29629 [1] NCCL INFO comm 0x5583fa007f00 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:21921:21921 [1] NCCL INFO comm 0x563dd0303830 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth : 0.0204477
#
manolinux1:29629:29629 [1] NCCL INFO comm 0x5583fa2ad420 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[32773,1],0]
Exit code: 1
--------------------------------------------------------------------------
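The runs in this comment differ only in the `NCCL_ALGO` and `NCCL_PROTO` values passed to `mpirun`. A minimal sketch for generating the four command lines rather than typing each one by hand is shown here; `build_commands` is a hypothetical helper, the host list and test arguments are copied from the commands above, and `LD_LIBRARY_PATH` is exported once. The script only prints the commands and does not launch anything.

```python
import itertools
import shlex

# Values copied from the mpirun invocations in this thread.
HOSTS = "10.39.43.133,10.39.42.196"
BASE_TEST = ["./build/all_reduce_perf", "-b", "8", "-e", "1M", "-f", "2", "-g", "2"]

def build_commands(algos=("RING", "TREE"), protos=("SIMPLE", "LL")):
    """Build one mpirun command line per (algo, proto) combination."""
    cmds = []
    for algo, proto in itertools.product(algos, protos):
        argv = [
            "mpirun", "-np", "2", "-H", HOSTS,
            "-x", "LD_LIBRARY_PATH",
            "-x", "NCCL_DEBUG=INFO",
            "-x", f"NCCL_ALGO={algo}",
            "-x", f"NCCL_PROTO={proto}",
        ] + BASE_TEST
        # shlex.join quotes arguments where needed (Python 3.8+).
        cmds.append(shlex.join(argv))
    return cmds

commands = build_commands()
for c in commands:
    print(c)
```

Each printed line can be pasted into a shell, with its output redirected to a separate file per combination so the logs are easier to compare.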
Logs for algo=tree, proto=simple. It fails:
(env-torch) mbiswas@manolinux:~/ai/pytorch/nccl-tests$ mpirun -np 2 -H 10.39.43.133,10.39.42.196 -x LD_LIBRARY_PATH -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_ALGO=TREE -x NCCL_PROTO=SIMPLE ./build/all_reduce_perf -b 8 -e 1M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 1048576 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 21946 on manolinux device 0 [0x01] NVIDIA RTX A2000
# Rank 1 Group 0 Pid 21946 on manolinux device 1 [0x03] Quadro P2200
# Rank 2 Group 0 Pid 29687 on manolinux1 device 0 [0x0f] Quadro K620
# Rank 3 Group 0 Pid 29687 on manolinux1 device 1 [0x28] Quadro K620
manolinux:21946:21946 [0] NCCL INFO Bootstrap : Using eno1:10.39.43.133<0>
manolinux:21946:21946 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21946:21946 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.19.3+cuda11.5
manolinux1:29687:29687 [0] NCCL INFO cudaDriverVersion 12020
manolinux1:29687:29687 [0] NCCL INFO Bootstrap : Using enp1s0:10.39.42.196<0>
manolinux1:29687:29687 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
manolinux:21946:21955 [0] NCCL INFO NET/IB : No device found.
manolinux:21946:21955 [0] NCCL INFO NET/Socket : Using [0]eno1:10.39.43.133<0>
manolinux:21946:21955 [0] NCCL INFO Using non-device net plugin version 0
manolinux:21946:21955 [0] NCCL INFO Using network Socket
manolinux:21946:21956 [1] NCCL INFO Using non-device net plugin version 0
manolinux:21946:21956 [1] NCCL INFO Using network Socket
manolinux1:29687:29695 [0] NCCL INFO NET/IB : No device found.
manolinux1:29687:29695 [0] NCCL INFO NET/Socket : Using [0]enp1s0:10.39.42.196<0>
manolinux1:29687:29695 [0] NCCL INFO Using non-device net plugin version 0
manolinux1:29687:29695 [0] NCCL INFO Using network Socket
manolinux1:29687:29696 [1] NCCL INFO Using non-device net plugin version 0
manolinux1:29687:29696 [1] NCCL INFO Using network Socket
manolinux:21946:21956 [1] NCCL INFO comm 0x56276b8faa70 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x63e8b854898fb452 - Init START
manolinux:21946:21955 [0] NCCL INFO comm 0x5627699f0e60 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x63e8b854898fb452 - Init START
manolinux1:29687:29696 [1] NCCL INFO comm 0x560e27b0d810 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x63e8b854898fb452 - Init START
manolinux1:29687:29695 [0] NCCL INFO comm 0x560e278682f0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x63e8b854898fb452 - Init START
manolinux:21946:21956 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
manolinux:21946:21956 [1] NCCL INFO P2P Chunksize set to 131072
manolinux:21946:21955 [0] NCCL INFO Channel 00/02 : 0 1 2 3
manolinux:21946:21955 [0] NCCL INFO Channel 01/02 : 0 1 2 3
manolinux:21946:21955 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
manolinux:21946:21955 [0] NCCL INFO P2P Chunksize set to 131072
manolinux1:29687:29696 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
manolinux1:29687:29696 [1] NCCL INFO P2P Chunksize set to 131072
manolinux1:29687:29695 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
manolinux1:29687:29695 [0] NCCL INFO P2P Chunksize set to 131072
manolinux:21946:21956 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21946:21956 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:29687:29696 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
manolinux1:29687:29696 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21946:21955 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
manolinux:21946:21955 [0] NCCL INFO Connected all rings
manolinux:21946:21955 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
manolinux1:29687:29696 [1] NCCL INFO Connected all rings
manolinux:21946:21956 [1] NCCL INFO Connected all rings
manolinux1:29687:29695 [0] NCCL INFO Connected all rings
manolinux1:29687:29696 [1] NCCL INFO Channel 00 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:21946:21956 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:29687:29696 [1] NCCL INFO Channel 01 : 3[1] -> 2[0] via SHM/direct/direct
manolinux:21946:21956 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
manolinux1:29687:29695 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux1:29687:29695 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/Socket/0
manolinux:21946:21955 [0] NCCL INFO Connected all trees
manolinux:21946:21956 [1] NCCL INFO Connected all trees
manolinux:21946:21955 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:21946:21956 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux:21946:21956 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:21946:21956 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21946:21956 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21946:21955 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux:21946:21955 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux:21946:21955 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29687:29696 [1] NCCL INFO Connected all trees
manolinux1:29687:29696 [1] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:29687:29695 [0] NCCL INFO Connected all trees
manolinux1:29687:29696 [1] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:29687:29695 [0] NCCL INFO NCCL_PROTO set by environment to SIMPLE
manolinux1:29687:29695 [0] NCCL INFO NCCL_ALGO set by environment to TREE
manolinux1:29687:29696 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29687:29696 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux1:29687:29695 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
manolinux1:29687:29695 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
manolinux:21946:21956 [1] NCCL INFO comm 0x56276b8faa70 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 3000 commId 0x63e8b854898fb452 - Init COMPLETE
manolinux:21946:21955 [0] NCCL INFO comm 0x5627699f0e60 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x63e8b854898fb452 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
manolinux1:29687:29695 [0] NCCL INFO comm 0x560e278682f0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId f000 commId 0x63e8b854898fb452 - Init COMPLETE
manolinux1:29687:29696 [1] NCCL INFO comm 0x560e27b0d810 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 28000 commId 0x63e8b854898fb452 - Init COMPLETE
8 2 float sum -1 233.9 0.00 0.00 4 225.7 0.00 0.00 4
16 4 float sum -1 220.6 0.00 0.00 8 208.5 0.00 0.00 6
32 8 float sum -1 203.1 0.00 0.00 16 226.6 0.00 0.00 12
64 16 float sum -1 227.3 0.00 0.00 30 233.8 0.00 0.00 24
128 32 float sum -1 212.2 0.00 0.00 54 226.9 0.00 0.00 52
256 64 float sum -1 224.4 0.00 0.00 114 237.3 0.00 0.00 118
512 128 float sum -1 236.2 0.00 0.00 234 237.1 0.00 0.00 230
1024 256 float sum -1 237.6 0.00 0.01 436 239.8 0.00 0.01 452
2048 512 float sum -1 241.9 0.01 0.01 864 248.0 0.01 0.01 906
4096 1024 float sum -1 351.0 0.01 0.02 1818 343.7 0.01 0.02 1768
8192 2048 float sum -1 366.8 0.02 0.03 3616 373.8 0.02 0.03 3612
16384 4096 float sum -1 677.9 0.02 0.04 7232 691.0 0.02 0.04 7206
32768 8192 float sum -1 1004.4 0.03 0.05 14402 1016.8 0.03 0.05 14318
65536 16384 float sum -1 1017.3 0.06 0.10 28600 1043.9 0.06 0.09 28676
131072 32768 float sum -1 1523.9 0.09 0.13 57300 1526.0 0.09 0.13 57328
262144 65536 float sum -1 2816.9 0.09 0.14 114442 2801.1 0.09 0.14 114582
524288 131072 float sum -1 5690.3 0.09 0.14 229370 5948.8 0.09 0.13 229484
1048576 262144 float sum -1 11035 0.10 0.14 458762 11255 0.09 0.14 458174
manolinux:21946:21946 [1] NCCL INFO comm 0x5627699f0e60 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
manolinux1:29687:29687 [1] NCCL INFO comm 0x560e278682f0 rank 2 nranks 4 cudaDev 0 busId f000 - Destroy COMPLETE
manolinux:21946:21946 [1] NCCL INFO comm 0x56276b8faa70 rank 1 nranks 4 cudaDev 1 busId 3000 - Destroy COMPLETE
# Out of bounds values : 36 FAILED
# Avg bus bandwidth : 0.0445358
#
manolinux1:29687:29687 [1] NCCL INFO comm 0x560e27b0d810 rank 3 nranks 4 cudaDev 1 busId 28000 - Destroy COMPLETE
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[32802,1],1]
Exit code: 1
--------------------------------------------------------------------------
I changed the CUDA version to 12.2 but still see the same problem as above. The test runs but fails.
Ok. This looks badly broken. Unfortunately, this is a weird combination of old GPUs, so we can't really justify spending time debugging this more than I already did.
Maybe you'd be luckier with an older version of NCCL, like 2.8 or even 2.4.
If I have the following GPUs on both machines, will that be good? Or which GPUs do you prefer? NVIDIA RTX A2000, Quadro P2200
I don't know really. We don't have systems with any of those GPUs to try with. In general we'd advise to use a single type of GPUs, and RTX/Quadro cards are not our main focus as they're not aimed at multi-GPU training.
OK, I will try to use a single type of GPU and will let you know.
I replaced the K620 with a Quadro P2200 and the nccl-tests are running fine. Can you please point me to documentation that would help me better understand the test results?
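As a rough aid for reading the `algbw` and `busbw` columns, here is a minimal Python sketch. It assumes the test is an all-reduce, which is not stated in this part of the thread but is consistent with the numbers in the table above. The helper name `allreduce_bandwidth` is hypothetical. The algorithm bandwidth is the message size divided by the time, and for all-reduce the bus bandwidth scales that by 2(n-1)/n, where n is the number of ranks.

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    Assumption: the collective is all-reduce, so the bus bandwidth
    correction factor is 2 * (nranks - 1) / nranks.
    """
    # Algorithm bandwidth: bytes moved per second, expressed in GB/s.
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # Bus bandwidth: algbw scaled by the all-reduce correction factor.
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Last out-of-place row of the table above: 1048576 bytes in 11035 us on 4 ranks.
algbw, busbw = allreduce_bandwidth(1048576, 11035, 4)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

For that row the sketch gives values that round to the 0.10 and 0.14 GB/s shown in the table. The `#wrong` columns are separate from the bandwidth figures: they report how many elements failed data verification, so a row can show a plausible bandwidth and still be incorrect.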
I have two nodes, manolinux (10.39.43.133) and manolinux1 (10.39.42.196), each with two GPUs.
I am running the following command on manolinux:
manolinux: cudaDriverVersion 12020 NVIDIA-SMI 535.129.03
manolinux1: cudaDriverVersion 12020 NVIDIA-SMI 535.129.03
Following is the log of this test run:
What is the reason for 'the launch timed out and was terminated'?
If I run the above command with GPU 1, it runs but gives the following error:
logs:
It seems the run completed, but what is the reason for reporting out-of-bounds values?
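In the failing run logged above, all 18 result rows show a nonzero `#wrong` count in both the out-of-place and in-place columns, and 18 x 2 = 36 coincides with the reported `Out of bounds values : 36 FAILED`. The following Python sketch counts such measurements in a pasted log. It assumes the summary counts measurements with at least one wrong element, which is an inference from this log and not confirmed from the nccl-tests source. The regular expression and the helper name `count_failed_measurements` are illustrative, not part of nccl-tests.

```python
import re

# One result row: size count type redop root, then (time algbw busbw #wrong)
# for out-of-place and again for in-place. Interleaved NCCL INFO lines
# do not match this pattern and are skipped.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\w+)\s+(\w+)\s+(-?\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s*$"
)


def count_failed_measurements(log_text):
    """Count out-of-place and in-place measurements whose #wrong is nonzero."""
    failed = 0
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        out_of_place_wrong = int(match.group(9))
        in_place_wrong = int(match.group(13))
        failed += (out_of_place_wrong > 0) + (in_place_wrong > 0)
    return failed


# Two rows copied from the log above, plus one non-table line.
sample = """\
8 2 float sum -1 233.9 0.00 0.00 4 225.7 0.00 0.00 4
16 4 float sum -1 220.6 0.00 0.00 8 208.5 0.00 0.00 6
manolinux:21946:21946 [1] NCCL INFO comm 0x5627699f0e60 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
"""
print(count_failed_measurements(sample))
```

A nonzero `#wrong` means the data produced by the collective did not match the expected result for that message size, so the run can finish and still be reported as FAILED.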