NVIDIA / nccl-tests

NCCL Tests

Test NCCL failure common with network error. #252

Open ismailguzel opened 2 days ago

ismailguzel commented 2 days ago

I'd like to run the NCCL tests on two nodes with 4 H100 GPUs each. I compiled nccl-tests with MPI support using the commands below:

CUDA_HOME=/usr/local/cuda-12.6
NCCL_HOME=/opt/nvidia/nvidia_hpc_benchmarks_mpich/lib/nccl
MPI_HOME=/usr/mpi/gcc/openmpi-4.1.7a1/

make MPI=1 MPI_HOME=$MPI_HOME CUDA_HOME=$CUDA_HOME NCCL_HOME=$NCCL_HOME

Then, when I run the following command, it ends with a connection error:

mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH --hostfile ./hostfiles.txt --bind-to none -v -np 8  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

The output was

# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 203082 on    kolyoz1 device  0 [0x42] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid 203083 on    kolyoz1 device  1 [0x55] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid 203084 on    kolyoz1 device  2 [0xd4] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid 203085 on    kolyoz1 device  3 [0xe6] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid 153049 on    kolyoz2 device  0 [0x42] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid 153050 on    kolyoz2 device  1 [0x55] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid 153051 on    kolyoz2 device  2 [0xd4] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid 153052 on    kolyoz2 device  3 [0xe6] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
kolyoz2: Test NCCL failure common.cu:307 'remote process exited or there was a network error / '
 .. kolyoz2 pid 153049: Test failure common.cu:405
 .. kolyoz2 pid 153049: Test failure common.cu:592
 .. kolyoz2 pid 153049: Test failure all_reduce.cu:90
 .. kolyoz2 pid 153049: Test failure common.cu:623
 .. kolyoz2 pid 153049: Test failure common.cu:1078
 .. kolyoz2 pid 153049: Test failure common.cu:891
kolyoz2: Test NCCL failure common.cu:307 'remote process exited or there was a network error / '
 .. kolyoz2 pid 153051: Test failure common.cu:405
 .. kolyoz2 pid 153051: Test failure common.cu:592
 .. kolyoz2 pid 153051: Test failure all_reduce.cu:90
 .. kolyoz2 pid 153051: Test failure common.cu:623
 .. kolyoz2 pid 153051: Test failure common.cu:1078
 .. kolyoz2 pid 153051: Test failure common.cu:891
kolyoz2: Test NCCL failure common.cu:307 'remote process exited or there was a network error / '
 .. kolyoz2 pid 153050: Test failure common.cu:405
 .. kolyoz2 pid 153050: Test failure common.cu:592
 .. kolyoz2 pid 153050: Test failure all_reduce.cu:90
 .. kolyoz2 pid 153050: Test failure common.cu:623
 .. kolyoz2 pid 153050: Test failure common.cu:1078
 .. kolyoz2 pid 153050: Test failure common.cu:891
kolyoz2: Test NCCL failure common.cu:307 'remote process exited or there was a network error / '
 .. kolyoz2 pid 153052: Test failure common.cu:405
 .. kolyoz2 pid 153052: Test failure common.cu:592
 .. kolyoz2 pid 153052: Test failure all_reduce.cu:90
 .. kolyoz2 pid 153052: Test failure common.cu:623
 .. kolyoz2 pid 153052: Test failure common.cu:1078
 .. kolyoz2 pid 153052: Test failure common.cu:891
kolyoz1: Test NCCL failure common.cu:307 'remote process exited or there was a network error / '
 .. kolyoz1 pid 203082: Test failure common.cu:405
 .. kolyoz1 pid 203082: Test failure common.cu:592
 .. kolyoz1 pid 203082: Test failure all_reduce.cu:90
 .. kolyoz1 pid 203082: Test failure common.cu:623
 .. kolyoz1 pid 203082: Test failure common.cu:1078
 .. kolyoz1 pid 203082: Test failure common.cu:891
kolyoz1: Test NCCL failure common.cu:307 'remote process exited or there was a network error / '
 .. kolyoz1 pid 203083: Test failure common.cu:405
 .. kolyoz1 pid 203083: Test failure common.cu:592
 .. kolyoz1 pid 203083: Test failure all_reduce.cu:90
 .. kolyoz1 pid 203083: Test failure common.cu:623
 .. kolyoz1 pid 203083: Test failure common.cu:1078
 .. kolyoz1 pid 203083: Test failure common.cu:891
kolyoz1: Test NCCL failure common.cu:307 'remote process exited or there was a network error / '
 .. kolyoz1 pid 203085: Test failure common.cu:405
 .. kolyoz1 pid 203085: Test failure common.cu:592
 .. kolyoz1 pid 203085: Test failure all_reduce.cu:90
 .. kolyoz1 pid 203085: Test failure common.cu:623
 .. kolyoz1 pid 203085: Test failure common.cu:1078
 .. kolyoz1 pid 203085: Test failure common.cu:891
kolyoz1: Test NCCL failure common.cu:307 'remote process exited or there was a network error / '
 .. kolyoz1 pid 203084: Test failure common.cu:405
 .. kolyoz1 pid 203084: Test failure common.cu:592
 .. kolyoz1 pid 203084: Test failure all_reduce.cu:90
 .. kolyoz1 pid 203084: Test failure common.cu:623
 .. kolyoz1 pid 203084: Test failure common.cu:1078
 .. kolyoz1 pid 203084: Test failure common.cu:891

Could you help me solve this issue?

AddyLaddy commented 1 day ago

We'd need to see the NCCL_DEBUG=INFO logs in order to be able to help you. Pass it on the mpirun command line with -x NCCL_DEBUG=INFO
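
For reference, a minimal sketch of what that could look like, reusing the launch command from the original report with only `-x NCCL_DEBUG=INFO` added. The `launch` helper and the `DRY_RUN` variable are hypothetical names introduced for this example: by default the helper only prints the command, so the line can be checked before it is run on the cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of nccl-tests or Open MPI):
# prints the command when DRY_RUN=1 (the default), runs it otherwise.
launch() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Same mpirun invocation as in the report; -x exports the named
# environment variable to every rank, so each process logs at INFO level.
launch mpirun --allow-run-as-root \
  -x PATH -x LD_LIBRARY_PATH \
  -x NCCL_DEBUG=INFO \
  --hostfile ./hostfiles.txt --bind-to none -v -np 8 \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```

Running the script with `DRY_RUN=0` executes the command instead of printing it. The hostfile path and binary location are taken from the report and may differ on other systems.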

ismailguzel commented 1 day ago

By the way, the system is Rocky Linux 9. Here are the logs from the NCCL test run inside the container provided by NGC hpc-benchmarks-24.06:


# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  27255 on    kolyoz1 device  0 [0x42] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid  27223 on    kolyoz1 device  1 [0x55] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid  27224 on    kolyoz1 device  2 [0xd4] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid  27225 on    kolyoz1 device  3 [0xe6] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid  25146 on    kolyoz2 device  0 [0x42] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid  25172 on    kolyoz2 device  1 [0x55] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid  25169 on    kolyoz2 device  2 [0xd4] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid  25148 on    kolyoz2 device  3 [0xe6] NVIDIA H100 80GB HBM3
kolyoz1:27255:27255 [0] NCCL INFO Bootstrap : Using ib0:10.0.35.1<0>
kolyoz1:27255:27255 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
kolyoz1:27225:27225 [3] NCCL INFO cudaDriverVersion 12060
kolyoz1:27225:27225 [3] NCCL INFO Bootstrap : Using ib0:10.0.35.1<0>
kolyoz2:25148:25148 [3] NCCL INFO cudaDriverVersion 12060
kolyoz2:25148:25148 [3] NCCL INFO Bootstrap : Using ib0:10.0.35.2<0>
kolyoz1:27223:27223 [1] NCCL INFO cudaDriverVersion 12060
kolyoz1:27223:27223 [1] NCCL INFO Bootstrap : Using ib0:10.0.35.1<0>
kolyoz1:27224:27224 [2] NCCL INFO cudaDriverVersion 12060
kolyoz1:27224:27224 [2] NCCL INFO Bootstrap : Using ib0:10.0.35.1<0>
kolyoz2:25146:25146 [0] NCCL INFO cudaDriverVersion 12060
kolyoz2:25146:25146 [0] NCCL INFO Bootstrap : Using ib0:10.0.35.2<0>
kolyoz2:25172:25172 [1] NCCL INFO cudaDriverVersion 12060
kolyoz2:25172:25172 [1] NCCL INFO Bootstrap : Using ib0:10.0.35.2<0>
kolyoz2:25169:25169 [2] NCCL INFO cudaDriverVersion 12060
kolyoz2:25169:25169 [2] NCCL INFO Bootstrap : Using ib0:10.0.35.2<0>
kolyoz2:25148:25826 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
kolyoz2:25148:25826 [3] NCCL INFO P2P plugin IBext_v8
kolyoz2:25172:25828 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
kolyoz2:25172:25828 [1] NCCL INFO P2P plugin IBext_v8
kolyoz1:27255:27906 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
kolyoz1:27255:27906 [0] NCCL INFO P2P plugin IBext_v8
kolyoz2:25148:25826 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ib0:10.0.35.2<0>
kolyoz2:25148:25826 [3] NCCL INFO Using non-device net plugin version 0
kolyoz2:25148:25826 [3] NCCL INFO Using network IBext_v8
kolyoz2:25172:25828 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ib0:10.0.35.2<0>
kolyoz2:25172:25828 [1] NCCL INFO Using non-device net plugin version 0
kolyoz2:25172:25828 [1] NCCL INFO Using network IBext_v8
kolyoz1:27223:27908 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
kolyoz1:27223:27908 [1] NCCL INFO P2P plugin IBext_v8
kolyoz1:27255:27906 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ib0:10.0.35.1<0>
kolyoz1:27255:27906 [0] NCCL INFO Using non-device net plugin version 0
kolyoz1:27255:27906 [0] NCCL INFO Using network IBext_v8
kolyoz1:27223:27908 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ib0:10.0.35.1<0>
kolyoz1:27223:27908 [1] NCCL INFO Using non-device net plugin version 0
kolyoz1:27223:27908 [1] NCCL INFO Using network IBext_v8
kolyoz1:27224:27909 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
kolyoz1:27224:27909 [2] NCCL INFO P2P plugin IBext_v8
kolyoz1:27224:27909 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ib0:10.0.35.1<0>
kolyoz1:27224:27909 [2] NCCL INFO Using non-device net plugin version 0
kolyoz1:27224:27909 [2] NCCL INFO Using network IBext_v8
kolyoz1:27225:27907 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
kolyoz1:27225:27907 [3] NCCL INFO P2P plugin IBext_v8
kolyoz1:27225:27907 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ib0:10.0.35.1<0>
kolyoz1:27225:27907 [3] NCCL INFO Using non-device net plugin version 0
kolyoz1:27225:27907 [3] NCCL INFO Using network IBext_v8
kolyoz2:25148:25826 [3] NCCL INFO DMA-BUF is available on GPU device 3
kolyoz2:25172:25828 [1] NCCL INFO DMA-BUF is available on GPU device 1
kolyoz2:25146:25827 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
kolyoz2:25146:25827 [0] NCCL INFO P2P plugin IBext_v8
kolyoz2:25169:25829 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
kolyoz2:25169:25829 [2] NCCL INFO P2P plugin IBext_v8
kolyoz2:25146:25827 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ib0:10.0.35.2<0>
kolyoz2:25146:25827 [0] NCCL INFO Using non-device net plugin version 0
kolyoz2:25146:25827 [0] NCCL INFO Using network IBext_v8
kolyoz1:27255:27906 [0] NCCL INFO DMA-BUF is available on GPU device 0
kolyoz2:25169:25829 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ib0:10.0.35.2<0>
kolyoz2:25169:25829 [2] NCCL INFO Using non-device net plugin version 0
kolyoz2:25169:25829 [2] NCCL INFO Using network IBext_v8
kolyoz1:27223:27908 [1] NCCL INFO DMA-BUF is available on GPU device 1
kolyoz1:27224:27909 [2] NCCL INFO DMA-BUF is available on GPU device 2
kolyoz1:27225:27907 [3] NCCL INFO DMA-BUF is available on GPU device 3
kolyoz2:25146:25827 [0] NCCL INFO DMA-BUF is available on GPU device 0
kolyoz2:25169:25829 [2] NCCL INFO DMA-BUF is available on GPU device 2
kolyoz2:25146:25827 [0] NCCL INFO ncclCommInitRank comm 0x559d5aa8ca00 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 42000 commId 0x996ee6f117cb8e1e - Init START
kolyoz1:27255:27906 [0] NCCL INFO ncclCommInitRank comm 0x555de9802860 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 42000 commId 0x996ee6f117cb8e1e - Init START
kolyoz2:25172:25828 [1] NCCL INFO ncclCommInitRank comm 0x556759b190e0 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 55000 commId 0x996ee6f117cb8e1e - Init START
kolyoz2:25169:25829 [2] NCCL INFO ncclCommInitRank comm 0x55a497ca6750 rank 6 nranks 8 cudaDev 2 nvmlDev 2 busId d4000 commId 0x996ee6f117cb8e1e - Init START
kolyoz2:25148:25826 [3] NCCL INFO ncclCommInitRank comm 0x5628b8d43360 rank 7 nranks 8 cudaDev 3 nvmlDev 3 busId e6000 commId 0x996ee6f117cb8e1e - Init START
kolyoz1:27223:27908 [1] NCCL INFO ncclCommInitRank comm 0x55e38c069f50 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 55000 commId 0x996ee6f117cb8e1e - Init START
kolyoz1:27224:27909 [2] NCCL INFO ncclCommInitRank comm 0x5644540d62c0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId d4000 commId 0x996ee6f117cb8e1e - Init START
kolyoz1:27225:27907 [3] NCCL INFO ncclCommInitRank comm 0x558b580881d0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId e6000 commId 0x996ee6f117cb8e1e - Init START
kolyoz2:25172:25828 [1] NCCL INFO Setting affinity for GPU 1 to fffffffd
kolyoz2:25172:25828 [1] NCCL INFO NVLS multicast support is not available on dev 1
kolyoz2:25148:25826 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000
kolyoz2:25148:25826 [3] NCCL INFO NVLS multicast support is not available on dev 3
kolyoz2:25146:25827 [0] NCCL INFO Setting affinity for GPU 0 to fffffffd
kolyoz2:25169:25829 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000
kolyoz2:25146:25827 [0] NCCL INFO NVLS multicast support is not available on dev 0
kolyoz2:25169:25829 [2] NCCL INFO NVLS multicast support is not available on dev 2
kolyoz1:27225:27907 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000
kolyoz1:27225:27907 [3] NCCL INFO NVLS multicast support is not available on dev 3
kolyoz1:27224:27909 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000
kolyoz1:27224:27909 [2] NCCL INFO NVLS multicast support is not available on dev 2
kolyoz1:27223:27908 [1] NCCL INFO Setting affinity for GPU 1 to fffffffd
kolyoz1:27223:27908 [1] NCCL INFO NVLS multicast support is not available on dev 1
kolyoz1:27255:27906 [0] NCCL INFO Setting affinity for GPU 0 to fffffffd
kolyoz1:27255:27906 [0] NCCL INFO NVLS multicast support is not available on dev 0
kolyoz2:25148:25826 [3] NCCL INFO comm 0x5628b8d43360 rank 7 nRanks 8 nNodes 2 localRanks 4 localRank 3 MNNVL 0
kolyoz2:25148:25826 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 4/-1/-1->7->6 [2] -1/-1/-1->7->5 [3] 4/-1/-1->7->3 [4] -1/-1/-1->7->6 [5] 4/-1/-1->7->6 [6] -1/-1/-1->7->5 [7] 4/3/-1->7->-1
kolyoz2:25148:25826 [3] NCCL INFO P2P Chunksize set to 131072
kolyoz1:27255:27906 [0] NCCL INFO comm 0x555de9802860 rank 0 nRanks 8 nNodes 2 localRanks 4 localRank 0 MNNVL 0
kolyoz1:27255:27906 [0] NCCL INFO Channel 00/08 :    0   1   2   3   4   5   6   7
kolyoz1:27255:27906 [0] NCCL INFO Channel 01/08 :    0   5   7   6   4   1   3   2
kolyoz1:27255:27906 [0] NCCL INFO Channel 02/08 :    0   3   6   5   4   7   2   1
kolyoz1:27255:27906 [0] NCCL INFO Channel 03/08 :    0   2   7   5   4   6   3   1
kolyoz1:27255:27906 [0] NCCL INFO Channel 04/08 :    0   1   2   3   4   5   6   7
kolyoz1:27255:27906 [0] NCCL INFO Channel 05/08 :    0   5   7   6   4   1   3   2
kolyoz1:27255:27906 [0] NCCL INFO Channel 06/08 :    0   3   6   5   4   7   2   1
kolyoz1:27255:27906 [0] NCCL INFO Channel 07/08 :    0   2   7   5   4   6   3   1
kolyoz1:27255:27906 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] -1/-1/-1->0->3 [2] 1/-1/-1->0->2 [3] 2/-1/-1->0->3 [4] 1/-1/-1->0->4 [5] -1/-1/-1->0->3 [6] 1/-1/-1->0->2 [7] 2/-1/-1->0->3
kolyoz1:27255:27906 [0] NCCL INFO P2P Chunksize set to 131072
kolyoz2:25172:25828 [1] NCCL INFO comm 0x556759b190e0 rank 5 nRanks 8 nNodes 2 localRanks 4 localRank 1 MNNVL 0
kolyoz2:25172:25828 [1] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->1 [2] 7/-1/-1->5->4 [3] -1/-1/-1->5->6 [4] 6/-1/-1->5->4 [5] 6/1/-1->5->-1 [6] 7/-1/-1->5->4 [7] -1/-1/-1->5->6
kolyoz2:25172:25828 [1] NCCL INFO P2P Chunksize set to 131072
kolyoz1:27223:27908 [1] NCCL INFO comm 0x55e38c069f50 rank 1 nRanks 8 nNodes 2 localRanks 4 localRank 1 MNNVL 0
kolyoz1:27223:27908 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/5/-1->1->-1 [2] 3/-1/-1->1->0 [3] -1/-1/-1->1->2 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->5 [6] 3/-1/-1->1->0 [7] -1/-1/-1->1->2
kolyoz1:27223:27908 [1] NCCL INFO P2P Chunksize set to 131072
kolyoz2:25169:25829 [2] NCCL INFO comm 0x55a497ca6750 rank 6 nRanks 8 nNodes 2 localRanks 4 localRank 2 MNNVL 0
kolyoz2:25169:25829 [2] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 4/-1/-1->6->2 [3] 5/-1/-1->6->4 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 4/2/-1->6->-1 [7] 5/-1/-1->6->4
kolyoz2:25169:25829 [2] NCCL INFO P2P Chunksize set to 131072
kolyoz1:27224:27909 [2] NCCL INFO comm 0x5644540d62c0 rank 2 nRanks 8 nNodes 2 localRanks 4 localRank 2 MNNVL 0
kolyoz1:27224:27909 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 0/6/-1->2->-1 [3] 1/-1/-1->2->0 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 0/-1/-1->2->6 [7] 1/-1/-1->2->0
kolyoz1:27224:27909 [2] NCCL INFO P2P Chunksize set to 131072
kolyoz2:25146:25827 [0] NCCL INFO comm 0x559d5aa8ca00 rank 4 nRanks 8 nNodes 2 localRanks 4 localRank 0 MNNVL 0
kolyoz2:25146:25827 [0] NCCL INFO Trees [0] 5/-1/-1->4->0 [1] -1/-1/-1->4->7 [2] 5/-1/-1->4->6 [3] 6/-1/-1->4->7 [4] 5/0/-1->4->-1 [5] -1/-1/-1->4->7 [6] 5/-1/-1->4->6 [7] 6/-1/-1->4->7
kolyoz2:25146:25827 [0] NCCL INFO P2P Chunksize set to 131072
kolyoz1:27225:27907 [3] NCCL INFO comm 0x558b580881d0 rank 3 nRanks 8 nNodes 2 localRanks 4 localRank 3 MNNVL 0
kolyoz1:27225:27907 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 0/-1/-1->3->2 [2] -1/-1/-1->3->1 [3] 0/7/-1->3->-1 [4] -1/-1/-1->3->2 [5] 0/-1/-1->3->2 [6] -1/-1/-1->3->1 [7] 0/-1/-1->3->7
kolyoz1:27225:27907 [3] NCCL INFO P2P Chunksize set to 131072
kolyoz1:27223:27908 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 00/0 : 5[1] -> 6[2] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 00/0 : 6[2] -> 7[3] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 04/0 : 5[1] -> 6[2] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 04/0 : 6[2] -> 7[3] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 00/0 : 3[3] -> 4[0] [receive] via NET/IBext_v8/0/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 00/0 : 7[3] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
kolyoz2:25148:25826 [3] NCCL INFO Channel 00/0 : 7[3] -> 0[0] [send] via NET/IBext_v8/0(4)/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 04/0 : 3[3] -> 4[0] [receive] via NET/IBext_v8/0/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 00/0 : 4[0] -> 5[1] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 04/0 : 7[3] -> 0[0] [send] via NET/IBext_v8/0(4)/GDRDMA
kolyoz1:27225:27907 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[0] [send] via NET/IBext_v8/0(0)/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 04/0 : 4[0] -> 5[1] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 04/0 : 7[3] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[0] [send] via NET/IBext_v8/0(0)/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 01/0 : 5[1] -> 7[3] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 03/0 : 4[0] -> 6[2] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 03/0 : 0[0] -> 2[2] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 07/0 : 4[0] -> 6[2] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 01/0 : 1[1] -> 3[3] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 05/0 : 5[1] -> 7[3] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 02/0 : 4[0] -> 7[3] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 05/0 : 1[1] -> 3[3] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 07/0 : 0[0] -> 2[2] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 02/0 : 0[0] -> 3[3] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 06/0 : 4[0] -> 7[3] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 06/0 : 0[0] -> 3[3] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 02/0 : 3[3] -> 6[2] [receive] via NET/IBext_v8/2/GDRDMA
kolyoz2:25148:25826 [3] NCCL INFO Channel 02/0 : 7[3] -> 2[2] [send] via NET/IBext_v8/2(6)/GDRDMA
kolyoz2:25169:25829 [2] NCCL INFO Channel 06/0 : 3[3] -> 6[2] [receive] via NET/IBext_v8/2/GDRDMA
kolyoz2:25148:25826 [3] NCCL INFO Channel 06/0 : 7[3] -> 2[2] [send] via NET/IBext_v8/2(6)/GDRDMA
kolyoz1:27223:27908 [1] NCCL INFO Channel 01/0 : 4[0] -> 1[1] [receive] via NET/IBext_v8/1/GDRDMA
kolyoz1:27223:27908 [1] NCCL INFO Channel 05/0 : 4[0] -> 1[1] [receive] via NET/IBext_v8/1/GDRDMA
kolyoz2:25172:25828 [1] NCCL INFO Channel 01/0 : 0[0] -> 5[1] [receive] via NET/IBext_v8/1/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 01/0 : 4[0] -> 1[1] [send] via NET/IBext_v8/1(5)/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 02/0 : 7[3] -> 2[2] [receive] via NET/IBext_v8/2/GDRDMA
kolyoz2:25172:25828 [1] NCCL INFO Channel 05/0 : 0[0] -> 5[1] [receive] via NET/IBext_v8/1/GDRDMA
kolyoz1:27225:27907 [3] NCCL INFO Channel 02/0 : 3[3] -> 6[2] [send] via NET/IBext_v8/2(2)/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 06/0 : 7[3] -> 2[2] [receive] via NET/IBext_v8/2/GDRDMA
kolyoz1:27225:27907 [3] NCCL INFO Channel 06/0 : 3[3] -> 6[2] [send] via NET/IBext_v8/2(2)/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[1] [send] via NET/IBext_v8/1(1)/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 05/0 : 0[0] -> 5[1] [send] via NET/IBext_v8/1(1)/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 05/0 : 4[0] -> 1[1] [send] via NET/IBext_v8/1(5)/GDRDMA
kolyoz1:27225:27907 [3] NCCL INFO Channel 03/0 : 6[2] -> 3[3] [receive] via NET/IBext_v8/3/GDRDMA
kolyoz2:25169:25829 [2] NCCL INFO Channel 03/0 : 6[2] -> 3[3] [send] via NET/IBext_v8/3(7)/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 03/0 : 2[2] -> 7[3] [send] via NET/IBext_v8/3(3)/GDRDMA
kolyoz2:25148:25826 [3] NCCL INFO Channel 03/0 : 2[2] -> 7[3] [receive] via NET/IBext_v8/3/GDRDMA
kolyoz1:27225:27907 [3] NCCL INFO Channel 07/0 : 6[2] -> 3[3] [receive] via NET/IBext_v8/3/GDRDMA
kolyoz2:25169:25829 [2] NCCL INFO Channel 07/0 : 6[2] -> 3[3] [send] via NET/IBext_v8/3(7)/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 07/0 : 2[2] -> 7[3] [send] via NET/IBext_v8/3(3)/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 01/0 : 2[2] -> 0[0] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 07/0 : 2[2] -> 7[3] [receive] via NET/IBext_v8/3/GDRDMA
kolyoz1:27225:27907 [3] NCCL INFO Channel 03/0 : 3[3] -> 1[1] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 01/0 : 6[2] -> 4[0] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 03/0 : 7[3] -> 5[1] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 05/0 : 2[2] -> 0[0] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 05/0 : 6[2] -> 4[0] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 07/0 : 3[3] -> 1[1] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 07/0 : 7[3] -> 5[1] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 01/0 : 7[3] -> 6[2] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 05/0 : 7[3] -> 6[2] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 02/0 : 5[1] -> 4[0] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 02/0 : 6[2] -> 5[1] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 03/0 : 5[1] -> 4[0] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 06/0 : 5[1] -> 4[0] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 06/0 : 6[2] -> 5[1] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 07/0 : 5[1] -> 4[0] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Connected all rings
kolyoz1:27255:27906 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Connected all rings
kolyoz2:25146:25827 [0] NCCL INFO Channel 02/0 : 4[0] -> 5[1] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Connected all rings
kolyoz1:27224:27909 [2] NCCL INFO Connected all rings
kolyoz2:25172:25828 [1] NCCL INFO Connected all rings
kolyoz2:25169:25829 [2] NCCL INFO Connected all rings
kolyoz2:25148:25826 [3] NCCL INFO Connected all rings
kolyoz1:27225:27907 [3] NCCL INFO Connected all rings
kolyoz1:27255:27906 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 06/0 : 4[0] -> 5[1] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 01/0 : 5[1] -> 6[2] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 03/0 : 5[1] -> 6[2] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 05/0 : 5[1] -> 6[2] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 07/0 : 5[1] -> 6[2] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 02/0 : 0[0] -> 2[2] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 01/0 : 6[2] -> 7[3] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 05/0 : 6[2] -> 7[3] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 02/0 : 4[0] -> 6[2] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 02/0 : 1[1] -> 3[3] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 02/0 : 5[1] -> 7[3] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 06/0 : 0[0] -> 2[2] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 06/0 : 1[1] -> 3[3] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 06/0 : 4[0] -> 6[2] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 01/0 : 0[0] -> 3[3] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 06/0 : 5[1] -> 7[3] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 02/0 : 6[2] -> 2[2] [receive] via NET/IBext_v8/2/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 01/0 : 4[0] -> 7[3] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 01/0 : 5[1] -> 1[1] [receive] via NET/IBext_v8/1/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 06/0 : 6[2] -> 2[2] [receive] via NET/IBext_v8/2/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 02/0 : 2[2] -> 6[2] [send] via NET/IBext_v8/2/GDRDMA
kolyoz1:27223:27908 [1] NCCL INFO Channel 05/0 : 5[1] -> 1[1] [receive] via NET/IBext_v8/1/GDRDMA
kolyoz1:27223:27908 [1] NCCL INFO Channel 01/0 : 1[1] -> 5[1] [send] via NET/IBext_v8/1/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 06/0 : 2[2] -> 6[2] [send] via NET/IBext_v8/2/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 03/0 : 0[0] -> 3[3] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 02/0 : 2[2] -> 6[2] [receive] via NET/IBext_v8/2/GDRDMA
kolyoz2:25169:25829 [2] NCCL INFO Channel 06/0 : 2[2] -> 6[2] [receive] via NET/IBext_v8/2/GDRDMA
kolyoz1:27223:27908 [1] NCCL INFO Channel 05/0 : 1[1] -> 5[1] [send] via NET/IBext_v8/1/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 05/0 : 0[0] -> 3[3] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 01/0 : 1[1] -> 5[1] [receive] via NET/IBext_v8/1/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 07/0 : 0[0] -> 3[3] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 02/0 : 6[2] -> 2[2] [send] via NET/IBext_v8/2/GDRDMA
kolyoz2:25169:25829 [2] NCCL INFO Channel 06/0 : 6[2] -> 2[2] [send] via NET/IBext_v8/2/GDRDMA
kolyoz2:25172:25828 [1] NCCL INFO Channel 05/0 : 1[1] -> 5[1] [receive] via NET/IBext_v8/1/GDRDMA
kolyoz2:25172:25828 [1] NCCL INFO Channel 01/0 : 5[1] -> 1[1] [send] via NET/IBext_v8/1/GDRDMA
kolyoz2:25172:25828 [1] NCCL INFO Channel 05/0 : 5[1] -> 1[1] [send] via NET/IBext_v8/1/GDRDMA
kolyoz2:25169:25829 [2] NCCL INFO Channel 02/0 : 6[2] -> 4[0] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 02/0 : 2[2] -> 0[0] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 03/0 : 4[0] -> 7[3] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 03/0 : 7[3] -> 3[3] [receive] via NET/IBext_v8/3/GDRDMA
kolyoz2:25169:25829 [2] NCCL INFO Channel 03/0 : 6[2] -> 4[0] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 05/0 : 4[0] -> 7[3] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 07/0 : 7[3] -> 3[3] [receive] via NET/IBext_v8/3/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 00/0 : 4[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 03/0 : 2[2] -> 0[0] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 03/0 : 3[3] -> 7[3] [send] via NET/IBext_v8/3/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 04/0 : 4[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
kolyoz1:27255:27906 [0] NCCL INFO Channel 00/0 : 0[0] -> 4[0] [send] via NET/IBext_v8/0/GDRDMA
kolyoz1:27225:27907 [3] NCCL INFO Channel 07/0 : 3[3] -> 7[3] [send] via NET/IBext_v8/3/GDRDMA
kolyoz2:25169:25829 [2] NCCL INFO Channel 06/0 : 6[2] -> 4[0] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 07/0 : 4[0] -> 7[3] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 07/0 : 6[2] -> 4[0] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Channel 04/0 : 0[0] -> 4[0] [send] via NET/IBext_v8/0/GDRDMA
kolyoz1:27224:27909 [2] NCCL INFO Channel 06/0 : 2[2] -> 0[0] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 07/0 : 2[2] -> 0[0] via P2P/CUMEM
kolyoz2:25146:25827 [0] NCCL INFO Channel 00/0 : 0[0] -> 4[0] [receive] via NET/IBext_v8/0/GDRDMA
kolyoz2:25148:25826 [3] NCCL INFO Channel 03/0 : 3[3] -> 7[3] [receive] via NET/IBext_v8/3/GDRDMA
kolyoz2:25148:25826 [3] NCCL INFO Channel 07/0 : 3[3] -> 7[3] [receive] via NET/IBext_v8/3/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 04/0 : 0[0] -> 4[0] [receive] via NET/IBext_v8/0/GDRDMA
kolyoz2:25148:25826 [3] NCCL INFO Channel 03/0 : 7[3] -> 3[3] [send] via NET/IBext_v8/3/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 00/0 : 4[0] -> 0[0] [send] via NET/IBext_v8/0/GDRDMA
kolyoz2:25148:25826 [3] NCCL INFO Channel 07/0 : 7[3] -> 3[3] [send] via NET/IBext_v8/3/GDRDMA
kolyoz2:25146:25827 [0] NCCL INFO Channel 04/0 : 4[0] -> 0[0] [send] via NET/IBext_v8/0/GDRDMA
kolyoz1:27225:27907 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 01/0 : 7[3] -> 4[0] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 03/0 : 7[3] -> 4[0] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 05/0 : 3[3] -> 0[0] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 05/0 : 7[3] -> 4[0] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 07/0 : 3[3] -> 0[0] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 07/0 : 7[3] -> 4[0] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 02/0 : 7[3] -> 5[1] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 06/0 : 7[3] -> 5[1] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 00/0 : 7[3] -> 6[2] via P2P/CUMEM
kolyoz2:25148:25826 [3] NCCL INFO Channel 04/0 : 7[3] -> 6[2] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 00/0 : 6[2] -> 5[1] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 00/0 : 5[1] -> 4[0] via P2P/CUMEM
kolyoz2:25172:25828 [1] NCCL INFO Channel 04/0 : 5[1] -> 4[0] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 01/0 : 6[2] -> 5[1] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 03/0 : 6[2] -> 5[1] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 04/0 : 6[2] -> 5[1] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 05/0 : 6[2] -> 5[1] via P2P/CUMEM
kolyoz2:25169:25829 [2] NCCL INFO Channel 07/0 : 6[2] -> 5[1] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 02/0 : 3[3] -> 1[1] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 06/0 : 3[3] -> 1[1] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
kolyoz1:27225:27907 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
kolyoz1:27223:27908 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM
kolyoz1:27224:27909 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM
kolyoz1:27255:27906 [0] NCCL INFO Connected all trees
kolyoz2:25146:25827 [0] NCCL INFO Connected all trees
kolyoz2:25146:25827 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
kolyoz2:25146:25827 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
kolyoz1:27255:27906 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
kolyoz1:27255:27906 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
kolyoz2:25172:25828 [1] NCCL INFO Connected all trees
kolyoz2:25172:25828 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
kolyoz2:25172:25828 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
kolyoz1:27224:27909 [2] NCCL INFO Connected all trees
kolyoz1:27224:27909 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
kolyoz1:27224:27909 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
kolyoz2:25148:25826 [3] NCCL INFO Connected all trees
kolyoz1:27223:27908 [1] NCCL INFO Connected all trees
kolyoz1:27223:27908 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
kolyoz2:25169:25829 [2] NCCL INFO Connected all trees
kolyoz2:25169:25829 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
kolyoz2:25169:25829 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
kolyoz1:27225:27907 [3] NCCL INFO Connected all trees
kolyoz2:25148:25826 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
kolyoz2:25148:25826 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
kolyoz1:27223:27908 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
kolyoz1:27225:27907 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
kolyoz1:27225:27907 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
kolyoz2:25146:25827 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
kolyoz2:25146:25827 [0] NCCL INFO ncclCommInitRank comm 0x559d5aa8ca00 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 42000 commId 0x996ee6f117cb8e1e - Init COMPLETE
kolyoz2:25172:25828 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
kolyoz2:25172:25828 [1] NCCL INFO ncclCommInitRank comm 0x556759b190e0 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 55000 commId 0x996ee6f117cb8e1e - Init COMPLETE
kolyoz2:25169:25829 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
kolyoz2:25169:25829 [2] NCCL INFO ncclCommInitRank comm 0x55a497ca6750 rank 6 nranks 8 cudaDev 2 nvmlDev 2 busId d4000 commId 0x996ee6f117cb8e1e - Init COMPLETE
kolyoz2:25148:25826 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
kolyoz2:25148:25826 [3] NCCL INFO ncclCommInitRank comm 0x5628b8d43360 rank 7 nranks 8 cudaDev 3 nvmlDev 3 busId e6000 commId 0x996ee6f117cb8e1e - Init COMPLETE
kolyoz1:27223:27908 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
kolyoz1:27223:27908 [1] NCCL INFO ncclCommInitRank comm 0x55e38c069f50 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 55000 commId 0x996ee6f117cb8e1e - Init COMPLETE
kolyoz1:27224:27909 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
kolyoz1:27224:27909 [2] NCCL INFO ncclCommInitRank comm 0x5644540d62c0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId d4000 commId 0x996ee6f117cb8e1e - Init COMPLETE
kolyoz1:27225:27907 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
kolyoz1:27225:27907 [3] NCCL INFO ncclCommInitRank comm 0x558b580881d0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId e6000 commId 0x996ee6f117cb8e1e - Init COMPLETE
kolyoz1:27255:27906 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
kolyoz1:27255:27906 [0] NCCL INFO ncclCommInitRank comm 0x555de9802860 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 42000 commId 0x996ee6f117cb8e1e - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

kolyoz1:27223:27936 [1] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.0.35.2<41901> with error 4, opcode 5352, len 5352, vendor err 81 (Send)
kolyoz1:27223:27936 [1] NCCL INFO transport/net.cc:1137 -> 6
kolyoz1:27223:27936 [1] NCCL INFO proxy.cc:698 -> 6
kolyoz1:27223:27936 [1] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]

kolyoz2:25172:25856 [1] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.0.35.1<50635> with error 4, opcode 5430, len 5430, vendor err 81 (Send)
kolyoz2:25172:25856 [1] NCCL INFO transport/net.cc:1137 -> 6
kolyoz2:25172:25856 [1] NCCL INFO proxy.cc:698 -> 6
kolyoz2:25172:25856 [1] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]

kolyoz1:27225:27937 [3] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.0.35.2<40737> with error 4, opcode 5450, len 5450, vendor err 81 (Send)
kolyoz1:27225:27937 [3] NCCL INFO transport/net.cc:1137 -> 6
kolyoz1:27225:27937 [3] NCCL INFO proxy.cc:698 -> 6

kolyoz2:25169:25855 [2] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.0.35.1<60143> with error 4, opcode 5287, len 5287, vendor err 81 (Send)
kolyoz2:25169:25855 [2] NCCL INFO transport/net.cc:1137 -> 6
kolyoz2:25169:25855 [2] NCCL INFO proxy.cc:698 -> 6
kolyoz2:25169:25855 [2] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]

kolyoz1:27224:27935 [2] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.0.35.2<41039> with error 4, opcode 5267, len 5267, vendor err 81 (Send)
kolyoz1:27224:27935 [2] NCCL INFO transport/net.cc:1137 -> 6
kolyoz1:27224:27935 [2] NCCL INFO proxy.cc:698 -> 6

kolyoz2:25148:25857 [3] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.0.35.1<33687> with error 4, opcode 5224, len 5224, vendor err 81 (Send)
kolyoz2:25148:25857 [3] NCCL INFO transport/net.cc:1137 -> 6
kolyoz2:25148:25857 [3] NCCL INFO proxy.cc:698 -> 6
kolyoz2:25148:25857 [3] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]
kolyoz1:27225:27937 [3] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]
kolyoz1:27224:27935 [2] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]

kolyoz1:27255:27934 [0] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.0.35.2<42949> with error 4, opcode 5270, len 5270, vendor err 81 (Send)
kolyoz1:27255:27934 [0] NCCL INFO transport/net.cc:1137 -> 6
kolyoz1:27255:27934 [0] NCCL INFO proxy.cc:698 -> 6
kolyoz1:27255:27934 [0] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]

kolyoz2:25146:25854 [0] ib_plugin.c:1105 NCCL WARN NET/IB : Got completion from peer 10.0.35.1<49189> with error 4, opcode 5327, len 5327, vendor err 81 (Send)
kolyoz2:25146:25854 [0] NCCL INFO transport/net.cc:1137 -> 6
kolyoz2:25146:25854 [0] NCCL INFO proxy.cc:698 -> 6
kolyoz2:25146:25854 [0] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]
kolyoz2:25146:25852 [0] NCCL INFO [Service thread] Connection closed by localRank 0
kolyoz2:25172:25848 [1] NCCL INFO [Service thread] Connection closed by localRank 0
kolyoz2:25148:25846 [3] NCCL INFO [Service thread] Connection closed by localRank 0
kolyoz2:25169:25849 [2] NCCL INFO [Service thread] Connection closed by localRank 0
kolyoz1:27255:27930 [0] NCCL INFO [Service thread] Connection closed by localRank 2
kolyoz1:27223:27927 [1] NCCL INFO [Service thread] Connection closed by localRank 2
kolyoz1:27225:27928 [3] NCCL INFO [Service thread] Connection closed by localRank 2
kolyoz1:27224:27926 [2] NCCL INFO [Service thread] Connection closed by localRank 2
kolyoz1:27255:27930 [0] NCCL INFO [Service thread] Connection closed by localRank 3
kolyoz1:27223:27927 [1] NCCL INFO [Service thread] Connection closed by localRank 3
kolyoz1:27224:27926 [2] NCCL INFO [Service thread] Connection closed by localRank 3
kolyoz1:27225:27928 [3] NCCL INFO [Service thread] Connection closed by localRank 3
kolyoz1:27255:27930 [0] NCCL INFO [Service thread] Connection closed by localRank 0
kolyoz1:27223:27927 [1] NCCL INFO [Service thread] Connection closed by localRank 0
kolyoz1:27224:27926 [2] NCCL INFO [Service thread] Connection closed by localRank 0
kolyoz1:27225:27928 [3] NCCL INFO [Service thread] Connection closed by localRank 0
kolyoz2:25146:25852 [0] NCCL INFO [Service thread] Connection closed by localRank 1
kolyoz2:25172:25848 [1] NCCL INFO [Service thread] Connection closed by localRank 1
kolyoz2:25169:25849 [2] NCCL INFO [Service thread] Connection closed by localRank 1
kolyoz2:25148:25846 [3] NCCL INFO [Service thread] Connection closed by localRank 1
kolyoz1:27255:27930 [0] NCCL INFO [Service thread] Connection closed by localRank 1
kolyoz1:27223:27927 [1] NCCL INFO [Service thread] Connection closed by localRank 1
kolyoz1:27224:27926 [2] NCCL INFO [Service thread] Connection closed by localRank 1
kolyoz1:27225:27928 [3] NCCL INFO [Service thread] Connection closed by localRank 1
kolyoz1:27225:27225 [3] NCCL INFO comm 0x558b580881d0 rank 3 nranks 8 cudaDev 3 busId e6000 - Abort COMPLETE
kolyoz1: Test NCCL failure common.cu:303 'remote process exited or there was a network error / '
 .. kolyoz1 pid 27225: Test failure common.cu:401
 .. kolyoz1 pid 27225: Test failure common.cu:586
 .. kolyoz1 pid 27225: Test failure all_reduce.cu:90
 .. kolyoz1 pid 27225: Test failure common.cu:613
 .. kolyoz1 pid 27225: Test failure common.cu:1017
 .. kolyoz1 pid 27225: Test failure common.cu:843
kolyoz1:27255:27255 [0] NCCL INFO comm 0x555de9802860 rank 0 nranks 8 cudaDev 0 busId 42000 - Abort COMPLETE
kolyoz1: Test NCCL failure common.cu:303 'remote process exited or there was a network error / '
 .. kolyoz1 pid 27255: Test failure common.cu:401
 .. kolyoz1 pid 27255: Test failure common.cu:586
 .. kolyoz1 pid 27255: Test failure all_reduce.cu:90
 .. kolyoz1 pid 27255: Test failure common.cu:613
 .. kolyoz1 pid 27255: Test failure common.cu:1017
 .. kolyoz1 pid 27255: Test failure common.cu:843
kolyoz1:27224:27224 [2] NCCL INFO comm 0x5644540d62c0 rank 2 nranks 8 cudaDev 2 busId d4000 - Abort COMPLETE
kolyoz1: Test NCCL failure common.cu:303 'remote process exited or there was a network error / '
 .. kolyoz1 pid 27224: Test failure common.cu:401
 .. kolyoz1 pid 27224: Test failure common.cu:586
 .. kolyoz1 pid 27224: Test failure all_reduce.cu:90
 .. kolyoz1 pid 27224: Test failure common.cu:613
 .. kolyoz1 pid 27224: Test failure common.cu:1017
 .. kolyoz1 pid 27224: Test failure common.cu:843
kolyoz1:27223:27223 [1] NCCL INFO comm 0x55e38c069f50 rank 1 nranks 8 cudaDev 1 busId 55000 - Abort COMPLETE
kolyoz1: Test NCCL failure common.cu:303 'remote process exited or there was a network error / '
 .. kolyoz1 pid 27223: Test failure common.cu:401
 .. kolyoz1 pid 27223: Test failure common.cu:586
 .. kolyoz1 pid 27223: Test failure all_reduce.cu:90
 .. kolyoz1 pid 27223: Test failure common.cu:613
 .. kolyoz1 pid 27223: Test failure common.cu:1017
 .. kolyoz1 pid 27223: Test failure common.cu:843
kolyoz2:25146:25852 [0] NCCL INFO [Service thread] Connection closed by localRank 3
kolyoz2:25172:25848 [1] NCCL INFO [Service thread] Connection closed by localRank 3
kolyoz2:25169:25849 [2] NCCL INFO [Service thread] Connection closed by localRank 3
kolyoz2:25148:25846 [3] NCCL INFO [Service thread] Connection closed by localRank 3
kolyoz2:25146:25852 [0] NCCL INFO [Service thread] Connection closed by localRank 2
kolyoz2:25172:25848 [1] NCCL INFO [Service thread] Connection closed by localRank 2
kolyoz2:25148:25846 [3] NCCL INFO [Service thread] Connection closed by localRank 2
kolyoz2:25169:25849 [2] NCCL INFO [Service thread] Connection closed by localRank 2
kolyoz2:25148:25148 [3] NCCL INFO comm 0x5628b8d43360 rank 7 nranks 8 cudaDev 3 busId e6000 - Abort COMPLETE
kolyoz2: Test NCCL failure common.cu:303 'remote process exited or there was a network error / '
 .. kolyoz2 pid 25148: Test failure common.cu:401
 .. kolyoz2 pid 25148: Test failure common.cu:586
 .. kolyoz2 pid 25148: Test failure all_reduce.cu:90
 .. kolyoz2 pid 25148: Test failure common.cu:613
 .. kolyoz2 pid 25148: Test failure common.cu:1017
 .. kolyoz2 pid 25148: Test failure common.cu:843
kolyoz2:25146:25146 [0] NCCL INFO comm 0x559d5aa8ca00 rank 4 nranks 8 cudaDev 0 busId 42000 - Abort COMPLETE
kolyoz2: Test NCCL failure common.cu:303 'remote process exited or there was a network error / '
 .. kolyoz2 pid 25146: Test failure common.cu:401
 .. kolyoz2 pid 25146: Test failure common.cu:586
 .. kolyoz2 pid 25146: Test failure all_reduce.cu:90
 .. kolyoz2 pid 25146: Test failure common.cu:613
 .. kolyoz2 pid 25146: Test failure common.cu:1017
 .. kolyoz2 pid 25146: Test failure common.cu:843
kolyoz2:25172:25172 [1] NCCL INFO comm 0x556759b190e0 rank 5 nranks 8 cudaDev 1 busId 55000 - Abort COMPLETE
kolyoz2: Test NCCL failure common.cu:303 'remote process exited or there was a network error / '
 .. kolyoz2 pid 25172: Test failure common.cu:401
 .. kolyoz2 pid 25172: Test failure common.cu:586
 .. kolyoz2 pid 25172: Test failure all_reduce.cu:90
 .. kolyoz2 pid 25172: Test failure common.cu:613
 .. kolyoz2 pid 25172: Test failure common.cu:1017
 .. kolyoz2 pid 25172: Test failure common.cu:843
kolyoz2:25169:25169 [2] NCCL INFO comm 0x55a497ca6750 rank 6 nranks 8 cudaDev 2 busId d4000 - Abort COMPLETE
kolyoz2: Test NCCL failure common.cu:303 'remote process exited or there was a network error / '
 .. kolyoz2 pid 25169: Test failure common.cu:401
 .. kolyoz2 pid 25169: Test failure common.cu:586
 .. kolyoz2 pid 25169: Test failure all_reduce.cu:90
 .. kolyoz2 pid 25169: Test failure common.cu:613
 .. kolyoz2 pid 25169: Test failure common.cu:1017
 .. kolyoz2 pid 25169: Test failure common.cu:843
AddyLaddy commented 9 hours ago

Have you disabled ACS on both nodes?

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs

If that doesn't solve it, I'd suggest testing inter-node IB connectivity between the two nodes, one NIC at a time, using something like ib_write_bw from the perftest package.
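As a rough sketch of both checks, something like the following could be run on each node. The function names (`count_acs_enabled`, `ib_bw_server`, `ib_bw_client`) and the device name `mlx5_0` are placeholders, not anything taken from your setup; list the real device names with `ibv_devices`. The ACS check follows the troubleshooting page linked above, which treats `SrcValid+` on an `ACSCtl` line as a sign that ACS may still be enabled.

```shell
# Count PCI devices whose ACS control register still reports SrcValid+.
# Reads `lspci -vvv` output on stdin; prints 0 when ACS looks disabled.
count_acs_enabled() {
  grep 'ACSCtl:' | grep -F -c 'SrcValid+' || true
}

# Start the ib_write_bw server side on one NIC (placeholder name, e.g. mlx5_0).
ib_bw_server() {
  ib_write_bw -d "$1"
}

# Start the client side on the other node: NIC name, then the server's hostname.
ib_bw_client() {
  ib_write_bw -d "$1" "$2"
}

# Example usage (not executed here):
#   sudo lspci -vvv | count_acs_enabled      # on kolyoz1 and on kolyoz2
#   ib_bw_server mlx5_0                      # on kolyoz1
#   ib_bw_client mlx5_0 kolyoz1              # on kolyoz2
```

Repeating the server/client pair for each NIC, and then in the opposite direction, should show whether one particular link or adapter is misbehaving.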