NVIDIA / nccl-tests

NCCL Tests

misc/socket.cc:441 NCCL WARN socketFinalizeAccept: wrong type 4 != 3 #188

Closed MiyazonoKaori closed 7 months ago

MiyazonoKaori commented 7 months ago

Running on a single node, or with NCCL_IB_DISABLE=1 set, works correctly. Using IB (InfiniBand) across nodes results in the following error:

root@user:/home/nccl-tests-master# mpirun --allow-run-as-root -np 16 --hostfile mpi_hosts -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              user
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   user
  Local device: mlx5_0
--------------------------------------------------------------------------
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  16899 on       user device  0 [0x27] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid  16900 on       user device  1 [0x2a] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid  16901 on       user device  2 [0x51] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid  16902 on       user device  3 [0x57] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid  16903 on       user device  4 [0x9e] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid  16904 on       user device  5 [0xa4] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid  16905 on       user device  6 [0xc7] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid  16906 on       user device  7 [0xca] NVIDIA A100-SXM4-80GB
#  Rank  8 Group  0 Pid 695831 on       user device  0 [0x27] NVIDIA A100-SXM4-80GB
#  Rank  9 Group  0 Pid 695832 on       user device  1 [0x2a] NVIDIA A100-SXM4-80GB
#  Rank 10 Group  0 Pid 695833 on       user device  2 [0x51] NVIDIA A100-SXM4-80GB
#  Rank 11 Group  0 Pid 695834 on       user device  3 [0x57] NVIDIA A100-SXM4-80GB
#  Rank 12 Group  0 Pid 695835 on       user device  4 [0x9e] NVIDIA A100-SXM4-80GB
#  Rank 13 Group  0 Pid 695836 on       user device  5 [0xa4] NVIDIA A100-SXM4-80GB
#  Rank 14 Group  0 Pid 695837 on       user device  6 [0xc7] NVIDIA A100-SXM4-80GB
#  Rank 15 Group  0 Pid 695838 on       user device  7 [0xca] NVIDIA A100-SXM4-80GB
user:16899:16899 [0] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:16899:16899 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:16899:16899 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:16899:16899 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.5+cuda12.2
user:16901:16901 [2] NCCL INFO cudaDriverVersion 12020
user:16901:16901 [2] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:16901:16901 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:16901:16901 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:16905:16905 [6] NCCL INFO cudaDriverVersion 12020
user:16905:16905 [6] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:16905:16905 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:16905:16905 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:16906:16906 [7] NCCL INFO cudaDriverVersion 12020
user:16906:16906 [7] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:16906:16906 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:16906:16906 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:16903:16903 [4] NCCL INFO cudaDriverVersion 12020
user:16903:16903 [4] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:16903:16903 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:16903:16903 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:16902:16902 [3] NCCL INFO cudaDriverVersion 12020
user:16902:16902 [3] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:16902:16902 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:16902:16902 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:16900:16900 [1] NCCL INFO cudaDriverVersion 12020
user:16900:16900 [1] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:16900:16900 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:16900:16900 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:16904:16904 [5] NCCL INFO cudaDriverVersion 12020
user:695835:695835 [4] NCCL INFO cudaDriverVersion 12020
user:16904:16904 [5] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:16904:16904 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:16904:16904 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:695835:695835 [4] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:695835:695835 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:695835:695835 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:695833:695833 [2] NCCL INFO cudaDriverVersion 12020
user:695833:695833 [2] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:695833:695833 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:695833:695833 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:695836:695836 [5] NCCL INFO cudaDriverVersion 12020
user:695836:695836 [5] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:695836:695836 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:695836:695836 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:695837:695837 [6] NCCL INFO cudaDriverVersion 12020
user:695837:695837 [6] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:695837:695837 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:695837:695837 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:695834:695834 [3] NCCL INFO cudaDriverVersion 12020
user:695834:695834 [3] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:695834:695834 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:695834:695834 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:695831:695831 [0] NCCL INFO cudaDriverVersion 12020
user:695831:695831 [0] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:695831:695831 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:695831:695831 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:695838:695838 [7] NCCL INFO cudaDriverVersion 12020
user:695832:695832 [1] NCCL INFO cudaDriverVersion 12020
user:695832:695832 [1] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:695838:695838 [7] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:695838:695838 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:695838:695838 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:695832:695832 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:695832:695832 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
[user:16867] 15 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[user:16867] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[user:16867] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
user:16899:16949 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:16899:16949 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0:12,mlx5_2:14,mlx5_6:13,mlx5_8:11
user:16905:16951 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:16905:16951 [6] NCCL INFO NCCL_IB_HCA set to mlx5_0:12,mlx5_2:14,mlx5_6:13,mlx5_8:11
user:16904:16956 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:16904:16956 [5] NCCL INFO NCCL_IB_HCA set to mlx5_0:12,mlx5_2:14,mlx5_6:13,mlx5_8:11
user:16901:16950 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:16901:16950 [2] NCCL INFO NCCL_IB_HCA set to mlx5_0:12,mlx5_2:14,mlx5_6:13,mlx5_8:11
user:16900:16955 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:16900:16955 [1] NCCL INFO NCCL_IB_HCA set to mlx5_0:12,mlx5_2:14,mlx5_6:13,mlx5_8:11
user:16902:16954 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:16902:16954 [3] NCCL INFO NCCL_IB_HCA set to mlx5_0:12,mlx5_2:14,mlx5_6:13,mlx5_8:11
user:16905:16951 [6] NCCL INFO NET/IB : No device found.
user:16905:16951 [6] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:16905:16951 [6] NCCL INFO Using network Socket
user:16899:16949 [0] NCCL INFO NET/IB : No device found.
user:16899:16949 [0] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:16899:16949 [0] NCCL INFO Using network Socket
user:16906:16952 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:16906:16952 [7] NCCL INFO NCCL_IB_HCA set to mlx5_0:12,mlx5_2:14,mlx5_6:13,mlx5_8:11
user:16904:16956 [5] NCCL INFO NET/IB : No device found.
user:16904:16956 [5] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:16904:16956 [5] NCCL INFO Using network Socket
user:16901:16950 [2] NCCL INFO NET/IB : No device found.
user:16901:16950 [2] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:16901:16950 [2] NCCL INFO Using network Socket
user:16900:16955 [1] NCCL INFO NET/IB : No device found.
user:16900:16955 [1] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:16900:16955 [1] NCCL INFO Using network Socket
user:16903:16953 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:16903:16953 [4] NCCL INFO NCCL_IB_HCA set to mlx5_0:12,mlx5_2:14,mlx5_6:13,mlx5_8:11
user:16902:16954 [3] NCCL INFO NET/IB : No device found.
user:16902:16954 [3] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:16902:16954 [3] NCCL INFO Using network Socket
user:16906:16952 [7] NCCL INFO NET/IB : No device found.
user:16906:16952 [7] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:16906:16952 [7] NCCL INFO Using network Socket
user:695836:695883 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [2]mlx5_4:1/RoCE [RO]; OOB ibs85f0:192.168.1.14<0>
user:695836:695883 [5] NCCL INFO Using network IB
user:695832:695887 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [2]mlx5_4:1/RoCE [RO]; OOB ibs85f0:192.168.1.14<0>
user:695832:695887 [1] NCCL INFO Using network IB
user:695837:695884 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [2]mlx5_4:1/RoCE [RO]; OOB ibs85f0:192.168.1.14<0>
user:695837:695884 [6] NCCL INFO Using network IB
user:695835:695881 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [2]mlx5_4:1/RoCE [RO]; OOB ibs85f0:192.168.1.14<0>
user:695835:695881 [4] NCCL INFO Using network IB
user:695831:695886 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [2]mlx5_4:1/RoCE [RO]; OOB ibs85f0:192.168.1.14<0>
user:695831:695886 [0] NCCL INFO Using network IB
user:695833:695882 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [2]mlx5_4:1/RoCE [RO]; OOB ibs85f0:192.168.1.14<0>
user:695833:695882 [2] NCCL INFO Using network IB
user:695834:695885 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [2]mlx5_4:1/RoCE [RO]; OOB ibs85f0:192.168.1.14<0>
user:695834:695885 [3] NCCL INFO Using network IB
user:16903:16953 [4] NCCL INFO NET/IB : No device found.
user:16903:16953 [4] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:16903:16953 [4] NCCL INFO Using network Socket
user:695838:695888 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [2]mlx5_4:1/RoCE [RO]; OOB ibs85f0:192.168.1.14<0>
user:695838:695888 [7] NCCL INFO Using network IB
user:16901:16950 [2] NCCL INFO comm 0x55bb002182d0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 51000 commId 0x88850de8e238e2f8 - Init START
user:16906:16952 [7] NCCL INFO comm 0x563499ab4560 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId ca000 commId 0x88850de8e238e2f8 - Init START
user:16900:16955 [1] NCCL INFO comm 0x563bebc16500 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x88850de8e238e2f8 - Init START
user:16899:16949 [0] NCCL INFO comm 0x559dbd753cb0 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 27000 commId 0x88850de8e238e2f8 - Init START
user:16904:16956 [5] NCCL INFO comm 0x55f8ad946270 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId a4000 commId 0x88850de8e238e2f8 - Init START
user:16902:16954 [3] NCCL INFO comm 0x55558e2713f0 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 57000 commId 0x88850de8e238e2f8 - Init START
user:16905:16951 [6] NCCL INFO comm 0x5570b6b05aa0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId c7000 commId 0x88850de8e238e2f8 - Init START
user:16903:16953 [4] NCCL INFO comm 0x55cc037d1610 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 9e000 commId 0x88850de8e238e2f8 - Init START
user:695837:695884 [6] NCCL INFO comm 0x5591ff414f80 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId c7000 commId 0x88850de8e238e2f8 - Init START
user:695836:695883 [5] NCCL INFO comm 0x55e6d4aecca0 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId a4000 commId 0x88850de8e238e2f8 - Init START
user:695835:695881 [4] NCCL INFO comm 0x564d60cbcf70 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 9e000 commId 0x88850de8e238e2f8 - Init START
user:695838:695888 [7] NCCL INFO comm 0x564e77ea7950 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId ca000 commId 0x88850de8e238e2f8 - Init START
user:695834:695885 [3] NCCL INFO comm 0x55958ebf6590 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 57000 commId 0x88850de8e238e2f8 - Init START
user:695831:695886 [0] NCCL INFO comm 0x56438ce02d20 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 27000 commId 0x88850de8e238e2f8 - Init START
user:695832:695887 [1] NCCL INFO comm 0x5583e6efbf90 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x88850de8e238e2f8 - Init START
user:695833:695882 [2] NCCL INFO comm 0x562671f34b70 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 51000 commId 0x88850de8e238e2f8 - Init START
user:695836:695883 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:695836:695883 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:695837:695884 [6] NCCL INFO NVLS multicast support is not available on dev 6
user:695833:695882 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:695833:695882 [2] NCCL INFO NVLS multicast support is not available on dev 2
user:695835:695881 [4] NCCL INFO NVLS multicast support is not available on dev 4
user:695832:695887 [1] NCCL INFO NVLS multicast support is not available on dev 1
user:695834:695885 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:695831:695886 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:695831:695886 [0] NCCL INFO NVLS multicast support is not available on dev 0
user:695838:695888 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:695838:695888 [7] NCCL INFO NVLS multicast support is not available on dev 7
user:16900:16955 [1] NCCL INFO NVLS multicast support is not available on dev 1
user:16904:16956 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:16904:16956 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:16906:16952 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:16906:16952 [7] NCCL INFO NVLS multicast support is not available on dev 7
user:16902:16954 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:16899:16949 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:16899:16949 [0] NCCL INFO NVLS multicast support is not available on dev 0
user:16905:16951 [6] NCCL INFO NVLS multicast support is not available on dev 6
user:16901:16950 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:16901:16950 [2] NCCL INFO NVLS multicast support is not available on dev 2
user:16903:16953 [4] NCCL INFO NVLS multicast support is not available on dev 4
user:16906:16952 [7] NCCL INFO Trees [0] 0/-1/-1->7->6 [1] 0/-1/-1->7->6
user:16906:16952 [7] NCCL INFO P2P Chunksize set to 131072
user:16903:16953 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
user:16903:16953 [4] NCCL INFO P2P Chunksize set to 131072
user:16899:16949 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7   8  15  14  13  12  11  10   9
user:16899:16949 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7   8  15  14  13  12  11  10   9
user:16899:16949 [0] NCCL INFO Trees [0] 1/-1/-1->0->7 [1] 1/-1/-1->0->7
user:16899:16949 [0] NCCL INFO P2P Chunksize set to 131072
user:16905:16951 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
user:16905:16951 [6] NCCL INFO P2P Chunksize set to 131072
user:16904:16956 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
user:16904:16956 [5] NCCL INFO P2P Chunksize set to 131072
user:16901:16950 [2] NCCL INFO Trees [0] 3/10/-1->2->-1 [1] 3/-1/-1->2->10
user:16901:16950 [2] NCCL INFO P2P Chunksize set to 131072
user:16902:16954 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
user:16902:16954 [3] NCCL INFO P2P Chunksize set to 131072
user:16900:16955 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
user:16900:16955 [1] NCCL INFO P2P Chunksize set to 131072
user:695836:695883 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12
user:695836:695883 [5] NCCL INFO P2P Chunksize set to 131072
user:695837:695884 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13
user:695837:695884 [6] NCCL INFO P2P Chunksize set to 131072
user:695838:695888 [7] NCCL INFO Trees [0] 8/-1/-1->15->14 [1] 8/-1/-1->15->14
user:695838:695888 [7] NCCL INFO P2P Chunksize set to 131072
user:695835:695881 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11
user:695835:695881 [4] NCCL INFO P2P Chunksize set to 131072
user:695832:695887 [1] NCCL INFO Trees [0] -1/-1/-1->9->8 [1] -1/-1/-1->9->8
user:695832:695887 [1] NCCL INFO P2P Chunksize set to 131072
user:695831:695886 [0] NCCL INFO Trees [0] 9/-1/-1->8->15 [1] 9/-1/-1->8->15
user:695831:695886 [0] NCCL INFO P2P Chunksize set to 131072
user:695833:695882 [2] NCCL INFO Trees [0] 11/-1/-1->10->2 [1] 11/2/-1->10->-1
user:695833:695882 [2] NCCL INFO P2P Chunksize set to 131072
user:695834:695885 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10
user:695834:695885 [3] NCCL INFO P2P Chunksize set to 131072
user:16899:16949 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read
user:16899:16949 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read
user:695837:695884 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/IPC/read
user:695836:695883 [5] NCCL INFO Channel 00/0 : 13[5] -> 12[4] via P2P/IPC/read
user:695835:695881 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/IPC/read
user:695833:695882 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/IPC/read
user:695837:695884 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/IPC/read
user:695834:695885 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via P2P/IPC/read
user:695836:695883 [5] NCCL INFO Channel 01/0 : 13[5] -> 12[4] via P2P/IPC/read
user:695835:695881 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/IPC/read
user:695833:695882 [2] NCCL INFO Channel 01/0 : 10[2] -> 9[1] via P2P/IPC/read
user:695834:695885 [3] NCCL INFO Channel 01/0 : 11[3] -> 10[2] via P2P/IPC/read
user:16900:16955 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read
user:16902:16954 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read
user:16903:16953 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read
user:16900:16955 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read
user:16901:16950 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read
user:16902:16954 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read
user:16903:16953 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read
user:16901:16950 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read
user:695831:695886 [0] NCCL INFO Channel 00/0 : 7[7] -> 8[0] [receive] via NET/IB/0
user:695831:695886 [0] NCCL INFO Channel 01/0 : 7[7] -> 8[0] [receive] via NET/IB/0
user:695832:695887 [1] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [send] via NET/IB/0
user:695832:695887 [1] NCCL INFO Channel 01/0 : 9[1] -> 0[0] [send] via NET/IB/0
user:16904:16956 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read
user:695836:695883 [5] NCCL INFO Connected all rings
user:16906:16952 [7] NCCL INFO Channel 00/0 : 7[7] -> 8[0] [send] via NET/Socket/3
user:16906:16952 [7] NCCL INFO Channel 01/0 : 7[7] -> 8[0] [send] via NET/Socket/3
user:695834:695885 [3] NCCL INFO Connected all rings
user:16904:16956 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC/read
user:695835:695881 [4] NCCL INFO Connected all rings
user:695836:695883 [5] NCCL INFO Channel 00/0 : 13[5] -> 14[6] via P2P/IPC/read
user:695831:695886 [0] NCCL INFO Channel 00/0 : 8[0] -> 15[7] via P2P/IPC/read
user:16899:16949 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
user:695836:695883 [5] NCCL INFO Channel 01/0 : 13[5] -> 14[6] via P2P/IPC/read
user:16899:16949 [0] NCCL INFO Channel 01/0 : 9[1] -> 0[0] [receive] via NET/Socket/0
user:695834:695885 [3] NCCL INFO Channel 00/0 : 11[3] -> 12[4] via P2P/IPC/read
user:695831:695886 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/IPC/read
user:695835:695881 [4] NCCL INFO Channel 00/0 : 12[4] -> 13[5] via P2P/IPC/read
user:695834:695885 [3] NCCL INFO Channel 01/0 : 11[3] -> 12[4] via P2P/IPC/read
user:695838:695888 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/IPC/read
user:695835:695881 [4] NCCL INFO Channel 01/0 : 12[4] -> 13[5] via P2P/IPC/read
user:16905:16951 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/IPC/read
user:695838:695888 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/IPC/read
user:695835:695881 [4] NCCL INFO Connected all trees
user:695835:695881 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:695835:695881 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:16905:16951 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/IPC/read

user:16899:16961 [0] misc/socket.cc:441 NCCL WARN socketFinalizeAccept: wrong type 4 != 3
user:16899:16961 [0] NCCL INFO misc/socket.cc:561 -> 3
user:16899:16961 [0] NCCL INFO misc/socket.cc:586 -> 3
user:16899:16961 [0] NCCL INFO transport/net_socket.cc:378 -> 3
user:16899:16961 [0] NCCL INFO transport/net.cc:728 -> 3
user:16899:16961 [0] NCCL INFO proxy.cc:1306 -> 3

user:16899:16961 [0] proxy.cc:1485 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

user:16899:16961 [0] proxy.cc:1519 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
user:16902:16954 [3] NCCL INFO Connected all rings
user:16902:16954 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read
user:16900:16955 [1] NCCL INFO Connected all rings
user:16900:16955 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read
user:16902:16954 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read
user:695833:695882 [2] NCCL INFO Connected all rings
user:695833:695882 [2] NCCL INFO Channel 00/0 : 10[2] -> 11[3] via P2P/IPC/read
user:16900:16955 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read
user:16903:16953 [4] NCCL INFO Connected all rings
user:16901:16950 [2] NCCL INFO Connected all rings
user:695833:695882 [2] NCCL INFO Channel 01/0 : 10[2] -> 11[3] via P2P/IPC/read
user:695837:695884 [6] NCCL INFO Connected all rings
user:695838:695888 [7] NCCL INFO Connected all rings

user:695831:695922 [0] misc/socket.cc:441 NCCL WARN socketFinalizeAccept: wrong type 3 != 4
user:695831:695922 [0] NCCL INFO misc/socket.cc:561 -> 3
user:695831:695922 [0] NCCL INFO misc/socket.cc:586 -> 3
user:695831:695922 [0] NCCL INFO transport/net_ib.cc:746 -> 3
user:695831:695922 [0] NCCL INFO transport/net.cc:728 -> 3
user:695831:695922 [0] NCCL INFO proxy.cc:1306 -> 3

user:695831:695922 [0] proxy.cc:1485 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

user:695831:695922 [0] proxy.cc:1519 NCCL WARN [Proxy Service 8] Failed to execute operation Connect from rank 8, retcode 3
user:695837:695884 [6] NCCL INFO Channel 00/0 : 14[6] -> 15[7] via P2P/IPC/read
user:695834:695885 [3] NCCL INFO Connected all trees
user:695834:695885 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:695834:695885 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:695837:695884 [6] NCCL INFO Channel 01/0 : 14[6] -> 15[7] via P2P/IPC/read
user:695838:695888 [7] NCCL INFO Channel 00/0 : 15[7] -> 8[0] via P2P/IPC/read
user:16903:16953 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read
user:695838:695888 [7] NCCL INFO Channel 01/0 : 15[7] -> 8[0] via P2P/IPC/read
user:16901:16950 [2] NCCL INFO Channel 00/0 : 10[2] -> 2[2] [receive] via NET/Socket/0
user:695833:695882 [2] NCCL INFO Channel 00/0 : 2[2] -> 10[2] [receive] via NET/IB/0
user:695833:695882 [2] NCCL INFO Channel 01/0 : 2[2] -> 10[2] [receive] via NET/IB/0
user:695833:695882 [2] NCCL INFO Channel 00/0 : 10[2] -> 2[2] [send] via NET/IB/0
user:695833:695882 [2] NCCL INFO Channel 01/0 : 10[2] -> 2[2] [send] via NET/IB/0
user:16901:16950 [2] NCCL INFO Channel 01/0 : 10[2] -> 2[2] [receive] via NET/Socket/0
user:695836:695883 [5] NCCL INFO Connected all trees
user:695836:695883 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:695836:695883 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:16901:16950 [2] NCCL INFO Channel 00/0 : 2[2] -> 10[2] [send] via NET/Socket/0
user:695837:695884 [6] NCCL INFO Connected all trees
user:695837:695884 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:695837:695884 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:16901:16950 [2] NCCL INFO Channel 01/0 : 2[2] -> 10[2] [send] via NET/Socket/0
user:16903:16953 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read
user:16904:16956 [5] NCCL INFO Connected all rings
user:16905:16951 [6] NCCL INFO Connected all rings
user:16904:16956 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read

user:695833:695920 [2] misc/socket.cc:441 NCCL WARN socketFinalizeAccept: wrong type 3 != 4
user:695833:695920 [2] NCCL INFO misc/socket.cc:561 -> 3
user:695833:695920 [2] NCCL INFO misc/socket.cc:665 -> 3
user:695833:695920 [2] NCCL INFO transport/net_ib.cc:743 -> 3
user:695833:695920 [2] NCCL INFO transport/net.cc:728 -> 3
user:695833:695920 [2] NCCL INFO proxy.cc:1306 -> 3
user:695833:695920 [2] NCCL INFO proxy.cc:1377 -> 3

user:695833:695920 [2] proxy.cc:1519 NCCL WARN [Proxy Service 10] Failed to execute operation Connect from rank 10, retcode 3
user:16904:16956 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read
user:16905:16951 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read
user:16902:16954 [3] NCCL INFO Connected all trees
user:16902:16954 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:16902:16954 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:16905:16951 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read

user:16901:16967 [2] misc/socket.cc:441 NCCL WARN socketFinalizeAccept: wrong type 4 != 3
user:16901:16967 [2] NCCL INFO misc/socket.cc:561 -> 3
user:16901:16967 [2] NCCL INFO misc/socket.cc:665 -> 3
user:16901:16967 [2] NCCL INFO transport/net_socket.cc:375 -> 3
user:16901:16967 [2] NCCL INFO transport/net.cc:728 -> 3
user:16901:16967 [2] NCCL INFO proxy.cc:1306 -> 3
user:16901:16967 [2] NCCL INFO proxy.cc:1377 -> 3

user:16901:16967 [2] proxy.cc:1519 NCCL WARN [Proxy Service 2] Failed to execute operation Connect from rank 2, retcode 3
user:16903:16953 [4] NCCL INFO Connected all trees
user:16903:16953 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:16903:16953 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:16904:16956 [5] NCCL INFO Connected all trees
user:16904:16956 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:16904:16956 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:16906:16952 [7] NCCL INFO Connected all rings

user:695833:695882 [2] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer user<34511>
user:695833:695882 [2] NCCL INFO misc/socket.cc:749 -> 6

user:695833:695882 [2] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f085674fbe0
user:695833:695882 [2] NCCL INFO transport/net.cc:288 -> 3
user:695833:695882 [2] NCCL INFO transport.cc:148 -> 3
user:695833:695882 [2] NCCL INFO init.cc:1089 -> 3
user:695833:695882 [2] NCCL INFO init.cc:1358 -> 3
user:695833:695882 [2] NCCL INFO group.cc:65 -> 3 [Async thread]
user:695833:695833 [2] NCCL INFO group.cc:406 -> 3
user:695833:695833 [2] NCCL INFO group.cc:96 -> 3

user:695831:695886 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer user<44669>
user:695831:695886 [0] NCCL INFO misc/socket.cc:749 -> 6

user:695831:695886 [0] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f01a2750988
user:695831:695886 [0] NCCL INFO transport/net.cc:362 -> 3
user:695831:695886 [0] NCCL INFO transport.cc:168 -> 3
user:695831:695886 [0] NCCL INFO init.cc:1079 -> 3
user:695831:695886 [0] NCCL INFO init.cc:1358 -> 3
user:695831:695886 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
user: Test NCCL failure common.cu:961 'internal error - please report this issue to the NCCL developers / '
 .. user pid 695833: Test failure common.cu:844

user:695833:695920 [32520] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:695833:695920 [32520] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:695833:695920 [32520] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:695833:695920 [32520] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:695833:695920 [32520] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:695833:695920 [32520] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'
user:695831:695831 [0] NCCL INFO group.cc:406 -> 3
user:695831:695831 [0] NCCL INFO group.cc:96 -> 3
user: Test NCCL failure common.cu:961 'internal error - please report this issue to the NCCL developers / '
 .. user pid 695831: Test failure common.cu:844

user:695831:695922 [32513] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:695831:695922 [32513] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:16901:16950 [2] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer user<49695>
user:16901:16950 [2] NCCL INFO misc/socket.cc:749 -> 6

user:16901:16950 [2] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7fdf26755630
user:16901:16950 [2] NCCL INFO transport/net.cc:362 -> 3
user:16901:16950 [2] NCCL INFO transport.cc:168 -> 3
user:16901:16950 [2] NCCL INFO init.cc:1089 -> 3
user:16901:16950 [2] NCCL INFO init.cc:1358 -> 3
user:16901:16950 [2] NCCL INFO group.cc:65 -> 3 [Async thread]

user:16899:16949 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer user<37597>
user:16899:16949 [0] NCCL INFO misc/socket.cc:749 -> 6

user:16899:16949 [0] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7fde9e755428
user:16899:16949 [0] NCCL INFO transport/net.cc:362 -> 3
user:16899:16949 [0] NCCL INFO transport.cc:168 -> 3
user:16899:16949 [0] NCCL INFO init.cc:1079 -> 3
user:16899:16949 [0] NCCL INFO init.cc:1358 -> 3
user:16899:16949 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
user:16901:16901 [2] NCCL INFO group.cc:406 -> 3
user:16901:16901 [2] NCCL INFO group.cc:96 -> 3
user: Test NCCL failure common.cu:961 'internal error - please report this issue to the NCCL developers / '
 .. user pid 16901: Test failure common.cu:844
user:16899:16899 [0] NCCL INFO group.cc:406 -> 3
user:16899:16899 [0] NCCL INFO group.cc:96 -> 3
user: Test NCCL failure common.cu:961 'internal error - please report this issue to the NCCL developers / '
 .. user pid 16899: Test failure common.cu:844

user:16901:16967 [32735] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:16901:16967 [32735] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:16901:16967 [32735] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:16901:16967 [32735] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:16901:16967 [909195890] include/alloc.h:39 NCCL WARN Cuda failure 'driver shutting down'
user:16901:16967 [32735] NCCL INFO transport/net.cc:836 -> 1
user:16901:16967 [1936613746] NCCL INFO proxy.cc:963 -> 1
user:16901:16967 [1936613746] NCCL INFO proxy.cc:979 -> 1

user:16899:16961 [32735] include/alloc.h:250 NCCL WARN Cuda failure 'driver shutting down'

user:16899:16961 [32735] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:695832:695921 [1] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer 192.168.1.10<48899>
user:695832:695921 [1] NCCL INFO misc/socket.cc:749 -> 6
user:695832:695921 [1] NCCL INFO transport/net_ib.cc:693 -> 6
user:695832:695921 [1] NCCL INFO transport/net.cc:592 -> 6
user:695832:695921 [1] NCCL INFO proxy.cc:1306 -> 6
user:695832:695921 [1] NCCL INFO proxy.cc:1377 -> 6

user:695832:695921 [1] proxy.cc:1519 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 9, retcode 6

user:695832:695887 [1] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer user<53125>
user:695832:695887 [1] NCCL INFO misc/socket.cc:749 -> 6

user:695832:695887 [1] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f980274f790
user:695832:695887 [1] NCCL INFO transport/net.cc:288 -> 3
user:695832:695887 [1] NCCL INFO transport.cc:148 -> 3
user:695832:695887 [1] NCCL INFO init.cc:1079 -> 3
user:695832:695887 [1] NCCL INFO init.cc:1358 -> 3
user:695832:695887 [1] NCCL INFO group.cc:65 -> 3 [Async thread]
user:695832:695832 [1] NCCL INFO group.cc:406 -> 3
user:695832:695832 [1] NCCL INFO group.cc:96 -> 3
user: Test NCCL failure common.cu:961 'internal error - please report this issue to the NCCL developers / '
 .. user pid 695832: Test failure common.cu:844

user:695832:695921 [32664] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'

user:695832:695921 [32664] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54064,1],8]
  Exit code:    3
--------------------------------------------------------------------------
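
The `-> 3` and `-> 6` suffixes in the trace lines above are NCCL result codes propagated up the call stack. A minimal sketch for decoding them, assuming the `ncclResult_t` values declared in `nccl.h` for this NCCL generation; the helper name `decode_trace` is made up for illustration:

```python
import re

# ncclResult_t values as declared in nccl.h (assumed to match NCCL 2.18).
NCCL_RESULT = {
    0: "ncclSuccess",
    1: "ncclUnhandledCudaError",
    2: "ncclSystemError",
    3: "ncclInternalError",
    4: "ncclInvalidArgument",
    5: "ncclInvalidUsage",
    6: "ncclRemoteError",
    7: "ncclInProgress",
}

# Matches trace lines such as "NCCL INFO misc/socket.cc:749 -> 6".
TRACE = re.compile(r"NCCL INFO (\S+):(\d+) -> (\d+)")


def decode_trace(line):
    """Return (source file, line number, result name), or None if not a trace line."""
    m = TRACE.search(line)
    if m is None:
        return None
    path, lineno, code = m.group(1), int(m.group(2)), int(m.group(3))
    return path, lineno, NCCL_RESULT.get(code, "unknown")


if __name__ == "__main__":
    # Lines copied from the log above.
    for line in [
        "user:695833:695882 [2] NCCL INFO misc/socket.cc:749 -> 6",
        "user:695833:695882 [2] NCCL INFO transport/net.cc:288 -> 3",
        "user:695833:695882 [2] NCCL INFO group.cc:65 -> 3 [Async thread]",
    ]:
        print(decode_trace(line))
```

Read this way, the socket layer reports a remote error (the peer closed the connection), which the layers above turn into the generic internal error that the test finally prints.
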
sjeaugey commented 7 months ago

This is basically two ranks complaining that they are not using the same transport (one is using sockets, the other is using IB). You can see this in the log: some ranks are in net_ib.cc while others are in net_socket.cc.
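
A rough way to check for this kind of mismatch is to group the debug log by process and record which transport each one mentions. The sketch below is only an illustration: the function names are invented here, and it assumes the `host:pid:tid [dev]` line prefix seen in this thread. Because both nodes report the hostname `user`, processes are keyed by PID as well.

```python
import re
from collections import defaultdict

# "host:pid:tid [dev] ..." prefix used by NCCL_DEBUG=INFO lines in this thread.
PREFIX = re.compile(r"^(\S+?):(\d+):\d+ \[\d+\] ")
USING = re.compile(r"NCCL INFO Using network (\S+)")
# Source files that indicate which network transport a process is executing.
MARKERS = {"net_ib.cc": "IB", "net_socket.cc": "Socket"}


def transports_by_process(lines):
    """Map (host, pid) to the set of transports that process mentions."""
    seen = defaultdict(set)
    for line in lines:
        m = PREFIX.match(line)
        if m is None:
            continue
        proc = (m.group(1), int(m.group(2)))
        using = USING.search(line)
        if using:
            seen[proc].add(using.group(1))
        for marker, name in MARKERS.items():
            if marker in line:
                seen[proc].add(name)
    return dict(seen)


def is_mixed(seen):
    """True if the processes do not all agree on a single transport."""
    names = set()
    for transports in seen.values():
        names |= transports
    return len(names) > 1


# Two lines copied from different runs in this thread, used here only to
# exercise the parser; a real check would read one complete log.
SAMPLE = [
    "user:695832:695921 [1] NCCL INFO transport/net_ib.cc:693 -> 6",
    "user:852750:852802 [0] NCCL INFO Using network Socket",
]

if __name__ == "__main__":
    seen = transports_by_process(SAMPLE)
    print(seen, "mixed:", is_mixed(seen))
```

On a full log from a single run, a mixed result would point to one node falling back to sockets while the other initializes IB.
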

MiyazonoKaori commented 7 months ago

@sjeaugey How should I fix this error? Should I modify the environment variables or reinstall NCCL? This is my network environment:

`
root@user:/home/user# ibstat
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.39.1002
    Hardware version: 0
    Node GUID: 0xe8ebd30300229550
    System image GUID: 0xe8ebd30300229550
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 9
        LMC: 0
        SM lid: 9
        Capability mask: 0xa651e84a
        Port GUID: 0xe8ebd30300229550
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.39.1002
    Hardware version: 0
    Node GUID: 0xe8ebd30300229551
    System image GUID: 0xe8ebd30300229550
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0xa651e848
        Port GUID: 0xe8ebd30300229551
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.39.1002
    Hardware version: 0
    Node GUID: 0xb83fd203001ed0e4
    System image GUID: 0xb83fd203001ed0e4
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 10
        LMC: 0
        SM lid: 9
        Capability mask: 0xa651e848
        Port GUID: 0xb83fd203001ed0e4
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.39.1002
    Hardware version: 0
    Node GUID: 0xb83fd203001ed0e5
    System image GUID: 0xb83fd203001ed0e4
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0xa651e848
        Port GUID: 0xb83fd203001ed0e5
        Link layer: InfiniBand
CA 'mlx5_4'
    CA type: MT4117
    Number of ports: 1
    Firmware version: 14.32.1010
    Hardware version: 0
    Node GUID: 0xb83fd20300283fda
    System image GUID: 0xb83fd20300283fda
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 2.5
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xba3fd2fffe283fda
        Link layer: Ethernet
CA 'mlx5_5'
    CA type: MT4117
    Number of ports: 1
    Firmware version: 14.32.1010
    Hardware version: 0
    Node GUID: 0xb83fd20300283fdb
    System image GUID: 0xb83fd20300283fda
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xba3fd2fffe283fdb
        Link layer: Ethernet
root@user:/home/user# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
4: usb0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether e6:cf:fa:a4:5b:58 brd ff:ff:ff:ff:ff:ff
27: ens97f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b8:3f:d2:28:3f:da brd ff:ff:ff:ff:ff:ff
    inet 10.42.45.2/16 brd 10.42.255.255 scope global ens97f0np0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba3f:d2ff:fe28:3fda/64 scope link
       valid_lft forever preferred_lft forever
28: ens97f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:3f:d2:28:3f:db brd ff:ff:ff:ff:ff:ff
29: ibs85f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:06:8b:fe:80:00:00:00:00:00:00:e8:eb:d3:03:00:22:95:50 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.1.14/24 brd 192.168.1.255 scope global ibs85f0
       valid_lft forever preferred_lft forever
    inet6 fe80::eaeb:d303:22:9550/64 scope link
       valid_lft forever preferred_lft forever
30: ibs85f1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 00:00:11:49:fe:80:00:00:00:00:00:00:e8:eb:d3:03:00:22:95:51 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
31: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:b8:3f:d2:03:00:1e:d0:e4 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.1.15/24 brd 192.168.1.255 scope global ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba3f:d203:1e:d0e4/64 scope link
       valid_lft forever preferred_lft forever
32: ib1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 00:00:11:49:fe:80:00:00:00:00:00:00:b8:3f:d2:03:00:1e:d0:e5 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
33: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:73:55:e8:80 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:73ff:fe55:e880/64 scope link
       valid_lft forever preferred_lft forever
35: veth3bb818d@if34: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether da:33:f7:94:b5:52 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::d833:f7ff:fe94:b552/64 scope link
       valid_lft forever preferred_lft forever

~/.bashrc

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda

export MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1

export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_0:9,mlx5_2:10
export NCCL_DEBUG=INFO
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

`

sjeaugey commented 7 months ago

It looks like you want to use InfiniBand, so make sure your InfiniBand setup is working on both nodes. Otherwise, you could set NCCL_IB_DISABLE=1 to use sockets, but it will be much slower.
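
One thing that may be worth checking as part of the setup, though nothing in this thread confirms it as the cause: the NCCL documentation describes `NCCL_IB_HCA` entries as a device name optionally followed by `:port`, whereas the `.bashrc` above uses `mlx5_0:9,mlx5_2:10`, and 9 and 10 match the `Base lid` values in the `ibstat` output rather than a port number (each adapter reports `Number of ports: 1`). A minimal sketch of that sanity check, with hypothetical helper names:

```python
def parse_ib_hca(value):
    """Split an NCCL_IB_HCA value into (device, port) pairs; port is None if omitted."""
    value = value.lstrip("^=")  # optional exclude / exact-match prefixes
    pairs = []
    for item in value.split(","):
        dev, _, port = item.partition(":")
        pairs.append((dev, int(port) if port else None))
    return pairs


def suspicious_ports(value, ports_per_device):
    """Return entries whose port falls outside 1..N for a device with N ports."""
    bad = []
    for dev, port in parse_ib_hca(value):
        n = ports_per_device.get(dev)
        if port is not None and n is not None and not (1 <= port <= n):
            bad.append((dev, port))
    return bad


# Port counts taken from the ibstat output in this thread ("Number of ports: 1").
PORTS = {"mlx5_0": 1, "mlx5_2": 1}

if __name__ == "__main__":
    print(suspicious_ports("mlx5_0:9,mlx5_2:10", PORTS))  # value from the .bashrc
    print(suspicious_ports("mlx5_0:1,mlx5_2:1", PORTS))   # same devices, port 1
```

If the check flags the configured value, trying `NCCL_IB_HCA=mlx5_0:1,mlx5_2:1` (or leaving the variable unset) on both nodes would be a cheap experiment.
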

MiyazonoKaori commented 7 months ago

@sjeaugey Yes, I want to use InfiniBand. Testing with ibping shows that the two nodes are connected. When I set NCCL_IB_DISABLE=1, nccl-tests works fine, but strangely its bandwidth is much higher than that of the fiber network (100MB/s) yet lower than IB's bandwidth (20GB/s). This is very confusing to me; I don't know what the problem is or how to fix it. Thank you for your help.
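
A likely explanation for the in-between bandwidth is that the socket transport is running over IPoIB: the NCCL_IB_DISABLE=1 log reports `NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>`, and `ip a` lists both interfaces as `link/infiniband`. Traffic would then still cross the IB fabric, but through the TCP/IP stack instead of RDMA. The sketch below extracts the interfaces from such a line; the "name starts with `ib`" test is only a heuristic assumed for this setup.

```python
import re

# Matches entries such as "[0]ibs85f0:192.168.1.14<0>" in a NET/Socket line.
IFACE = re.compile(r"\[(\d+)\](\S+?):(\d+\.\d+\.\d+\.\d+)<\d+>")


def socket_interfaces(line):
    """Return [(interface, address), ...] from a 'NET/Socket : Using' line."""
    if "NET/Socket : Using" not in line:
        return []
    return [(name, addr) for _, name, addr in IFACE.findall(line)]


def looks_like_ipoib(name):
    # Heuristic (assumption): IPoIB devices on these nodes are named ib*.
    return name.startswith("ib")


if __name__ == "__main__":
    # Line copied from the NCCL_IB_DISABLE=1 log in this thread.
    line = ("user:852750:852802 [0] NCCL INFO NET/Socket : Using "
            "[0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>")
    for name, addr in socket_interfaces(line):
        print(name, addr, "IPoIB?", looks_like_ipoib(name))
```
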

ibping:

node1:
root@user:/home/user# 
root@user:/home/user# 
root@user:/home/user# sudo ibping -S -C mlx5_0 -P 1
^C
root@user:/home/user# sudo ibping -S -C mlx5_2 -P 1
^C
root@user:/home/user# sudo ibping -S -C mlx5_6 -P 1
^C
root@user:/home/user# sudo ibping -S -C mlx5_8 -P 1
^C
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 9

--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7036 ms
rtt min/avg/max = 0.031/0.703/900.078 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 9

--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7922 ms
rtt min/avg/max = 0.021/0.792/900.080 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_6 -P 1 -L 9

--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7940 ms
rtt min/avg/max = 0.032/0.793/900.080 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_8 -P 1 -L 9

--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7938 ms
rtt min/avg/max = 0.033/0.793/900.078 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 10

--- user.(none) (Lid 10) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7046 ms
rtt min/avg/max = 0.031/0.704/900.080 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 10

--- user.(none) (Lid 10) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7931 ms
rtt min/avg/max = 0.026/0.793/900.078 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_6 -P 1 -L 10

--- user.(none) (Lid 10) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7981 ms
rtt min/avg/max = 0.027/0.798/900.085 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_8 -P 1 -L 10

--- user.(none) (Lid 10) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7977 ms
rtt min/avg/max = 0.032/0.797/900.079 ms
root@user:/home/user# 

node2:
root@user:/home/nccl-tests-master# 
root@user:/home/nccl-tests-master# 
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 1

--- user.(none) (Lid 1) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7967 ms
rtt min/avg/max = 0.034/0.796/900.086 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 1

--- user.(none) (Lid 1) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7953 ms
rtt min/avg/max = 0.033/0.795/900.086 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 2

--- user.(none) (Lid 2) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7063 ms
rtt min/avg/max = 0.032/0.706/900.084 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 2

--- user.(none) (Lid 2) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7064 ms
rtt min/avg/max = 0.034/0.706/900.081 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 3

--- user.(none) (Lid 3) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7072 ms
rtt min/avg/max = 0.031/0.707/900.080 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 3

--- user.(none) (Lid 3) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7067 ms
rtt min/avg/max = 0.025/0.706/900.082 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 4

--- user.(none) (Lid 4) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7973 ms
rtt min/avg/max = 0.020/0.797/900.086 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 4

--- user.(none) (Lid 4) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7067 ms
rtt min/avg/max = 0.031/0.706/900.087 ms
root@user:/home/nccl-tests-master# 
root@user:/home/nccl-tests-master# 
root@user:/home/nccl-tests-master# sudo ibping -S -C mlx5_0 -P 1
^C
root@user:/home/nccl-tests-master# sudo ibping -S -C mlx5_2 -P 1
^C
root@user:/home/nccl-tests-master# 
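
The ibping summaries above can be condensed with a small parser. This is a sketch with invented names, not part of any tool. Note that ibping exchanges vendor-specific management datagrams, so 0% loss shows the ports can reach each other over the fabric; it does not by itself exercise the RDMA verbs path that NCCL's IB transport uses.

```python
import re

STATS = re.compile(
    r"(\d+) packets transmitted, (\d+) received, (\d+)% packet loss, time (\d+) ms"
)
RTT = re.compile(r"rtt min/avg/max = ([\d.]+)/([\d.]+)/([\d.]+) ms")


def summarize(report):
    """Parse one ibping statistics block into a dictionary."""
    sent, received, loss, elapsed_ms = map(int, STATS.search(report).groups())
    rtt_min, rtt_avg, rtt_max = map(float, RTT.search(report).groups())
    return {
        "sent": sent,
        "received": received,
        "loss_pct": loss,
        "elapsed_ms": elapsed_ms,
        "rtt_min_ms": rtt_min,
        "rtt_avg_ms": rtt_avg,
        "rtt_max_ms": rtt_max,
    }


# First statistics block from the node1 session above.
REPORT = """--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7036 ms
rtt min/avg/max = 0.031/0.703/900.078 ms
"""

if __name__ == "__main__":
    print(summarize(REPORT))
```

Every run above reports a maximum round trip of about 900 ms against an average below 1 ms, which may be worth a second look. A verbs-level benchmark such as `ib_write_bw` from the perftest package would be a closer match to what NCCL does.
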

NCCL_IB_DISABLE=1 detailed log:

root@user:/home/nccl-tests-master# mpirun --allow-run-as-root -np 16 --hostfile mpi_hosts -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=1  ./build/all_reduce_perf -b 128M -e 512M -f 2
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              user
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   user
  Local device: mlx5_0
--------------------------------------------------------------------------
# nThread 1 nGpus 1 minBytes 134217728 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 852750 on       user device  0 [0x27] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 852751 on       user device  1 [0x2a] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 852752 on       user device  2 [0x51] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 852753 on       user device  3 [0x57] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 852754 on       user device  4 [0x9e] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 852755 on       user device  5 [0xa4] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 852756 on       user device  6 [0xc7] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 852757 on       user device  7 [0xca] NVIDIA A100-SXM4-80GB
#  Rank  8 Group  0 Pid 169125 on       user device  0 [0x27] NVIDIA A100-SXM4-80GB
#  Rank  9 Group  0 Pid 169126 on       user device  1 [0x2a] NVIDIA A100-SXM4-80GB
#  Rank 10 Group  0 Pid 169127 on       user device  2 [0x51] NVIDIA A100-SXM4-80GB
#  Rank 11 Group  0 Pid 169128 on       user device  3 [0x57] NVIDIA A100-SXM4-80GB
#  Rank 12 Group  0 Pid 169129 on       user device  4 [0x9e] NVIDIA A100-SXM4-80GB
#  Rank 13 Group  0 Pid 169130 on       user device  5 [0xa4] NVIDIA A100-SXM4-80GB
#  Rank 14 Group  0 Pid 169131 on       user device  6 [0xc7] NVIDIA A100-SXM4-80GB
#  Rank 15 Group  0 Pid 169132 on       user device  7 [0xca] NVIDIA A100-SXM4-80GB
user:852750:852750 [0] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852750:852750 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852750:852750 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852750:852750 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.1+cuda12.1
user:852755:852755 [5] NCCL INFO cudaDriverVersion 12020
user:852755:852755 [5] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852755:852755 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852755:852755 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852757:852757 [7] NCCL INFO cudaDriverVersion 12020
user:852757:852757 [7] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852757:852757 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852757:852757 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852756:852756 [6] NCCL INFO cudaDriverVersion 12020
user:852756:852756 [6] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852756:852756 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852756:852756 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852752:852752 [2] NCCL INFO cudaDriverVersion 12020
user:852752:852752 [2] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852752:852752 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852752:852752 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852751:852751 [1] NCCL INFO cudaDriverVersion 12020
user:852751:852751 [1] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852751:852751 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852751:852751 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852753:852753 [3] NCCL INFO cudaDriverVersion 12020
user:852753:852753 [3] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852753:852753 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852753:852753 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169128:169128 [3] NCCL INFO cudaDriverVersion 12020
user:852754:852754 [4] NCCL INFO cudaDriverVersion 12020
user:852754:852754 [4] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852754:852754 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852754:852754 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169128:169128 [3] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169128:169128 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169128:169128 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169126:169126 [1] NCCL INFO cudaDriverVersion 12020
user:169126:169126 [1] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169126:169126 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169126:169126 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169132:169132 [7] NCCL INFO cudaDriverVersion 12020
user:169132:169132 [7] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169132:169132 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169132:169132 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169127:169127 [2] NCCL INFO cudaDriverVersion 12020
user:169127:169127 [2] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169127:169127 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169127:169127 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169130:169130 [5] NCCL INFO cudaDriverVersion 12020
user:169130:169130 [5] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169130:169130 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169130:169130 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169125:169125 [0] NCCL INFO cudaDriverVersion 12020
user:169125:169125 [0] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169125:169125 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169125:169125 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169129:169129 [4] NCCL INFO cudaDriverVersion 12020
user:169129:169129 [4] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169129:169129 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169129:169129 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169131:169131 [6] NCCL INFO cudaDriverVersion 12020
user:169131:169131 [6] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169131:169131 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169131:169131 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
[user:852715] 15 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[user:852715] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[user:852715] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
user:852750:852802 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852750:852802 [0] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852750:852802 [0] NCCL INFO Using network Socket
user:852751:852807 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852751:852807 [1] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852751:852807 [1] NCCL INFO Using network Socket
user:852753:852808 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852753:852808 [3] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852753:852808 [3] NCCL INFO Using network Socket
user:852757:852804 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852757:852804 [7] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852757:852804 [7] NCCL INFO Using network Socket
user:852755:852803 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852755:852803 [5] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852755:852803 [5] NCCL INFO Using network Socket
user:169126:169178 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169126:169178 [1] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169126:169178 [1] NCCL INFO Using network Socket
user:169132:169177 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169132:169177 [7] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169132:169177 [7] NCCL INFO Using network Socket
user:169130:169180 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169130:169180 [5] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169130:169180 [5] NCCL INFO Using network Socket
user:169127:169179 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169127:169179 [2] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169127:169179 [2] NCCL INFO Using network Socket
user:169128:169176 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169128:169176 [3] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169128:169176 [3] NCCL INFO Using network Socket
user:852752:852806 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852752:852806 [2] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852752:852806 [2] NCCL INFO Using network Socket
user:169125:169181 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169125:169181 [0] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169125:169181 [0] NCCL INFO Using network Socket
user:852754:852809 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852754:852809 [4] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852754:852809 [4] NCCL INFO Using network Socket
user:169129:169182 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169129:169182 [4] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169129:169182 [4] NCCL INFO Using network Socket
user:852756:852805 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852756:852805 [6] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852756:852805 [6] NCCL INFO Using network Socket
user:169131:169183 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169131:169183 [6] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169131:169183 [6] NCCL INFO Using network Socket
user:169128:169176 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:169131:169183 [6] NCCL INFO NVLS multicast support is not available on dev 6
user:852756:852805 [6] NCCL INFO NVLS multicast support is not available on dev 6
user:169126:169178 [1] NCCL INFO NVLS multicast support is not available on dev 1
user:852755:852803 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:852755:852803 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:852751:852807 [1] NCCL INFO NVLS multicast support is not available on dev 1
user:852750:852802 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:852750:852802 [0] NCCL INFO NVLS multicast support is not available on dev 0
user:852753:852808 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:852757:852804 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:852757:852804 [7] NCCL INFO NVLS multicast support is not available on dev 7
user:852752:852806 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:852752:852806 [2] NCCL INFO NVLS multicast support is not available on dev 2
user:852754:852809 [4] NCCL INFO NVLS multicast support is not available on dev 4
user:169132:169177 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:169132:169177 [7] NCCL INFO NVLS multicast support is not available on dev 7
user:169125:169181 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:169125:169181 [0] NCCL INFO NVLS multicast support is not available on dev 0
user:169129:169182 [4] NCCL INFO NVLS multicast support is not available on dev 4
user:169130:169180 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:169130:169180 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:169127:169179 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:169127:169179 [2] NCCL INFO NVLS multicast support is not available on dev 2
user:852751:852807 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
user:852751:852807 [1] NCCL INFO P2P Chunksize set to 131072
user:852755:852803 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
user:852755:852803 [5] NCCL INFO P2P Chunksize set to 131072
user:852757:852804 [7] NCCL INFO Trees [0] 0/-1/-1->7->6 [1] 0/-1/-1->7->6
user:852757:852804 [7] NCCL INFO P2P Chunksize set to 131072
user:852750:852802 [0] NCCL INFO Channel 00/02 :    0   7   6   5   4   3   2   1   8   9  10  11  12  13  14  15
user:852750:852802 [0] NCCL INFO Channel 01/02 :    0   7   6   5   4   3   2   1   8   9  10  11  12  13  14  15
user:852750:852802 [0] NCCL INFO Trees [0] 1/-1/-1->0->7 [1] 1/-1/-1->0->7
user:852750:852802 [0] NCCL INFO P2P Chunksize set to 131072
user:852753:852808 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
user:852753:852808 [3] NCCL INFO P2P Chunksize set to 131072
user:852754:852809 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
user:852754:852809 [4] NCCL INFO P2P Chunksize set to 131072
user:852756:852805 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
user:852756:852805 [6] NCCL INFO P2P Chunksize set to 131072
user:852752:852806 [2] NCCL INFO Trees [0] 3/10/-1->2->-1 [1] 3/-1/-1->2->10
user:852752:852806 [2] NCCL INFO P2P Chunksize set to 131072
user:169128:169176 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10
user:169128:169176 [3] NCCL INFO P2P Chunksize set to 131072
user:169129:169182 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11
user:169129:169182 [4] NCCL INFO P2P Chunksize set to 131072
user:169132:169177 [7] NCCL INFO Trees [0] 8/-1/-1->15->14 [1] 8/-1/-1->15->14
user:169132:169177 [7] NCCL INFO P2P Chunksize set to 131072
user:169125:169181 [0] NCCL INFO Trees [0] 9/-1/-1->8->15 [1] 9/-1/-1->8->15
user:169125:169181 [0] NCCL INFO P2P Chunksize set to 131072
user:169127:169179 [2] NCCL INFO Trees [0] 11/-1/-1->10->2 [1] 11/2/-1->10->-1
user:169127:169179 [2] NCCL INFO P2P Chunksize set to 131072
user:169126:169178 [1] NCCL INFO Trees [0] -1/-1/-1->9->8 [1] -1/-1/-1->9->8
user:169126:169178 [1] NCCL INFO P2P Chunksize set to 131072
user:169130:169180 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12
user:169130:169180 [5] NCCL INFO P2P Chunksize set to 131072
user:169131:169183 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13
user:169131:169183 [6] NCCL INFO P2P Chunksize set to 131072
user:169125:169181 [0] NCCL INFO Channel 00/0 : 8[27000] -> 9[2a000] via P2P/IPC/read
user:169125:169181 [0] NCCL INFO Channel 01/0 : 8[27000] -> 9[2a000] via P2P/IPC/read
user:852752:852806 [2] NCCL INFO Channel 00/0 : 2[51000] -> 1[2a000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Channel 00/0 : 3[57000] -> 2[51000] via P2P/IPC/read
user:852755:852803 [5] NCCL INFO Channel 00/0 : 5[a4000] -> 4[9e000] via P2P/IPC/read
user:852752:852806 [2] NCCL INFO Channel 01/0 : 2[51000] -> 1[2a000] via P2P/IPC/read
user:852756:852805 [6] NCCL INFO Channel 00/0 : 6[c7000] -> 5[a4000] via P2P/IPC/read
user:852754:852809 [4] NCCL INFO Channel 00/0 : 4[9e000] -> 3[57000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Channel 01/0 : 3[57000] -> 2[51000] via P2P/IPC/read
user:852755:852803 [5] NCCL INFO Channel 01/0 : 5[a4000] -> 4[9e000] via P2P/IPC/read
user:852756:852805 [6] NCCL INFO Channel 01/0 : 6[c7000] -> 5[a4000] via P2P/IPC/read
user:852754:852809 [4] NCCL INFO Channel 01/0 : 4[9e000] -> 3[57000] via P2P/IPC/read
user:169130:169180 [5] NCCL INFO Channel 00/0 : 13[a4000] -> 14[c7000] via P2P/IPC/read
user:169129:169182 [4] NCCL INFO Channel 00/0 : 12[9e000] -> 13[a4000] via P2P/IPC/read
user:169128:169176 [3] NCCL INFO Channel 00/0 : 11[57000] -> 12[9e000] via P2P/IPC/read
user:169131:169183 [6] NCCL INFO Channel 00/0 : 14[c7000] -> 15[ca000] via P2P/IPC/read
user:169130:169180 [5] NCCL INFO Channel 01/0 : 13[a4000] -> 14[c7000] via P2P/IPC/read
user:169128:169176 [3] NCCL INFO Channel 01/0 : 11[57000] -> 12[9e000] via P2P/IPC/read
user:169129:169182 [4] NCCL INFO Channel 01/0 : 12[9e000] -> 13[a4000] via P2P/IPC/read
user:169131:169183 [6] NCCL INFO Channel 01/0 : 14[c7000] -> 15[ca000] via P2P/IPC/read
user:169126:169178 [1] NCCL INFO Channel 00/0 : 9[2a000] -> 10[51000] via P2P/IPC/read
user:852750:852802 [0] NCCL INFO Channel 00/0 : 15[ca000] -> 0[27000] [receive] via NET/Socket/0
user:852751:852807 [1] NCCL INFO Channel 00/0 : 1[2a000] -> 8[27000] [send] via NET/Socket/0
user:169126:169178 [1] NCCL INFO Channel 01/0 : 9[2a000] -> 10[51000] via P2P/IPC/read
user:852750:852802 [0] NCCL INFO Channel 01/0 : 15[ca000] -> 0[27000] [receive] via NET/Socket/0
user:852751:852807 [1] NCCL INFO Channel 01/0 : 1[2a000] -> 8[27000] [send] via NET/Socket/0
user:169127:169179 [2] NCCL INFO Channel 00/0 : 10[51000] -> 11[57000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Channel 00/0 : 15[ca000] -> 0[27000] [send] via NET/Socket/3
user:169132:169177 [7] NCCL INFO Channel 01/0 : 15[ca000] -> 0[27000] [send] via NET/Socket/3
user:852750:852802 [0] NCCL INFO Channel 00/0 : 0[27000] -> 7[ca000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Connected all rings
user:852755:852803 [5] NCCL INFO Connected all rings
user:852754:852809 [4] NCCL INFO Connected all rings
user:169130:169180 [5] NCCL INFO Connected all rings
user:852750:852802 [0] NCCL INFO Channel 01/0 : 0[27000] -> 7[ca000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Channel 01/0 : 10[51000] -> 11[57000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Channel 00/0 : 3[57000] -> 4[9e000] via P2P/IPC/read
user:852757:852804 [7] NCCL INFO Channel 00/0 : 7[ca000] -> 6[c7000] via P2P/IPC/read
user:852755:852803 [5] NCCL INFO Channel 00/0 : 5[a4000] -> 6[c7000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Channel 01/0 : 3[57000] -> 4[9e000] via P2P/IPC/read
user:852754:852809 [4] NCCL INFO Channel 00/0 : 4[9e000] -> 5[a4000] via P2P/IPC/read
user:169131:169183 [6] NCCL INFO Connected all rings
user:169130:169180 [5] NCCL INFO Channel 00/0 : 13[a4000] -> 12[9e000] via P2P/IPC/read
user:852757:852804 [7] NCCL INFO Channel 01/0 : 7[ca000] -> 6[c7000] via P2P/IPC/read
user:852754:852809 [4] NCCL INFO Channel 01/0 : 4[9e000] -> 5[a4000] via P2P/IPC/read
user:852755:852803 [5] NCCL INFO Channel 01/0 : 5[a4000] -> 6[c7000] via P2P/IPC/read
user:169125:169181 [0] NCCL INFO Channel 00/0 : 1[2a000] -> 8[27000] [receive] via NET/Socket/0
user:169130:169180 [5] NCCL INFO Channel 01/0 : 13[a4000] -> 12[9e000] via P2P/IPC/read
user:852756:852805 [6] NCCL INFO Connected all rings
user:852757:852804 [7] NCCL INFO Connected all rings
user:852754:852809 [4] NCCL INFO Connected all trees
user:852754:852809 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852754:852809 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169128:169176 [3] NCCL INFO Connected all rings
user:852756:852805 [6] NCCL INFO Channel 00/0 : 6[c7000] -> 7[ca000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Connected all rings
user:852756:852805 [6] NCCL INFO Channel 01/0 : 6[c7000] -> 7[ca000] via P2P/IPC/read
user:852757:852804 [7] NCCL INFO Channel 00/0 : 7[ca000] -> 0[27000] via P2P/IPC/read
user:169131:169183 [6] NCCL INFO Channel 00/0 : 14[c7000] -> 13[a4000] via P2P/IPC/read
user:169126:169178 [1] NCCL INFO Connected all rings
user:169126:169178 [1] NCCL INFO Channel 00/0 : 9[2a000] -> 8[27000] via P2P/IPC/read
user:852757:852804 [7] NCCL INFO Channel 01/0 : 7[ca000] -> 0[27000] via P2P/IPC/read
user:169129:169182 [4] NCCL INFO Connected all rings
user:852755:852803 [5] NCCL INFO Connected all trees
user:852755:852803 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852755:852803 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169129:169182 [4] NCCL INFO Channel 00/0 : 12[9e000] -> 11[57000] via P2P/IPC/read
user:852756:852805 [6] NCCL INFO Connected all trees
user:852756:852805 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852756:852805 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169129:169182 [4] NCCL INFO Channel 01/0 : 12[9e000] -> 11[57000] via P2P/IPC/read
user:169125:169181 [0] NCCL INFO Channel 01/0 : 1[2a000] -> 8[27000] [receive] via NET/Socket/0
user:169131:169183 [6] NCCL INFO Channel 01/0 : 14[c7000] -> 13[a4000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Channel 00/0 : 2[51000] -> 10[51000] [receive] via NET/Socket/0
user:169126:169178 [1] NCCL INFO Channel 01/0 : 9[2a000] -> 8[27000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Channel 01/0 : 2[51000] -> 10[51000] [receive] via NET/Socket/0
user:169128:169176 [3] NCCL INFO Channel 00/0 : 11[57000] -> 10[51000] via P2P/IPC/read
user:169128:169176 [3] NCCL INFO Channel 01/0 : 11[57000] -> 10[51000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Channel 00/0 : 10[51000] -> 2[51000] [send] via NET/Socket/0
user:169127:169179 [2] NCCL INFO Channel 01/0 : 10[51000] -> 2[51000] [send] via NET/Socket/0
user:852752:852806 [2] NCCL INFO Connected all rings
user:852752:852806 [2] NCCL INFO Channel 00/0 : 2[51000] -> 3[57000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Connected all rings
user:852752:852806 [2] NCCL INFO Channel 01/0 : 2[51000] -> 3[57000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Connected all trees
user:852753:852808 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852753:852808 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852750:852802 [0] NCCL INFO Connected all rings
user:852750:852802 [0] NCCL INFO Channel 00/0 : 0[27000] -> 1[2a000] via P2P/IPC/read
user:169130:169180 [5] NCCL INFO Connected all trees
user:169130:169180 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169130:169180 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169129:169182 [4] NCCL INFO Connected all trees
user:169129:169182 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169129:169182 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852752:852806 [2] NCCL INFO Channel 00/0 : 10[51000] -> 2[51000] [receive] via NET/Socket/0
user:169125:169181 [0] NCCL INFO Connected all rings
user:169125:169181 [0] NCCL INFO Channel 00/0 : 8[27000] -> 15[ca000] via P2P/IPC/read
user:169125:169181 [0] NCCL INFO Channel 01/0 : 8[27000] -> 15[ca000] via P2P/IPC/read
user:852751:852807 [1] NCCL INFO Connected all rings
user:852750:852802 [0] NCCL INFO Channel 01/0 : 0[27000] -> 1[2a000] via P2P/IPC/read
user:852751:852807 [1] NCCL INFO Channel 00/0 : 1[2a000] -> 0[27000] via P2P/IPC/read
user:852751:852807 [1] NCCL INFO Channel 01/0 : 1[2a000] -> 0[27000] via P2P/IPC/read
user:852752:852806 [2] NCCL INFO Channel 01/0 : 10[51000] -> 2[51000] [receive] via NET/Socket/0
user:852757:852804 [7] NCCL INFO Connected all trees
user:852757:852804 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852757:852804 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852752:852806 [2] NCCL INFO Channel 00/0 : 2[51000] -> 10[51000] [send] via NET/Socket/0
user:852752:852806 [2] NCCL INFO Channel 01/0 : 2[51000] -> 10[51000] [send] via NET/Socket/0
user:852751:852807 [1] NCCL INFO Connected all trees
user:852751:852807 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852751:852807 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852750:852802 [0] NCCL INFO Connected all trees
user:852750:852802 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852750:852802 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169128:169176 [3] NCCL INFO Connected all trees
user:169128:169176 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169128:169176 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852752:852806 [2] NCCL INFO Connected all trees
user:852752:852806 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852752:852806 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169132:169177 [7] NCCL INFO Channel 00/0 : 15[ca000] -> 8[27000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Channel 01/0 : 15[ca000] -> 8[27000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Channel 00/0 : 15[ca000] -> 14[c7000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Channel 01/0 : 15[ca000] -> 14[c7000] via P2P/IPC/read
user:169126:169178 [1] NCCL INFO Connected all trees
user:169126:169178 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169126:169178 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169125:169181 [0] NCCL INFO Connected all trees
user:169125:169181 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169125:169181 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169131:169183 [6] NCCL INFO Connected all trees
user:169131:169183 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169131:169183 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169132:169177 [7] NCCL INFO Connected all trees
user:169132:169177 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169132:169177 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852757:852804 [7] NCCL INFO comm 0x558c2343afa0 rank 7 nranks 16 cudaDev 7 busId ca000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852755:852803 [5] NCCL INFO comm 0x5591311fb270 rank 5 nranks 16 cudaDev 5 busId a4000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852754:852809 [4] NCCL INFO comm 0x555c22882030 rank 4 nranks 16 cudaDev 4 busId 9e000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852752:852806 [2] NCCL INFO comm 0x564976185500 rank 2 nranks 16 cudaDev 2 busId 51000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852753:852808 [3] NCCL INFO comm 0x5591b0312e30 rank 3 nranks 16 cudaDev 3 busId 57000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852756:852805 [6] NCCL INFO comm 0x563073d8e380 rank 6 nranks 16 cudaDev 6 busId c7000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852751:852807 [1] NCCL INFO comm 0x55f7e8a93c40 rank 1 nranks 16 cudaDev 1 busId 2a000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852750:852802 [0] NCCL INFO comm 0x5564a97c5bf0 rank 0 nranks 16 cudaDev 0 busId 27000 commId 0x86d61c6d254b547e - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
user:169127:169179 [2] NCCL INFO Connected all trees
user:169127:169179 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169127:169179 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169128:169176 [3] NCCL INFO comm 0x55728d3d7a70 rank 11 nranks 16 cudaDev 3 busId 57000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169131:169183 [6] NCCL INFO comm 0x56379ea7ca40 rank 14 nranks 16 cudaDev 6 busId c7000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169125:169181 [0] NCCL INFO comm 0x5579a287c1f0 rank 8 nranks 16 cudaDev 0 busId 27000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169127:169179 [2] NCCL INFO comm 0x559424090100 rank 10 nranks 16 cudaDev 2 busId 51000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169129:169182 [4] NCCL INFO comm 0x55f294ca5760 rank 12 nranks 16 cudaDev 4 busId 9e000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169126:169178 [1] NCCL INFO comm 0x55b843fe99f0 rank 9 nranks 16 cudaDev 1 busId 2a000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169130:169180 [5] NCCL INFO comm 0x564b53575070 rank 13 nranks 16 cudaDev 5 busId a4000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169132:169177 [7] NCCL INFO comm 0x56412adcd570 rank 15 nranks 16 cudaDev 7 busId ca000 commId 0x86d61c6d254b547e - Init COMPLETE
   134217728      33554432     float     sum      -1    42497    3.16    5.92      0    41758    3.21    6.03      0
   268435456      67108864     float     sum      -1    86999    3.09    5.79      0    86767    3.09    5.80      0
   536870912     134217728     float     sum      -1   175858    3.05    5.72      0   172967    3.10    5.82      0
user:852755:852755 [5] NCCL INFO comm 0x5591311fb270 rank 5 nranks 16 cudaDev 5 busId a4000 - Destroy COMPLETE
user:169126:169126 [1] NCCL INFO comm 0x55b843fe99f0 rank 9 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE
user:169128:169128 [3] NCCL INFO comm 0x55728d3d7a70 rank 11 nranks 16 cudaDev 3 busId 57000 - Destroy COMPLETE
user:169130:169130 [5] NCCL INFO comm 0x564b53575070 rank 13 nranks 16 cudaDev 5 busId a4000 - Destroy COMPLETE
user:169129:169129 [4] NCCL INFO comm 0x55f294ca5760 rank 12 nranks 16 cudaDev 4 busId 9e000 - Destroy COMPLETE
user:852756:852756 [6] NCCL INFO comm 0x563073d8e380 rank 6 nranks 16 cudaDev 6 busId c7000 - Destroy COMPLETE
user:852753:852753 [3] NCCL INFO comm 0x5591b0312e30 rank 3 nranks 16 cudaDev 3 busId 57000 - Destroy COMPLETE
user:169132:169132 [7] NCCL INFO comm 0x56412adcd570 rank 15 nranks 16 cudaDev 7 busId ca000 - Destroy COMPLETE
user:852757:852757 [7] NCCL INFO comm 0x558c2343afa0 rank 7 nranks 16 cudaDev 7 busId ca000 - Destroy COMPLETE
user:852751:852751 [1] NCCL INFO comm 0x55f7e8a93c40 rank 1 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE
user:169131:169131 [6] NCCL INFO comm 0x56379ea7ca40 rank 14 nranks 16 cudaDev 6 busId c7000 - Destroy COMPLETE
user:169127:169127 [2] NCCL INFO comm 0x559424090100 rank 10 nranks 16 cudaDev 2 busId 51000 - Destroy COMPLETE
user:852750:852750 [0] NCCL INFO comm 0x5564a97c5bf0 rank 0 nranks 16 cudaDev 0 busId 27000 - Destroy COMPLETE
user:852752:852752 [2] NCCL INFO comm 0x564976185500 rank 2 nranks 16 cudaDev 2 busId 51000 - Destroy COMPLETE
user:169125:169125 [0] NCCL INFO comm 0x5579a287c1f0 rank 8 nranks 16 cudaDev 0 busId 27000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 5.8464 
#
user:852754:852754 [4] NCCL INFO comm 0x555c22882030 rank 4 nranks 16 cudaDev 4 busId 9e000 - Destroy COMPLETE
sjeaugey commented 7 months ago

All I can see is that on one of the nodes you get:

user:16905:16951 [6] NCCL INFO NET/IB : No device found.

That tends to indicate the IB verbs library (libibverbs.so) is missing, or the interfaces are not forwarded to the container (if using a container). You should run ibv_devinfo to check the interfaces are up and running.
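To make that check repeatable across nodes, here is a minimal sketch that parses the text printed by ibv_devinfo and lists devices with no active port. The function names and the SAMPLE text are hypothetical illustrations, not output from this thread. The sketch assumes the usual `hca_id:` / `state:` layout of ibv_devinfo output.

```python
import re

def port_states(devinfo_text):
    """Map each HCA name to the list of port states found in ibv_devinfo text."""
    states = {}
    current = None
    for line in devinfo_text.splitlines():
        hca = re.match(r"\s*hca_id:\s*(\S+)", line)
        if hca:
            current = hca.group(1)
            states[current] = []
            continue
        # Anchored match, so "phys_state:" lines are not picked up here.
        state = re.match(r"\s*state:\s*(\S+)", line)
        if state and current is not None:
            states[current].append(state.group(1))
    return states

def inactive_devices(devinfo_text):
    """Return HCAs that report no ports, or at least one port not PORT_ACTIVE."""
    return sorted(
        name
        for name, ports in port_states(devinfo_text).items()
        if not ports or any(p != "PORT_ACTIVE" for p in ports)
    )

# Hypothetical sample, shaped like ibv_devinfo output (not taken from this issue).
SAMPLE = """\
hca_id: mlx5_0
        transport:                      InfiniBand (0)
                port:   1
                        state:                  PORT_ACTIVE (4)
hca_id: mlx5_1
        transport:                      InfiniBand (0)
                port:   1
                        state:                  PORT_DOWN (1)
"""

if __name__ == "__main__":
    print(inactive_devices(SAMPLE))
```

On a real node you would feed it the captured output of ibv_devinfo. An empty result from `port_states` means no device was listed at all, which would be consistent with the "No device found" message above.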

On that same node you seem to have IP over IB interfaces though:

user:16905:16951 [6] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>

so the socket transport will use those interfaces and get better than 100 MB/s.
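One way to confirm which transport a run actually used is to tally the `via ...` suffix of the channel lines in the NCCL_DEBUG=INFO output. The sketch below is illustrative only: the helper names are hypothetical, and the sample lines are copied from the logs in this thread.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... NCCL INFO Channel 00/0 : 1[2a000] -> 8[27000] [send] via NET/Socket/0
CHANNEL = re.compile(r"NCCL INFO Channel \d+/\d+ : .* via (\S+)")

def transport_counts(log_text):
    """Count channel connections per transport family (e.g. NET/Socket, P2P/IPC)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CHANNEL.search(line)
        if m:
            # Keep only the first two path components: NET/Socket/0 -> NET/Socket
            family = "/".join(m.group(1).split("/")[:2])
            counts[family] += 1
    return counts

def ib_missing_lines(log_text):
    """Return the log lines where NCCL reports that it found no IB device."""
    return [l for l in log_text.splitlines() if "NET/IB : No device found" in l]

SAMPLE = """\
user:16905:16951 [6] NCCL INFO NET/IB : No device found.
user:169125:169181 [0] NCCL INFO Channel 00/0 : 8[27000] -> 9[2a000] via P2P/IPC/read
user:852750:852802 [0] NCCL INFO Channel 00/0 : 15[ca000] -> 0[27000] [receive] via NET/Socket/0
user:852750:852802 [0] NCCL INFO Channel 00/02 :    0   7   6   5   4   3   2   1
"""

if __name__ == "__main__":
    print(dict(transport_counts(SAMPLE)))
    print(len(ib_missing_lines(SAMPLE)))
```

If inter-node channels show up under NET/Socket and none under NET/IB, the run fell back to sockets, as in the log above.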

MiyazonoKaori commented 7 months ago

I have fixed this issue, thanks @sjeaugey

The fix was to reinstall MLNX_OFED.

After rebooting, run these two commands:

/etc/init.d/openibd restart
systemctl restart opensmd