Closed MiyazonoKaori closed 7 months ago
This is basically two ranks complaining that they are not using the same transport: one is using sockets, the other is using IB. You can see this because some ranks are in net_ib.cc while others are in net_socket.cc.
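One way to confirm a mismatch like this is to tally which transport each rank selected in the `NCCL_DEBUG=INFO` output. The sketch below assumes the log contains lines of the form `... NCCL INFO Using network <name>`, as in the log further down; the function name `nccl_transports` and the file name `nccl.log` are illustrative, not part of NCCL.

```shell
# Count how many ranks selected each NCCL network transport.
# Reads NCCL_DEBUG=INFO output on stdin and prints "<transport> <count>" lines.
nccl_transports() {
  # Keep only the transport name that follows "Using network".
  sed -n 's/.*NCCL INFO Using network \([A-Za-z0-9_]*\).*/\1/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Example (hypothetical log file):
# nccl_transports < nccl.log
```

If the function prints more than one line, the ranks disagree on the transport, which is the situation described above.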
@sjeaugey How should I fix this error? Should I modify the environment variables, or reinstall NCCL? This is my network environment:
` root@user:/home/user# ibstat
CA 'mlx5_0'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.39.1002
	Hardware version: 0
	Node GUID: 0xe8ebd30300229550
	System image GUID: 0xe8ebd30300229550
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 9
		LMC: 0
		SM lid: 9
		Capability mask: 0xa651e84a
		Port GUID: 0xe8ebd30300229550
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.39.1002
	Hardware version: 0
	Node GUID: 0xe8ebd30300229551
	System image GUID: 0xe8ebd30300229550
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0xa651e848
		Port GUID: 0xe8ebd30300229551
		Link layer: InfiniBand
CA 'mlx5_2'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.39.1002
	Hardware version: 0
	Node GUID: 0xb83fd203001ed0e4
	System image GUID: 0xb83fd203001ed0e4
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 10
		LMC: 0
		SM lid: 9
		Capability mask: 0xa651e848
		Port GUID: 0xb83fd203001ed0e4
		Link layer: InfiniBand
CA 'mlx5_3'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.39.1002
	Hardware version: 0
	Node GUID: 0xb83fd203001ed0e5
	System image GUID: 0xb83fd203001ed0e4
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 65535
		LMC: 0
		SM lid: 0
		Capability mask: 0xa651e848
		Port GUID: 0xb83fd203001ed0e5
		Link layer: InfiniBand
CA 'mlx5_4'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.32.1010
	Hardware version: 0
	Node GUID: 0xb83fd20300283fda
	System image GUID: 0xb83fd20300283fda
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 2.5
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xba3fd2fffe283fda
		Link layer: Ethernet
CA 'mlx5_5'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.32.1010
	Hardware version: 0
	Node GUID: 0xb83fd20300283fdb
	System image GUID: 0xb83fd20300283fda
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xba3fd2fffe283fdb
		Link layer: Ethernet
root@user:/home/user# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
4: usb0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether e6:cf:fa:a4:5b:58 brd ff:ff:ff:ff:ff:ff
27: ens97f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b8:3f:d2:28:3f:da brd ff:ff:ff:ff:ff:ff
    inet 10.42.45.2/16 brd 10.42.255.255 scope global ens97f0np0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba3f:d2ff:fe28:3fda/64 scope link
       valid_lft forever preferred_lft forever
28: ens97f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:3f:d2:28:3f:db brd ff:ff:ff:ff:ff:ff
29: ibs85f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:06:8b:fe:80:00:00:00:00:00:00:e8:eb:d3:03:00:22:95:50 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.1.14/24 brd 192.168.1.255 scope global ibs85f0
       valid_lft forever preferred_lft forever
    inet6 fe80::eaeb:d303:22:9550/64 scope link
       valid_lft forever preferred_lft forever
30: ibs85f1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 00:00:11:49:fe:80:00:00:00:00:00:00:e8:eb:d3:03:00:22:95:51 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
31: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:10:49:fe:80:00:00:00:00:00:00:b8:3f:d2:03:00:1e:d0:e4 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.1.15/24 brd 192.168.1.255 scope global ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::ba3f:d203:1e:d0e4/64 scope link
       valid_lft forever preferred_lft forever
32: ib1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 00:00:11:49:fe:80:00:00:00:00:00:00:b8:3f:d2:03:00:1e:d0:e5 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
33: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:73:55:e8:80 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:73ff:fe55:e880/64 scope link
       valid_lft forever preferred_lft forever
35: veth3bb818d@if34: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether da:33:f7:94:b5:52 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::d833:f7ff:fe94:b552/64 scope link
       valid_lft forever preferred_lft forever
~/.bashrc
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda
export MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_0:9,mlx5_2:10
export NCCL_DEBUG=INFO
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
`
It looks like you want to use InfiniBand, so make sure your InfiniBand setup is working on both nodes. Otherwise, you could set NCCL_IB_DISABLE=1 to use sockets, but it will be much slower.
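When checking the InfiniBand setup, the `NCCL_IB_HCA` value may also be worth a second look. NCCL documents the format as `device:port`, the `ibstat` output above reports `Number of ports: 1` for every adapter, and 9 and 10 match the `Base lid` values of `mlx5_0` and `mlx5_2`. The sketch below assumes the `:9` and `:10` suffixes were meant as port numbers, which the thread does not confirm. The helper `check_ib_hca` is hypothetical and only flags entries whose port exceeds a given port count.

```shell
# Flag NCCL_IB_HCA entries whose ":port" suffix exceeds the adapter's port count.
# Hypothetical helper: $1 is the NCCL_IB_HCA value, $2 the port count taken from
# "Number of ports" in ibstat (defaults to 1). Runs in a subshell so that the
# IFS change does not leak into the caller.
check_ib_hca() (
  value=$1
  max_ports=${2:-1}
  IFS=,
  for entry in $value; do
    case $entry in
      *:*) port=${entry##*:} ;;   # text after the last colon
      *)   continue ;;            # no port given: nothing to check
    esac
    if [ "$port" -gt "$max_ports" ]; then
      echo "suspicious: $entry (port $port > $max_ports)"
    fi
  done
)

check_ib_hca "mlx5_0:9,mlx5_2:10" 1

# Under the assumption above, a value naming port 1 on both active adapters is:
# export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1
```

The same variable has to resolve to valid devices on both nodes, so it is worth running the check against each node's own `ibstat` output.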
@sjeaugey Yes, I want to use InfiniBand. Testing with ibping shows that the two nodes are connected. When I set NCCL_IB_DISABLE=1, nccl-tests works fine, but strangely its bandwidth is much higher than that of the fiber network (100 MB/s) and yet lower than InfiniBand's bandwidth (20 GB/s). This is very confusing to me, and I don't know what the problem is or how to fix it. Thank you for your help.
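A possible explanation for the in-between bandwidth, consistent with the `NCCL_IB_DISABLE=1` log below: the Socket transport reports using `ibs85f0` and `ib0`, which `ip a` lists as `link/infiniband` interfaces. If so, the TCP traffic travels over IP-over-InfiniBand rather than the Ethernet port, which would typically be faster than the Ethernet link but slower than native IB. The sketch below lists the interface sets NCCL picked so this can be checked per node; `socket_ifaces` and `nccl.log` are illustrative names.

```shell
# List the distinct interface sets selected by NCCL's Socket transport.
# Reads NCCL_DEBUG=INFO output on stdin.
socket_ifaces() {
  # 1) keep the text after "NET/Socket : Using "
  # 2) drop the "[n]" indices and the ":<ip><port>" suffixes
  sed -n 's/.*NCCL INFO NET\/Socket : Using //p' \
    | sed 's/\[[0-9]*\]//g; s/:[0-9.]*<[0-9]*>//g' \
    | sort -u
}

# Example (hypothetical log file):
# socket_ifaces < nccl.log
```

Each output line is one distinct set of interfaces; different sets on the two nodes show up as separate lines.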
ibping:
node1:
root@user:/home/user#
root@user:/home/user#
root@user:/home/user# sudo ibping -S -C mlx5_0 -P 1
^C
root@user:/home/user# sudo ibping -S -C mlx5_2 -P 1
^C
root@user:/home/user# sudo ibping -S -C mlx5_6 -P 1
^C
root@user:/home/user# sudo ibping -S -C mlx5_8 -P 1
^C
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 9
--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7036 ms
rtt min/avg/max = 0.031/0.703/900.078 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 9
--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7922 ms
rtt min/avg/max = 0.021/0.792/900.080 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_6 -P 1 -L 9
--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7940 ms
rtt min/avg/max = 0.032/0.793/900.080 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_8 -P 1 -L 9
--- user.(none) (Lid 9) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7938 ms
rtt min/avg/max = 0.033/0.793/900.078 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 10
--- user.(none) (Lid 10) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7046 ms
rtt min/avg/max = 0.031/0.704/900.080 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 10
--- user.(none) (Lid 10) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7931 ms
rtt min/avg/max = 0.026/0.793/900.078 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_6 -P 1 -L 10
--- user.(none) (Lid 10) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7981 ms
rtt min/avg/max = 0.027/0.798/900.085 ms
root@user:/home/user# sudo ibping -c 10000 -f -C mlx5_8 -P 1 -L 10
--- user.(none) (Lid 10) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7977 ms
rtt min/avg/max = 0.032/0.797/900.079 ms
root@user:/home/user#
node2:
root@user:/home/nccl-tests-master#
root@user:/home/nccl-tests-master#
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 1
--- user.(none) (Lid 1) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7967 ms
rtt min/avg/max = 0.034/0.796/900.086 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 1
--- user.(none) (Lid 1) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7953 ms
rtt min/avg/max = 0.033/0.795/900.086 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 2
--- user.(none) (Lid 2) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7063 ms
rtt min/avg/max = 0.032/0.706/900.084 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 2
--- user.(none) (Lid 2) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7064 ms
rtt min/avg/max = 0.034/0.706/900.081 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 3
--- user.(none) (Lid 3) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7072 ms
rtt min/avg/max = 0.031/0.707/900.080 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 3
--- user.(none) (Lid 3) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7067 ms
rtt min/avg/max = 0.025/0.706/900.082 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_2 -P 1 -L 4
--- user.(none) (Lid 4) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7973 ms
rtt min/avg/max = 0.020/0.797/900.086 ms
root@user:/home/nccl-tests-master# sudo ibping -c 10000 -f -C mlx5_0 -P 1 -L 4
--- user.(none) (Lid 4) ibping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 7067 ms
rtt min/avg/max = 0.031/0.706/900.087 ms
root@user:/home/nccl-tests-master#
root@user:/home/nccl-tests-master#
root@user:/home/nccl-tests-master# sudo ibping -S -C mlx5_0 -P 1
^C
root@user:/home/nccl-tests-master# sudo ibping -S -C mlx5_2 -P 1
^C
root@user:/home/nccl-tests-master#
NCCL_IB_DISABLE=1 detailed log:
root@user:/home/nccl-tests-master# mpirun --allow-run-as-root -np 16 --hostfile mpi_hosts -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=1 ./build/all_reduce_perf -b 128M -e 512M -f 2
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: user
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: user
Local device: mlx5_0
--------------------------------------------------------------------------
# nThread 1 nGpus 1 minBytes 134217728 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 852750 on user device 0 [0x27] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 852751 on user device 1 [0x2a] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 852752 on user device 2 [0x51] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 852753 on user device 3 [0x57] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 852754 on user device 4 [0x9e] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 852755 on user device 5 [0xa4] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 852756 on user device 6 [0xc7] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 852757 on user device 7 [0xca] NVIDIA A100-SXM4-80GB
# Rank 8 Group 0 Pid 169125 on user device 0 [0x27] NVIDIA A100-SXM4-80GB
# Rank 9 Group 0 Pid 169126 on user device 1 [0x2a] NVIDIA A100-SXM4-80GB
# Rank 10 Group 0 Pid 169127 on user device 2 [0x51] NVIDIA A100-SXM4-80GB
# Rank 11 Group 0 Pid 169128 on user device 3 [0x57] NVIDIA A100-SXM4-80GB
# Rank 12 Group 0 Pid 169129 on user device 4 [0x9e] NVIDIA A100-SXM4-80GB
# Rank 13 Group 0 Pid 169130 on user device 5 [0xa4] NVIDIA A100-SXM4-80GB
# Rank 14 Group 0 Pid 169131 on user device 6 [0xc7] NVIDIA A100-SXM4-80GB
# Rank 15 Group 0 Pid 169132 on user device 7 [0xca] NVIDIA A100-SXM4-80GB
user:852750:852750 [0] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852750:852750 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852750:852750 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852750:852750 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.1+cuda12.1
user:852755:852755 [5] NCCL INFO cudaDriverVersion 12020
user:852755:852755 [5] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852755:852755 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852755:852755 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852757:852757 [7] NCCL INFO cudaDriverVersion 12020
user:852757:852757 [7] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852757:852757 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852757:852757 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852756:852756 [6] NCCL INFO cudaDriverVersion 12020
user:852756:852756 [6] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852756:852756 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852756:852756 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852752:852752 [2] NCCL INFO cudaDriverVersion 12020
user:852752:852752 [2] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852752:852752 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852752:852752 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852751:852751 [1] NCCL INFO cudaDriverVersion 12020
user:852751:852751 [1] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852751:852751 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852751:852751 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:852753:852753 [3] NCCL INFO cudaDriverVersion 12020
user:852753:852753 [3] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852753:852753 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852753:852753 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169128:169128 [3] NCCL INFO cudaDriverVersion 12020
user:852754:852754 [4] NCCL INFO cudaDriverVersion 12020
user:852754:852754 [4] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.14<0>
user:852754:852754 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:852754:852754 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169128:169128 [3] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169128:169128 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169128:169128 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169126:169126 [1] NCCL INFO cudaDriverVersion 12020
user:169126:169126 [1] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169126:169126 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169126:169126 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169132:169132 [7] NCCL INFO cudaDriverVersion 12020
user:169132:169132 [7] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169132:169132 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169132:169132 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169127:169127 [2] NCCL INFO cudaDriverVersion 12020
user:169127:169127 [2] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169127:169127 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169127:169127 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169130:169130 [5] NCCL INFO cudaDriverVersion 12020
user:169130:169130 [5] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169130:169130 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169130:169130 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169125:169125 [0] NCCL INFO cudaDriverVersion 12020
user:169125:169125 [0] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169125:169125 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169125:169125 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169129:169129 [4] NCCL INFO cudaDriverVersion 12020
user:169129:169129 [4] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169129:169129 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169129:169129 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:169131:169131 [6] NCCL INFO cudaDriverVersion 12020
user:169131:169131 [6] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:169131:169131 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:169131:169131 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
[user:852715] 15 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[user:852715] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[user:852715] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
user:852750:852802 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852750:852802 [0] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852750:852802 [0] NCCL INFO Using network Socket
user:852751:852807 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852751:852807 [1] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852751:852807 [1] NCCL INFO Using network Socket
user:852753:852808 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852753:852808 [3] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852753:852808 [3] NCCL INFO Using network Socket
user:852757:852804 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852757:852804 [7] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852757:852804 [7] NCCL INFO Using network Socket
user:852755:852803 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852755:852803 [5] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852755:852803 [5] NCCL INFO Using network Socket
user:169126:169178 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169126:169178 [1] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169126:169178 [1] NCCL INFO Using network Socket
user:169132:169177 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169132:169177 [7] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169132:169177 [7] NCCL INFO Using network Socket
user:169130:169180 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169130:169180 [5] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169130:169180 [5] NCCL INFO Using network Socket
user:169127:169179 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169127:169179 [2] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169127:169179 [2] NCCL INFO Using network Socket
user:169128:169176 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169128:169176 [3] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169128:169176 [3] NCCL INFO Using network Socket
user:852752:852806 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852752:852806 [2] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852752:852806 [2] NCCL INFO Using network Socket
user:169125:169181 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169125:169181 [0] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169125:169181 [0] NCCL INFO Using network Socket
user:852754:852809 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852754:852809 [4] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852754:852809 [4] NCCL INFO Using network Socket
user:169129:169182 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169129:169182 [4] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169129:169182 [4] NCCL INFO Using network Socket
user:852756:852805 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:852756:852805 [6] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.14<0> [1]ib0:192.168.1.15<0>
user:852756:852805 [6] NCCL INFO Using network Socket
user:169131:169183 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
user:169131:169183 [6] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
user:169131:169183 [6] NCCL INFO Using network Socket
user:169128:169176 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:169131:169183 [6] NCCL INFO NVLS multicast support is not available on dev 6
user:852756:852805 [6] NCCL INFO NVLS multicast support is not available on dev 6
user:169126:169178 [1] NCCL INFO NVLS multicast support is not available on dev 1
user:852755:852803 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:852755:852803 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:852751:852807 [1] NCCL INFO NVLS multicast support is not available on dev 1
user:852750:852802 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:852750:852802 [0] NCCL INFO NVLS multicast support is not available on dev 0
user:852753:852808 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:852757:852804 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:852757:852804 [7] NCCL INFO NVLS multicast support is not available on dev 7
user:852752:852806 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:852752:852806 [2] NCCL INFO NVLS multicast support is not available on dev 2
user:852754:852809 [4] NCCL INFO NVLS multicast support is not available on dev 4
user:169132:169177 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:169132:169177 [7] NCCL INFO NVLS multicast support is not available on dev 7
user:169125:169181 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:169125:169181 [0] NCCL INFO NVLS multicast support is not available on dev 0
user:169129:169182 [4] NCCL INFO NVLS multicast support is not available on dev 4
user:169130:169180 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:169130:169180 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:169127:169179 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:169127:169179 [2] NCCL INFO NVLS multicast support is not available on dev 2
user:852751:852807 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
user:852751:852807 [1] NCCL INFO P2P Chunksize set to 131072
user:852755:852803 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
user:852755:852803 [5] NCCL INFO P2P Chunksize set to 131072
user:852757:852804 [7] NCCL INFO Trees [0] 0/-1/-1->7->6 [1] 0/-1/-1->7->6
user:852757:852804 [7] NCCL INFO P2P Chunksize set to 131072
user:852750:852802 [0] NCCL INFO Channel 00/02 : 0 7 6 5 4 3 2 1 8 9 10 11 12 13 14 15
user:852750:852802 [0] NCCL INFO Channel 01/02 : 0 7 6 5 4 3 2 1 8 9 10 11 12 13 14 15
user:852750:852802 [0] NCCL INFO Trees [0] 1/-1/-1->0->7 [1] 1/-1/-1->0->7
user:852750:852802 [0] NCCL INFO P2P Chunksize set to 131072
user:852753:852808 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
user:852753:852808 [3] NCCL INFO P2P Chunksize set to 131072
user:852754:852809 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
user:852754:852809 [4] NCCL INFO P2P Chunksize set to 131072
user:852756:852805 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
user:852756:852805 [6] NCCL INFO P2P Chunksize set to 131072
user:852752:852806 [2] NCCL INFO Trees [0] 3/10/-1->2->-1 [1] 3/-1/-1->2->10
user:852752:852806 [2] NCCL INFO P2P Chunksize set to 131072
user:169128:169176 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10
user:169128:169176 [3] NCCL INFO P2P Chunksize set to 131072
user:169129:169182 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11
user:169129:169182 [4] NCCL INFO P2P Chunksize set to 131072
user:169132:169177 [7] NCCL INFO Trees [0] 8/-1/-1->15->14 [1] 8/-1/-1->15->14
user:169132:169177 [7] NCCL INFO P2P Chunksize set to 131072
user:169125:169181 [0] NCCL INFO Trees [0] 9/-1/-1->8->15 [1] 9/-1/-1->8->15
user:169125:169181 [0] NCCL INFO P2P Chunksize set to 131072
user:169127:169179 [2] NCCL INFO Trees [0] 11/-1/-1->10->2 [1] 11/2/-1->10->-1
user:169127:169179 [2] NCCL INFO P2P Chunksize set to 131072
user:169126:169178 [1] NCCL INFO Trees [0] -1/-1/-1->9->8 [1] -1/-1/-1->9->8
user:169126:169178 [1] NCCL INFO P2P Chunksize set to 131072
user:169130:169180 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12
user:169130:169180 [5] NCCL INFO P2P Chunksize set to 131072
user:169131:169183 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13
user:169131:169183 [6] NCCL INFO P2P Chunksize set to 131072
user:169125:169181 [0] NCCL INFO Channel 00/0 : 8[27000] -> 9[2a000] via P2P/IPC/read
user:169125:169181 [0] NCCL INFO Channel 01/0 : 8[27000] -> 9[2a000] via P2P/IPC/read
user:852752:852806 [2] NCCL INFO Channel 00/0 : 2[51000] -> 1[2a000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Channel 00/0 : 3[57000] -> 2[51000] via P2P/IPC/read
user:852755:852803 [5] NCCL INFO Channel 00/0 : 5[a4000] -> 4[9e000] via P2P/IPC/read
user:852752:852806 [2] NCCL INFO Channel 01/0 : 2[51000] -> 1[2a000] via P2P/IPC/read
user:852756:852805 [6] NCCL INFO Channel 00/0 : 6[c7000] -> 5[a4000] via P2P/IPC/read
user:852754:852809 [4] NCCL INFO Channel 00/0 : 4[9e000] -> 3[57000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Channel 01/0 : 3[57000] -> 2[51000] via P2P/IPC/read
user:852755:852803 [5] NCCL INFO Channel 01/0 : 5[a4000] -> 4[9e000] via P2P/IPC/read
user:852756:852805 [6] NCCL INFO Channel 01/0 : 6[c7000] -> 5[a4000] via P2P/IPC/read
user:852754:852809 [4] NCCL INFO Channel 01/0 : 4[9e000] -> 3[57000] via P2P/IPC/read
user:169130:169180 [5] NCCL INFO Channel 00/0 : 13[a4000] -> 14[c7000] via P2P/IPC/read
user:169129:169182 [4] NCCL INFO Channel 00/0 : 12[9e000] -> 13[a4000] via P2P/IPC/read
user:169128:169176 [3] NCCL INFO Channel 00/0 : 11[57000] -> 12[9e000] via P2P/IPC/read
user:169131:169183 [6] NCCL INFO Channel 00/0 : 14[c7000] -> 15[ca000] via P2P/IPC/read
user:169130:169180 [5] NCCL INFO Channel 01/0 : 13[a4000] -> 14[c7000] via P2P/IPC/read
user:169128:169176 [3] NCCL INFO Channel 01/0 : 11[57000] -> 12[9e000] via P2P/IPC/read
user:169129:169182 [4] NCCL INFO Channel 01/0 : 12[9e000] -> 13[a4000] via P2P/IPC/read
user:169131:169183 [6] NCCL INFO Channel 01/0 : 14[c7000] -> 15[ca000] via P2P/IPC/read
user:169126:169178 [1] NCCL INFO Channel 00/0 : 9[2a000] -> 10[51000] via P2P/IPC/read
user:852750:852802 [0] NCCL INFO Channel 00/0 : 15[ca000] -> 0[27000] [receive] via NET/Socket/0
user:852751:852807 [1] NCCL INFO Channel 00/0 : 1[2a000] -> 8[27000] [send] via NET/Socket/0
user:169126:169178 [1] NCCL INFO Channel 01/0 : 9[2a000] -> 10[51000] via P2P/IPC/read
user:852750:852802 [0] NCCL INFO Channel 01/0 : 15[ca000] -> 0[27000] [receive] via NET/Socket/0
user:852751:852807 [1] NCCL INFO Channel 01/0 : 1[2a000] -> 8[27000] [send] via NET/Socket/0
user:169127:169179 [2] NCCL INFO Channel 00/0 : 10[51000] -> 11[57000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Channel 00/0 : 15[ca000] -> 0[27000] [send] via NET/Socket/3
user:169132:169177 [7] NCCL INFO Channel 01/0 : 15[ca000] -> 0[27000] [send] via NET/Socket/3
user:852750:852802 [0] NCCL INFO Channel 00/0 : 0[27000] -> 7[ca000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Connected all rings
user:852755:852803 [5] NCCL INFO Connected all rings
user:852754:852809 [4] NCCL INFO Connected all rings
user:169130:169180 [5] NCCL INFO Connected all rings
user:852750:852802 [0] NCCL INFO Channel 01/0 : 0[27000] -> 7[ca000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Channel 01/0 : 10[51000] -> 11[57000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Channel 00/0 : 3[57000] -> 4[9e000] via P2P/IPC/read
user:852757:852804 [7] NCCL INFO Channel 00/0 : 7[ca000] -> 6[c7000] via P2P/IPC/read
user:852755:852803 [5] NCCL INFO Channel 00/0 : 5[a4000] -> 6[c7000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Channel 01/0 : 3[57000] -> 4[9e000] via P2P/IPC/read
user:852754:852809 [4] NCCL INFO Channel 00/0 : 4[9e000] -> 5[a4000] via P2P/IPC/read
user:169131:169183 [6] NCCL INFO Connected all rings
user:169130:169180 [5] NCCL INFO Channel 00/0 : 13[a4000] -> 12[9e000] via P2P/IPC/read
user:852757:852804 [7] NCCL INFO Channel 01/0 : 7[ca000] -> 6[c7000] via P2P/IPC/read
user:852754:852809 [4] NCCL INFO Channel 01/0 : 4[9e000] -> 5[a4000] via P2P/IPC/read
user:852755:852803 [5] NCCL INFO Channel 01/0 : 5[a4000] -> 6[c7000] via P2P/IPC/read
user:169125:169181 [0] NCCL INFO Channel 00/0 : 1[2a000] -> 8[27000] [receive] via NET/Socket/0
user:169130:169180 [5] NCCL INFO Channel 01/0 : 13[a4000] -> 12[9e000] via P2P/IPC/read
user:852756:852805 [6] NCCL INFO Connected all rings
user:852757:852804 [7] NCCL INFO Connected all rings
user:852754:852809 [4] NCCL INFO Connected all trees
user:852754:852809 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852754:852809 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169128:169176 [3] NCCL INFO Connected all rings
user:852756:852805 [6] NCCL INFO Channel 00/0 : 6[c7000] -> 7[ca000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Connected all rings
user:852756:852805 [6] NCCL INFO Channel 01/0 : 6[c7000] -> 7[ca000] via P2P/IPC/read
user:852757:852804 [7] NCCL INFO Channel 00/0 : 7[ca000] -> 0[27000] via P2P/IPC/read
user:169131:169183 [6] NCCL INFO Channel 00/0 : 14[c7000] -> 13[a4000] via P2P/IPC/read
user:169126:169178 [1] NCCL INFO Connected all rings
user:169126:169178 [1] NCCL INFO Channel 00/0 : 9[2a000] -> 8[27000] via P2P/IPC/read
user:852757:852804 [7] NCCL INFO Channel 01/0 : 7[ca000] -> 0[27000] via P2P/IPC/read
user:169129:169182 [4] NCCL INFO Connected all rings
user:852755:852803 [5] NCCL INFO Connected all trees
user:852755:852803 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852755:852803 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169129:169182 [4] NCCL INFO Channel 00/0 : 12[9e000] -> 11[57000] via P2P/IPC/read
user:852756:852805 [6] NCCL INFO Connected all trees
user:852756:852805 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852756:852805 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169129:169182 [4] NCCL INFO Channel 01/0 : 12[9e000] -> 11[57000] via P2P/IPC/read
user:169125:169181 [0] NCCL INFO Channel 01/0 : 1[2a000] -> 8[27000] [receive] via NET/Socket/0
user:169131:169183 [6] NCCL INFO Channel 01/0 : 14[c7000] -> 13[a4000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Channel 00/0 : 2[51000] -> 10[51000] [receive] via NET/Socket/0
user:169126:169178 [1] NCCL INFO Channel 01/0 : 9[2a000] -> 8[27000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Channel 01/0 : 2[51000] -> 10[51000] [receive] via NET/Socket/0
user:169128:169176 [3] NCCL INFO Channel 00/0 : 11[57000] -> 10[51000] via P2P/IPC/read
user:169128:169176 [3] NCCL INFO Channel 01/0 : 11[57000] -> 10[51000] via P2P/IPC/read
user:169127:169179 [2] NCCL INFO Channel 00/0 : 10[51000] -> 2[51000] [send] via NET/Socket/0
user:169127:169179 [2] NCCL INFO Channel 01/0 : 10[51000] -> 2[51000] [send] via NET/Socket/0
user:852752:852806 [2] NCCL INFO Connected all rings
user:852752:852806 [2] NCCL INFO Channel 00/0 : 2[51000] -> 3[57000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Connected all rings
user:852752:852806 [2] NCCL INFO Channel 01/0 : 2[51000] -> 3[57000] via P2P/IPC/read
user:852753:852808 [3] NCCL INFO Connected all trees
user:852753:852808 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852753:852808 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852750:852802 [0] NCCL INFO Connected all rings
user:852750:852802 [0] NCCL INFO Channel 00/0 : 0[27000] -> 1[2a000] via P2P/IPC/read
user:169130:169180 [5] NCCL INFO Connected all trees
user:169130:169180 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169130:169180 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169129:169182 [4] NCCL INFO Connected all trees
user:169129:169182 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169129:169182 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852752:852806 [2] NCCL INFO Channel 00/0 : 10[51000] -> 2[51000] [receive] via NET/Socket/0
user:169125:169181 [0] NCCL INFO Connected all rings
user:169125:169181 [0] NCCL INFO Channel 00/0 : 8[27000] -> 15[ca000] via P2P/IPC/read
user:169125:169181 [0] NCCL INFO Channel 01/0 : 8[27000] -> 15[ca000] via P2P/IPC/read
user:852751:852807 [1] NCCL INFO Connected all rings
user:852750:852802 [0] NCCL INFO Channel 01/0 : 0[27000] -> 1[2a000] via P2P/IPC/read
user:852751:852807 [1] NCCL INFO Channel 00/0 : 1[2a000] -> 0[27000] via P2P/IPC/read
user:852751:852807 [1] NCCL INFO Channel 01/0 : 1[2a000] -> 0[27000] via P2P/IPC/read
user:852752:852806 [2] NCCL INFO Channel 01/0 : 10[51000] -> 2[51000] [receive] via NET/Socket/0
user:852757:852804 [7] NCCL INFO Connected all trees
user:852757:852804 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852757:852804 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852752:852806 [2] NCCL INFO Channel 00/0 : 2[51000] -> 10[51000] [send] via NET/Socket/0
user:852752:852806 [2] NCCL INFO Channel 01/0 : 2[51000] -> 10[51000] [send] via NET/Socket/0
user:852751:852807 [1] NCCL INFO Connected all trees
user:852751:852807 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852751:852807 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852750:852802 [0] NCCL INFO Connected all trees
user:852750:852802 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852750:852802 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169128:169176 [3] NCCL INFO Connected all trees
user:169128:169176 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169128:169176 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852752:852806 [2] NCCL INFO Connected all trees
user:852752:852806 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:852752:852806 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169132:169177 [7] NCCL INFO Channel 00/0 : 15[ca000] -> 8[27000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Channel 01/0 : 15[ca000] -> 8[27000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Channel 00/0 : 15[ca000] -> 14[c7000] via P2P/IPC/read
user:169132:169177 [7] NCCL INFO Channel 01/0 : 15[ca000] -> 14[c7000] via P2P/IPC/read
user:169126:169178 [1] NCCL INFO Connected all trees
user:169126:169178 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169126:169178 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169125:169181 [0] NCCL INFO Connected all trees
user:169125:169181 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169125:169181 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169131:169183 [6] NCCL INFO Connected all trees
user:169131:169183 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169131:169183 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169132:169177 [7] NCCL INFO Connected all trees
user:169132:169177 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169132:169177 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:852757:852804 [7] NCCL INFO comm 0x558c2343afa0 rank 7 nranks 16 cudaDev 7 busId ca000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852755:852803 [5] NCCL INFO comm 0x5591311fb270 rank 5 nranks 16 cudaDev 5 busId a4000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852754:852809 [4] NCCL INFO comm 0x555c22882030 rank 4 nranks 16 cudaDev 4 busId 9e000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852752:852806 [2] NCCL INFO comm 0x564976185500 rank 2 nranks 16 cudaDev 2 busId 51000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852753:852808 [3] NCCL INFO comm 0x5591b0312e30 rank 3 nranks 16 cudaDev 3 busId 57000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852756:852805 [6] NCCL INFO comm 0x563073d8e380 rank 6 nranks 16 cudaDev 6 busId c7000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852751:852807 [1] NCCL INFO comm 0x55f7e8a93c40 rank 1 nranks 16 cudaDev 1 busId 2a000 commId 0x86d61c6d254b547e - Init COMPLETE
user:852750:852802 [0] NCCL INFO comm 0x5564a97c5bf0 rank 0 nranks 16 cudaDev 0 busId 27000 commId 0x86d61c6d254b547e - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
user:169127:169179 [2] NCCL INFO Connected all trees
user:169127:169179 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
user:169127:169179 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:169128:169176 [3] NCCL INFO comm 0x55728d3d7a70 rank 11 nranks 16 cudaDev 3 busId 57000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169131:169183 [6] NCCL INFO comm 0x56379ea7ca40 rank 14 nranks 16 cudaDev 6 busId c7000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169125:169181 [0] NCCL INFO comm 0x5579a287c1f0 rank 8 nranks 16 cudaDev 0 busId 27000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169127:169179 [2] NCCL INFO comm 0x559424090100 rank 10 nranks 16 cudaDev 2 busId 51000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169129:169182 [4] NCCL INFO comm 0x55f294ca5760 rank 12 nranks 16 cudaDev 4 busId 9e000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169126:169178 [1] NCCL INFO comm 0x55b843fe99f0 rank 9 nranks 16 cudaDev 1 busId 2a000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169130:169180 [5] NCCL INFO comm 0x564b53575070 rank 13 nranks 16 cudaDev 5 busId a4000 commId 0x86d61c6d254b547e - Init COMPLETE
user:169132:169177 [7] NCCL INFO comm 0x56412adcd570 rank 15 nranks 16 cudaDev 7 busId ca000 commId 0x86d61c6d254b547e - Init COMPLETE
134217728 33554432 float sum -1 42497 3.16 5.92 0 41758 3.21 6.03 0
268435456 67108864 float sum -1 86999 3.09 5.79 0 86767 3.09 5.80 0
536870912 134217728 float sum -1 175858 3.05 5.72 0 172967 3.10 5.82 0
user:852755:852755 [5] NCCL INFO comm 0x5591311fb270 rank 5 nranks 16 cudaDev 5 busId a4000 - Destroy COMPLETE
user:169126:169126 [1] NCCL INFO comm 0x55b843fe99f0 rank 9 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE
user:169128:169128 [3] NCCL INFO comm 0x55728d3d7a70 rank 11 nranks 16 cudaDev 3 busId 57000 - Destroy COMPLETE
user:169130:169130 [5] NCCL INFO comm 0x564b53575070 rank 13 nranks 16 cudaDev 5 busId a4000 - Destroy COMPLETE
user:169129:169129 [4] NCCL INFO comm 0x55f294ca5760 rank 12 nranks 16 cudaDev 4 busId 9e000 - Destroy COMPLETE
user:852756:852756 [6] NCCL INFO comm 0x563073d8e380 rank 6 nranks 16 cudaDev 6 busId c7000 - Destroy COMPLETE
user:852753:852753 [3] NCCL INFO comm 0x5591b0312e30 rank 3 nranks 16 cudaDev 3 busId 57000 - Destroy COMPLETE
user:169132:169132 [7] NCCL INFO comm 0x56412adcd570 rank 15 nranks 16 cudaDev 7 busId ca000 - Destroy COMPLETE
user:852757:852757 [7] NCCL INFO comm 0x558c2343afa0 rank 7 nranks 16 cudaDev 7 busId ca000 - Destroy COMPLETE
user:852751:852751 [1] NCCL INFO comm 0x55f7e8a93c40 rank 1 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE
user:169131:169131 [6] NCCL INFO comm 0x56379ea7ca40 rank 14 nranks 16 cudaDev 6 busId c7000 - Destroy COMPLETE
user:169127:169127 [2] NCCL INFO comm 0x559424090100 rank 10 nranks 16 cudaDev 2 busId 51000 - Destroy COMPLETE
user:852750:852750 [0] NCCL INFO comm 0x5564a97c5bf0 rank 0 nranks 16 cudaDev 0 busId 27000 - Destroy COMPLETE
user:852752:852752 [2] NCCL INFO comm 0x564976185500 rank 2 nranks 16 cudaDev 2 busId 51000 - Destroy COMPLETE
user:169125:169125 [0] NCCL INFO comm 0x5579a287c1f0 rank 8 nranks 16 cudaDev 0 busId 27000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 5.8464
#
user:852754:852754 [4] NCCL INFO comm 0x555c22882030 rank 4 nranks 16 cudaDev 4 busId 9e000 - Destroy COMPLETE
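For reference, the algbw and busbw columns in the table above can be reproduced from the size and time columns. This is a minimal sketch, assuming the run is an all-reduce test (the `sum` redop and root of -1 suggest it) and that the nccl-tests convention of busbw = algbw * 2 * (n - 1) / n applies; the function names are hypothetical and are not part of nccl-tests.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    # bytes per microsecond equals MB/s; divide by 1e3 to get GB/s
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbps(algbw: float, nranks: int) -> float:
    """Bus bandwidth for all-reduce, assuming the 2*(n-1)/n correction factor."""
    return algbw * 2 * (nranks - 1) / nranks


if __name__ == "__main__":
    # First out-of-place row of the table: 134217728 bytes in 42497 us on 16 ranks.
    algbw = algbw_gbps(134217728, 42497)
    busbw = allreduce_busbw_gbps(algbw, 16)
    print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

For the first row this lands on roughly the reported 3.16 and 5.92 GB/s, which is far below what a 200 Gb/s InfiniBand link should deliver and is consistent with traffic going over sockets.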
All I can see is that on one of the nodes you get:
user:16905:16951 [6] NCCL INFO NET/IB : No device found.
That tends to indicate the IB verbs library (libibverbs.so) is missing, or the interfaces are not forwarded to the container (if using a container). You should run `ibv_devinfo` to check that the interfaces are up and running.
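The `ibv_devinfo` check can be scripted so that it runs the same way on every node. The sketch below assumes the usual text layout of `ibv_devinfo` output (an `hca_id:` line per device followed by `state:` lines per port); the helper names are hypothetical, and the parsing is only an illustration, not an official tool.

```python
import re
import shutil
import subprocess


def parse_ibv_devinfo(text: str) -> dict:
    """Map each hca_id to the list of port states found under it."""
    devices = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*hca_id:\s*(\S+)", line)
        if m:
            current = m.group(1)
            devices[current] = []
            continue
        # Matches "state: PORT_ACTIVE (4)" but not "phys_state: ..."
        m = re.match(r"\s*state:\s*(\S+)", line)
        if m and current is not None:
            devices[current].append(m.group(1))
    return devices


def active_devices(devices: dict) -> list:
    """Names of devices with at least one port in PORT_ACTIVE."""
    return [name for name, states in devices.items() if "PORT_ACTIVE" in states]


def main() -> None:
    if shutil.which("ibv_devinfo") is None:
        print("ibv_devinfo not found; the verbs userspace tools may be missing")
        return
    result = subprocess.run(["ibv_devinfo"], capture_output=True, text=True)
    devices = parse_ibv_devinfo(result.stdout)
    if not devices:
        print("no IB devices reported; NCCL would likely fall back to sockets")
    for name in active_devices(devices):
        print(f"{name}: at least one active port")


if __name__ == "__main__":
    main()
```

If this reports no devices on one node while the others list active ports, that node is the one where NCCL prints `NET/IB : No device found.`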
On that same node you seem to have IP over IB interfaces though:
user:16905:16951 [6] NCCL INFO NET/Socket : Using [0]ibs85f0:192.168.1.10<0> [1]ib0:192.168.1.11<0> [2]ibs102f0:192.168.1.12<0> [3]ib2:192.168.1.13<0>
so the socket transport will use those interfaces and get better than 100 MB/s.
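To spot this kind of mismatch quickly, the `NCCL_DEBUG=INFO` output can be scanned for the network transport each process ends up using. This is a rough sketch that relies on the `via NET/<transport>/<n>` suffix seen in the channel lines of the log above; the function names are hypothetical.

```python
import re
from collections import defaultdict

# e.g. "user:852751:852807 [1] NCCL INFO Channel 01/0 : ... [send] via NET/Socket/0"
LINE_RE = re.compile(r"^(\S+?):(\d+):\d+ \[\d+\] NCCL INFO .*\bvia NET/(\w+)/\d+")


def transports_by_process(log_lines):
    """Return {(host, pid): set of NET transports} from NCCL INFO lines."""
    seen = defaultdict(set)
    for line in log_lines:
        m = LINE_RE.match(line)
        if m:
            host, pid, transport = m.groups()
            seen[(host, int(pid))].add(transport)
    return dict(seen)


def mixed_transports(seen):
    """Return the set of transports if more than one is in use, else an empty set."""
    used = set()
    for transports in seen.values():
        used |= transports
    return used if len(used) > 1 else set()


if __name__ == "__main__":
    import sys

    seen = transports_by_process(sys.stdin)
    for (host, pid), transports in sorted(seen.items()):
        print(f"{host}:{pid} -> {sorted(transports)}")
    if mixed_transports(seen):
        print("ranks disagree on the network transport")
```

Piping the combined log of all ranks into the script lists one line per process. In a healthy multi-node IB run every process would be expected to report the same transport.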
I have fixed this issue, thanks @sjeaugey. What I did:
1. Reinstall MLNX_OFED.
2. After the system starts up, run `/etc/init.d/openibd restart` and then `systemctl restart opensmd`.
Running on a single node, or with NCCL_IB_DISABLE=1 set, works correctly. Using IB (InfiniBand) results in the following error: