NVIDIA / nccl-tests

NCCL Tests

Running nccl-test on two nodes failed #133

Open zhangciba opened 1 year ago

zhangciba commented 1 year ago

I have two nodes, nodea and nodeb; each has 8 A800 GPUs.

nodea has 5 RoCE interfaces: xgbe0 for the CPU, xgbe2/4/6/8 for the GPUs.

nodeb has 5 RoCE interfaces: xgbe4 for the CPU, xgbe0/2/6/8 for the GPUs.

The network connectivity is as follows:

nodea xgbe0 <-> nodeb xgbe4
nodea xgbe2/6 <-> nodeb xgbe0/6
nodea xgbe4/8 <-> nodeb xgbe2/8

On nodea I set: export NCCL_SOCKET_IFNAME=xgbe4

On nodeb I set: export NCCL_SOCKET_IFNAME=xgbe0

Then I ran nccl-tests across the two nodes:

/home/openmpi/bin/mpirun --allow-run-as-root \
        --np ${node_nums} \
        --map-by node \
        --mca btl_tcp_if_exclude docker0,lo \
        --mca orte_base_help_aggregate 0 \
        --hostfile ./ip \
        -x NCCL_DEBUG=INFO \
        -x NCCL_IB_HCA=mlx5_6,mlx_8 \
        -x NCCL_IB_GID_INDEX=3 \
        -x NCCL_ALGO= \
        -x PATH \
        -x LD_LIBRARY_PATH \
        ./build/all_reduce_perf -b 1024 -e 1024M -f 2 -g 1 -t 8 -c 0 -n 20

Here node_nums=2, and the contents of the ip file are:

nodea
nodeb

This is the output I get:

# nThread 8 nGpus 1 minBytes 1024 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
#
# Using devices
#   Rank  0 Pid  26799 on nodea device  0 [0x10] NVIDIA A800-SXM4-80GB
#   Rank  1 Pid  26799 on nodea device  1 [0x16] NVIDIA A800-SXM4-80GB
#   Rank  2 Pid  26799 on nodea device  2 [0x49] NVIDIA A800-SXM4-80GB
#   Rank  3 Pid  26799 on nodea device  3 [0x4d] NVIDIA A800-SXM4-80GB
#   Rank  4 Pid  26799 on nodea device  4 [0x89] NVIDIA A800-SXM4-80GB
#   Rank  5 Pid  26799 on nodea device  5 [0x8e] NVIDIA A800-SXM4-80GB
#   Rank  6 Pid  26799 on nodea device  6 [0xc5] NVIDIA A800-SXM4-80GB
#   Rank  7 Pid  26799 on nodea device  7 [0xc9] NVIDIA A800-SXM4-80GB
#   Rank  8 Pid  65671 on nodeb device  0 [0x20] NVIDIA A800-SXM4-80GB
#   Rank  9 Pid  65671 on nodeb device  1 [0x26] NVIDIA A800-SXM4-80GB
#   Rank 10 Pid  65671 on nodeb device  2 [0x50] NVIDIA A800-SXM4-80GB
#   Rank 11 Pid  65671 on nodeb device  3 [0x55] NVIDIA A800-SXM4-80GB
#   Rank 12 Pid  65671 on nodeb device  4 [0x8d] NVIDIA A800-SXM4-80GB
#   Rank 13 Pid  65671 on nodeb device  5 [0x92] NVIDIA A800-SXM4-80GB
#   Rank 14 Pid  65671 on nodeb device  6 [0xc9] NVIDIA A800-SXM4-80GB
#   Rank 15 Pid  65671 on nodeb device  7 [0xcf] NVIDIA A800-SXM4-80GB
nodea:26799:26799 [0] NCCL INFO Bootstrap : Using xgbe4:nodea_ip<0>
nodea:26799:26799 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nodea:26799:26799 [0] NCCL INFO NET/IB : Using [0]mlx5_6:1/RoCE ; OOB xgbe4:nodea_ip<0>
nodea:26799:26799 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda11.0
nodeb:65671:65671 [0] NCCL INFO Bootstrap : Using xgbe0:nodeb_xgbe0_ip<0>
nodeb:65671:65671 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nodeb:65671:65671 [0] NCCL INFO NET/IB : Using [0]mlx5_6:1/RoCE ; OOB xgbe0:nodeb_xgbe0_ip<0>
nodeb:65671:65671 [0] NCCL INFO Using network IB
nodeb:65671:65860 [7] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
nodea:26799:27027 [2] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
nodeb:65671:65858 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12
nodeb:65671:65859 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13
nodeb:65671:65854 [4] NCCL INFO Trees [0] 13/-1/-1->12->4 [1] 13/4/-1->12->-1
nodeb:65671:65853 [3] NCCL INFO Trees [0] -1/-1/-1->11->10 [1] -1/-1/-1->11->10
nodea:26799:27025 [0] NCCL INFO Channel 00/02 :    0   7   6   5  12  11  10   9   8  15  14  13   4   3   2   1
nodea:26799:27025 [0] NCCL INFO Channel 01/02 :    0   7   6   5  12  11  10   9   8  15  14  13   4   3   2   1
nodea:26799:27025 [0] NCCL INFO Trees [0] 1/-1/-1->0->7 [1] 1/-1/-1->0->7
nodea:26799:27025 [0] NCCL INFO Setting affinity for GPU 0 to 01,00000000,00000001
nodeb:65671:65853 [3] NCCL INFO Setting affinity for GPU 3 to 01,00000000,00000001
nodeb:65671:65860 [7] NCCL INFO Trees [0] 8/-1/-1->15->14 [1] 8/-1/-1->15->14
nodea:26799:27025 [0] NCCL INFO Channel 00 : 0[10000] -> 7[c9000] via P2P/direct pointer/read
nodea:26799:27025 [0] NCCL INFO Channel 01 : 0[10000] -> 7[c9000] via P2P/direct pointer/read
nodea:26799:27026 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
nodea:26799:27026 [1] NCCL INFO Setting affinity for GPU 1 to 01,00000000,00000001
nodea:26799:27026 [1] NCCL INFO Channel 00 : 1[16000] -> 0[10000] via P2P/direct pointer/read
nodea:26799:27026 [1] NCCL INFO Channel 01 : 1[16000] -> 0[10000] via P2P/direct pointer/read
nodea:26799:27027 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
nodea:26799:27027 [2] NCCL INFO Setting affinity for GPU 2 to 01,00000000,00000001
nodea:26799:27027 [2] NCCL INFO Channel 00 : 2[49000] -> 1[16000] via P2P/direct pointer/read
nodea:26799:27027 [2] NCCL INFO Channel 01 : 2[49000] -> 1[16000] via P2P/direct pointer/read
nodea:26799:27028 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
nodea:26799:27028 [3] NCCL INFO Setting affinity for GPU 3 to 01,00000000,00000001
nodea:26799:27028 [3] NCCL INFO Channel 00 : 3[4d000] -> 2[49000] via P2P/direct pointer/read
nodea:26799:27028 [3] NCCL INFO Channel 01 : 3[4d000] -> 2[49000] via P2P/direct pointer/read
nodea:26799:27030 [4] NCCL INFO Trees [0] 5/12/-1->4->-1 [1] 5/-1/-1->4->12
nodea:26799:27030 [4] NCCL INFO Channel 00 : 13[92000] -> 4[89000] [receive] via NET/IB/0/GDRDMA
nodea:26799:27030 [4] NCCL INFO Channel 01 : 13[92000] -> 4[89000] [receive] via NET/IB/0/GDRDMA
nodea:26799:27027 [2] NCCL INFO Connected all rings
nodea:26799:27027 [2] NCCL INFO Channel 00 : 2[49000] -> 3[4d000] via P2P/direct pointer/read
nodea:26799:27027 [2] NCCL INFO Channel 01 : 2[49000] -> 3[4d000] via P2P/direct pointer/read
nodea:26799:27033 [7] NCCL INFO Trees [0] 0/-1/-1->7->6 [1] 0/-1/-1->7->6
nodeb:65671:65852 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->9
nodeb:65671:65852 [2] NCCL INFO Setting affinity for GPU 2 to 01,00000000,00000001
nodeb:65671:65850 [0] NCCL INFO Trees [0] 9/-1/-1->8->15 [1] 9/-1/-1->8->15
nodeb:65671:65850 [0] NCCL INFO Setting affinity for GPU 0 to 01,00000000,00000001
nodeb:65671:65851 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] 10/-1/-1->9->8
nodeb:65671:65851 [1] NCCL INFO Setting affinity for GPU 1 to 01,00000000,00000001
nodea:26799:27033 [7] NCCL INFO Channel 00 : 7[c9000] -> 6[c5000] via P2P/direct pointer/read
nodea:26799:27033 [7] NCCL INFO Channel 01 : 7[c9000] -> 6[c5000] via P2P/direct pointer/read
nodea:26799:27025 [0] NCCL INFO Connected all rings
nodea:26799:27025 [0] NCCL INFO Channel 00 : 0[10000] -> 1[16000] via P2P/direct pointer/read
nodea:26799:27025 [0] NCCL INFO Channel 01 : 0[10000] -> 1[16000] via P2P/direct pointer/read
nodea:26799:27026 [1] NCCL INFO Connected all rings
nodea:26799:27026 [1] NCCL INFO Channel 00 : 1[16000] -> 2[49000] via P2P/direct pointer/read
nodea:26799:27026 [1] NCCL INFO Channel 01 : 1[16000] -> 2[49000] via P2P/direct pointer/read
nodea:26799:27026 [1] NCCL INFO Connected all trees
nodea:26799:27026 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodea:26799:27026 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodea:26799:27032 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
nodea:26799:27032 [6] NCCL INFO Channel 00 : 6[c5000] -> 5[8e000] via P2P/direct pointer/read
nodea:26799:27032 [6] NCCL INFO Channel 01 : 6[c5000] -> 5[8e000] via P2P/direct pointer/read
nodea:26799:27031 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
nodeb:65671:65858 [5] NCCL INFO Channel 00 : 13[92000] -> 4[89000] [send] via NET/IB/0/GDRDMA
nodea:26799:27033 [7] NCCL INFO Connected all rings
nodea:26799:27031 [5] NCCL INFO Channel 00 : 5[8e000] -> 12[8d000] [send] via NET/IB/0/GDRDMA
nodea:26799:27031 [5] NCCL INFO Channel 01 : 5[8e000] -> 12[8d000] [send] via NET/IB/0/GDRDMA
nodeb:65671:65859 [6] NCCL INFO Channel 00 : 14[c9000] -> 13[92000] via P2P/direct pointer/read
nodeb:65671:65854 [4] NCCL INFO Channel 00 : 5[8e000] -> 12[8d000] [receive] via NET/IB/0/GDRDMA
nodeb:65671:65859 [6] NCCL INFO Channel 01 : 14[c9000] -> 13[92000] via P2P/direct pointer/read
nodeb:65671:65858 [5] NCCL INFO Channel 01 : 13[92000] -> 4[89000] [send] via NET/IB/0/GDRDMA
nodeb:65671:65850 [0] NCCL INFO Channel 00 : 8[20000] -> 15[cf000] via P2P/direct pointer/read
nodeb:65671:65853 [3] NCCL INFO Channel 00 : 11[55000] -> 10[50000] via P2P/direct pointer/read
nodeb:65671:65850 [0] NCCL INFO Channel 01 : 8[20000] -> 15[cf000] via P2P/direct pointer/read
nodea:26799:27030 [4] NCCL INFO Channel 00 : 4[89000] -> 3[4d000] via P2P/direct pointer/read
nodea:26799:27030 [4] NCCL INFO Channel 01 : 4[89000] -> 3[4d000] via P2P/direct pointer/read
nodeb:65671:65854 [4] NCCL INFO Channel 01 : 5[8e000] -> 12[8d000] [receive] via NET/IB/0/GDRDMA
nodea:26799:27030 [4] NCCL INFO Connected all rings
nodea:26799:27030 [4] NCCL INFO Channel 00 : 4[89000] -> 5[8e000] via P2P/direct pointer/read
nodea:26799:27030 [4] NCCL INFO Channel 01 : 4[89000] -> 5[8e000] via P2P/direct pointer/read
nodeb:65671:65851 [1] NCCL INFO Channel 00 : 9[26000] -> 8[20000] via P2P/direct pointer/read
nodea:26799:27028 [3] NCCL INFO Connected all rings
nodeb:65671:65853 [3] NCCL INFO Channel 01 : 11[55000] -> 10[50000] via P2P/direct pointer/read
nodea:26799:27028 [3] NCCL INFO Connected all trees
nodea:26799:27028 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodea:26799:27028 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65852 [2] NCCL INFO Channel 00 : 10[50000] -> 9[26000] via P2P/direct pointer/read
nodea:26799:27027 [2] NCCL INFO Connected all trees
nodea:26799:27027 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodea:26799:27027 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65852 [2] NCCL INFO Channel 01 : 10[50000] -> 9[26000] via P2P/direct pointer/read
nodeb:65671:65851 [1] NCCL INFO Channel 01 : 9[26000] -> 8[20000] via P2P/direct pointer/read
nodea:26799:27031 [5] NCCL INFO Connected all rings
nodea:26799:27032 [6] NCCL INFO Connected all rings
nodea:26799:27031 [5] NCCL INFO Channel 00 : 5[8e000] -> 6[c5000] via P2P/direct pointer/read
nodea:26799:27031 [5] NCCL INFO Channel 01 : 5[8e000] -> 6[c5000] via P2P/direct pointer/read
nodea:26799:27032 [6] NCCL INFO Channel 00 : 6[c5000] -> 7[c9000] via P2P/direct pointer/read
nodea:26799:27032 [6] NCCL INFO Channel 01 : 6[c5000] -> 7[c9000] via P2P/direct pointer/read
nodea:26799:27030 [4] NCCL INFO Channel 00 : 12[8d000] -> 4[89000] [receive] via NET/IB/0/GDRDMA
nodea:26799:27032 [6] NCCL INFO Connected all trees
nodea:26799:27032 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodea:26799:27032 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodea:26799:27033 [7] NCCL INFO Channel 00 : 7[c9000] -> 0[10000] via P2P/direct pointer/read
nodea:26799:27033 [7] NCCL INFO Channel 01 : 7[c9000] -> 0[10000] via P2P/direct pointer/read
nodea:26799:27033 [7] NCCL INFO Connected all trees
nodea:26799:27033 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodea:26799:27033 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodea:26799:27025 [0] NCCL INFO Connected all trees

nodea:26799:27025 [0] graph/tuning.cc:187 NCCL WARN CollNet is not supported or fails to initialize, ignoring NCCL_ALGO=COLLNET
nodea:26799:27025 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodea:26799:27025 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodea:26799:27031 [5] NCCL INFO Channel 00 : 5[8e000] -> 4[89000] via P2P/direct pointer/read
nodeb:65671:65852 [2] NCCL INFO Connected all rings
nodea:26799:27030 [4] NCCL INFO Channel 01 : 12[8d000] -> 4[89000] [receive] via NET/IB/0/GDRDMA
nodea:26799:27031 [5] NCCL INFO Channel 01 : 5[8e000] -> 4[89000] via P2P/direct pointer/read
nodea:26799:27030 [4] NCCL INFO Channel 00 : 4[89000] -> 12[8d000] [send] via NET/IB/0/GDRDMA
nodea:26799:27030 [4] NCCL INFO Channel 01 : 4[89000] -> 12[8d000] [send] via NET/IB/0/GDRDMA
nodeb:65671:65858 [5] NCCL INFO Connected all rings
nodeb:65671:65860 [7] NCCL INFO Channel 00 : 15[cf000] -> 14[c9000] via P2P/direct pointer/read
nodeb:65671:65854 [4] NCCL INFO Channel 00 : 12[8d000] -> 11[55000] via P2P/direct pointer/read
nodeb:65671:65850 [0] NCCL INFO Connected all rings
nodeb:65671:65851 [1] NCCL INFO Connected all rings
nodeb:65671:65852 [2] NCCL INFO Channel 00 : 10[50000] -> 11[55000] via P2P/direct pointer/read
nodeb:65671:65850 [0] NCCL INFO Channel 00 : 8[20000] -> 9[26000] via P2P/direct pointer/read
nodeb:65671:65854 [4] NCCL INFO Channel 01 : 12[8d000] -> 11[55000] via P2P/direct pointer/read
nodeb:65671:65860 [7] NCCL INFO Channel 01 : 15[cf000] -> 14[c9000] via P2P/direct pointer/read
nodeb:65671:65859 [6] NCCL INFO Connected all rings
nodeb:65671:65852 [2] NCCL INFO Channel 01 : 10[50000] -> 11[55000] via P2P/direct pointer/read
nodeb:65671:65853 [3] NCCL INFO Connected all rings
nodeb:65671:65860 [7] NCCL INFO Connected all rings
nodeb:65671:65858 [5] NCCL INFO Channel 00 : 13[92000] -> 14[c9000] via P2P/direct pointer/read
nodeb:65671:65850 [0] NCCL INFO Channel 01 : 8[20000] -> 9[26000] via P2P/direct pointer/read
nodeb:65671:65858 [5] NCCL INFO Channel 01 : 13[92000] -> 14[c9000] via P2P/direct pointer/read
nodeb:65671:65854 [4] NCCL INFO Connected all rings
nodeb:65671:65851 [1] NCCL INFO Channel 00 : 9[26000] -> 10[50000] via P2P/direct pointer/read
nodeb:65671:65853 [3] NCCL INFO Connected all trees
nodeb:65671:65853 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodeb:65671:65853 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65859 [6] NCCL INFO Channel 00 : 14[c9000] -> 15[cf000] via P2P/direct pointer/read
nodeb:65671:65859 [6] NCCL INFO Channel 01 : 14[c9000] -> 15[cf000] via P2P/direct pointer/read
nodeb:65671:65859 [6] NCCL INFO Connected all trees
nodeb:65671:65859 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodeb:65671:65859 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65851 [1] NCCL INFO Channel 01 : 9[26000] -> 10[50000] via P2P/direct pointer/read
nodeb:65671:65854 [4] NCCL INFO Channel 00 : 12[8d000] -> 13[92000] via P2P/direct pointer/read
nodeb:65671:65860 [7] NCCL INFO Channel 00 : 15[cf000] -> 8[20000] via P2P/direct pointer/read
nodeb:65671:65851 [1] NCCL INFO Connected all trees
nodeb:65671:65851 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodeb:65671:65851 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65852 [2] NCCL INFO Connected all trees
nodeb:65671:65852 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodeb:65671:65852 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65854 [4] NCCL INFO Channel 01 : 12[8d000] -> 13[92000] via P2P/direct pointer/read
nodeb:65671:65858 [5] NCCL INFO Channel 00 : 13[92000] -> 12[8d000] via P2P/direct pointer/read
nodeb:65671:65858 [5] NCCL INFO Channel 01 : 13[92000] -> 12[8d000] via P2P/direct pointer/read
nodeb:65671:65860 [7] NCCL INFO Channel 01 : 15[cf000] -> 8[20000] via P2P/direct pointer/read
nodeb:65671:65850 [0] NCCL INFO Connected all trees
nodeb:65671:65850 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodeb:65671:65850 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65860 [7] NCCL INFO Connected all trees
nodeb:65671:65860 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodeb:65671:65860 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65854 [4] NCCL INFO Channel 00 : 4[89000] -> 12[8d000] [receive] via NET/IB/0/GDRDMA
nodeb:65671:65854 [4] NCCL INFO Channel 01 : 4[89000] -> 12[8d000] [receive] via NET/IB/0/GDRDMA
nodeb:65671:65854 [4] NCCL INFO Channel 00 : 12[8d000] -> 4[89000] [send] via NET/IB/0/GDRDMA
nodeb:65671:65854 [4] NCCL INFO Channel 01 : 12[8d000] -> 4[89000] [send] via NET/IB/0/GDRDMA
nodea:26799:27031 [5] NCCL INFO Connected all trees
nodea:26799:27031 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodea:26799:27031 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodea:26799:27030 [4] NCCL INFO Connected all trees
nodea:26799:27030 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodea:26799:27030 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodea:26799:27031 [5] NCCL INFO comm 0x7fdcd8000e20 rank 5 nranks 16 cudaDev 5 busId 8e000 - Init COMPLETE
nodea:26799:27033 [7] NCCL INFO comm 0x7fdce0000e20 rank 7 nranks 16 cudaDev 7 busId c9000 - Init COMPLETE
nodea:26799:27028 [3] NCCL INFO comm 0x7fdce8000e20 rank 3 nranks 16 cudaDev 3 busId 4d000 - Init COMPLETE
nodea:26799:27032 [6] NCCL INFO comm 0x7fdcd4000e20 rank 6 nranks 16 cudaDev 6 busId c5000 - Init COMPLETE
nodea:26799:27027 [2] NCCL INFO comm 0x7fdcdc000e20 rank 2 nranks 16 cudaDev 2 busId 49000 - Init COMPLETE
nodea:26799:27026 [1] NCCL INFO comm 0x7fdcf0000e20 rank 1 nranks 16 cudaDev 1 busId 16000 - Init COMPLETE
nodea:26799:27025 [0] NCCL INFO comm 0x7fdcec000e20 rank 0 nranks 16 cudaDev 0 busId 10000 - Init COMPLETE
nodea:26799:27030 [4] NCCL INFO comm 0x7fdce4000e20 rank 4 nranks 16 cudaDev 4 busId 89000 - Init COMPLETE
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
nodea:26799:26799 [0] NCCL INFO Launch mode Parallel
nodeb:65671:65858 [5] NCCL INFO Connected all trees
nodeb:65671:65858 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodeb:65671:65858 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65854 [4] NCCL INFO Connected all trees
nodeb:65671:65854 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
nodeb:65671:65854 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
nodeb:65671:65860 [7] NCCL INFO comm 0x7fe030000e20 rank 15 nranks 16 cudaDev 7 busId cf000 - Init COMPLETE
nodeb:65671:65858 [5] NCCL INFO comm 0x7fe038000e20 rank 13 nranks 16 cudaDev 5 busId 92000 - Init COMPLETE
nodeb:65671:65853 [3] NCCL INFO comm 0x7fe03c000e20 rank 11 nranks 16 cudaDev 3 busId 55000 - Init COMPLETE
nodeb:65671:65851 [1] NCCL INFO comm 0x7fe044000e20 rank 9 nranks 16 cudaDev 1 busId 26000 - Init COMPLETE
nodeb:65671:65854 [4] NCCL INFO comm 0x7fe034000e20 rank 12 nranks 16 cudaDev 4 busId 8d000 - Init COMPLETE
nodeb:65671:65850 [0] NCCL INFO comm 0x7fe04c000e20 rank 8 nranks 16 cudaDev 0 busId 20000 - Init COMPLETE
nodeb:65671:65852 [2] NCCL INFO comm 0x7fe048000e20 rank 10 nranks 16 cudaDev 2 busId 50000 - Init COMPLETE
nodeb:65671:65859 [6] NCCL INFO comm 0x7fe02c000e20 rank 14 nranks 16 cudaDev 6 busId c9000 - Init COMPLETE
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: nodea
  PID:        26799
  Message:    connect() to nodeb_xgbe0_ip:1024 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: nodeb
  PID:        65671
  Message:    connect() to nodea_xgbe0_ip:1024 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
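For readers skimming the log above, the inter-node hops can be separated from the intra-node P2P ones with a short script. This is only a sketch: the regular expression is written against the `Channel ... via NET/...` line format visible in this particular log (NCCL 2.10.3) and may not match other versions, and the function name `inter_node_links` is made up for illustration.

```python
import re

# Matches lines such as:
#   "... NCCL INFO Channel 00 : 13[92000] -> 4[89000] [receive] via NET/IB/0/GDRDMA"
# Intra-node lines ("via P2P/...") carry no [send]/[receive] tag and are skipped.
NET_CHANNEL = re.compile(
    r"NCCL INFO Channel (\d+) : (\d+)\[\w+\] -> (\d+)\[\w+\] "
    r"\[(send|receive)\] via (NET/\S+)"
)

def inter_node_links(log_text):
    """Return a sorted list of (channel, src_rank, dst_rank, transport) tuples."""
    links = set()
    for line in log_text.splitlines():
        match = NET_CHANNEL.search(line)
        if match:
            channel, src, dst, _direction, transport = match.groups()
            # The send and receive sides log the same hop; the set deduplicates them.
            links.add((int(channel), int(src), int(dst), transport))
    return sorted(links)

# A few lines copied from the log above.
sample = """\
nodea:26799:27030 [4] NCCL INFO Channel 00 : 13[92000] -> 4[89000] [receive] via NET/IB/0/GDRDMA
nodeb:65671:65858 [5] NCCL INFO Channel 00 : 13[92000] -> 4[89000] [send] via NET/IB/0/GDRDMA
nodea:26799:27031 [5] NCCL INFO Channel 00 : 5[8e000] -> 12[8d000] [send] via NET/IB/0/GDRDMA
nodea:26799:27025 [0] NCCL INFO Channel 00 : 0[10000] -> 7[c9000] via P2P/direct pointer/read
"""

for link in inter_node_links(sample):
    print(link)
```

Every inter-node line in the log above names `NET/IB/0`, which matches the single device (`[0]mlx5_6:1/RoCE`) that NCCL reports in its `NET/IB : Using` lines.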

Can NCCL use different network cards on each node to communicate?

sjeaugey commented 1 year ago

The error here is due to MPI struggling to find the right interface. It is not even easy to tell MPI which interface to use, given that the interface names differ from node to node.
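One possible workaround for the MPI side, offered as an untested sketch: Open MPI's `btl_tcp_if_include` parameter accepts a subnet in CIDR notation as well as interface names, so a subnet can select the right interface on each node even when the names differ. The addresses below are hypothetical (the log only shows placeholders such as `nodea_ip`), and the helper name `shared_cidr` is made up for illustration.

```python
import ipaddress

def shared_cidr(interface_addrs):
    """Return the one network that contains every address, or None if they differ.

    Each entry is an address with its prefix length, e.g. "10.0.0.11/24".
    """
    networks = {ipaddress.ip_interface(addr).network for addr in interface_addrs}
    return str(networks.pop()) if len(networks) == 1 else None

# Hypothetical addresses of the CPU-side interface on each node
# (xgbe0 on nodea, xgbe4 on nodeb, according to the question).
cpu_side = {
    "nodea": "10.0.0.11/24",
    "nodeb": "10.0.0.12/24",
}

cidr = shared_cidr(cpu_side.values())
if cidr is None:
    print("the CPU-side interfaces are not on a common subnet")
else:
    # Candidate arguments for mpirun, keyed by subnet instead of interface name.
    print("--mca btl_tcp_if_include", cidr)
```

If this route is tried, the subnet option would go in place of the `--mca btl_tcp_if_exclude docker0,lo` line, since Open MPI treats the include and exclude lists as mutually exclusive. It only addresses the MPI connection warnings, not how NCCL pairs up the NICs.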

Now even if you got past the MPI issue, I'm not sure how you'd make NCCL communicate across the right interfaces. There is no way to specify a particular network connectivity, and even if we had that information, I'm not sure how we'd be supposed to communicate between GPU X on one node and GPU Y on another node if their NICs are not connected to each other.

So, you should really make sure nodes are identical, and all NICs can talk to all others.
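To make the asymmetry concrete, here is a small sketch that encodes the connectivity listed in the question and reports which interface names are wired to a like-named interface on the other node. It rests on one assumption: an entry such as "xgbe2/6 <-> xgbe0/6" is read as two groups that can reach each other, since the question does not say which individual ports are paired. The names `LINKS` and `classify_names` are illustrative only.

```python
# Connectivity as listed in the question: (nodea interfaces, nodeb interfaces).
# Assumption: each pair of groups is mutually reachable; the exact port-to-port
# pairing inside a group is not given in the question.
LINKS = [
    ({"xgbe0"}, {"xgbe4"}),
    ({"xgbe2", "xgbe6"}, {"xgbe0", "xgbe6"}),
    ({"xgbe4", "xgbe8"}, {"xgbe2", "xgbe8"}),
]

def classify_names(links):
    """Split interface names into those that meet the same name on the peer
    node and those that appear on only one side of their link group."""
    matched, mismatched = set(), set()
    for nodea_side, nodeb_side in links:
        matched |= nodea_side & nodeb_side      # same name on both ends
        mismatched |= nodea_side ^ nodeb_side   # name present on one end only
    return sorted(matched), sorted(mismatched)

matched, mismatched = classify_names(LINKS)
print("same name on both nodes:", matched)
print("different name on the peer:", mismatched)
```

Under that reading, only xgbe6 and xgbe8 line up by name; xgbe0, xgbe2 and xgbe4 each land on a differently named interface on the other node, which is why a single interface name cannot describe the same link on both nodes.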