JuiceLemonLemon closed this issue 3 months ago.
Please run with NCCL_DEBUG=INFO to see which NICs NCCL chooses. You may need to exclude mlx5_3 and mlx5_4 with something like NCCL_IB_HCA=^mlx5_3,mlx5_4.
And thank you for including so much relevant info in your report, BTW!
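The `^` prefix applies to the whole list. As a rough illustration, the Python sketch below models the matching rules described in NCCL's environment-variable documentation: a leading `^` excludes the listed HCAs, a leading `=` switches from prefix matching to exact name matching, and an optional `:port` suffix restricts the port. `select_hcas` is a hypothetical helper, not NCCL code, and the six-HCA `mlx5_0` to `mlx5_5` layout is an assumption based on the device list NCCL prints in the log that follows.

```python
def select_hcas(devices, spec):
    """Return the (name, port) pairs kept by an NCCL_IB_HCA-style filter.

    Hypothetical model of the documented rules, not NCCL's implementation.
    devices: list of (name, port) tuples, e.g. [("mlx5_0", 1), ...]
    spec:    filter string such as "^mlx5_3,mlx5_4" or "=mlx5_0:1"
    """
    exclude = spec.startswith("^")   # "^" inverts the whole list
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")     # "=" means exact names, not prefixes
    if exact:
        spec = spec[1:]

    # Each entry is "name" or "name:port".
    rules = []
    for entry in spec.split(","):
        name, _, port = entry.partition(":")
        rules.append((name, int(port) if port else None))

    def matches(dev_name, dev_port):
        for name, port in rules:
            name_ok = dev_name == name if exact else dev_name.startswith(name)
            port_ok = port is None or port == dev_port
            if name_ok and port_ok:
                return True
        return False

    # Keep matching devices normally, non-matching ones when excluding.
    return [(n, p) for n, p in devices if matches(n, p) != exclude]


# Assumed layout: six HCAs, mlx5_0 .. mlx5_5, each using port 1.
nics = [(f"mlx5_{i}", 1) for i in range(6)]
print(select_hcas(nics, "^mlx5_3,mlx5_4"))
```

This leaves mlx5_0, mlx5_1, mlx5_2 and mlx5_5, the same set NCCL reports as `NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_5:1/IB`. With the default prefix matching, `mlx5_1` would also match a device named `mlx5_10` if one existed; starting the list with `=` (or `^=` when excluding) avoids that.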
OK, I ran the command below:
NCCL_DEBUG=INFO NCCL_IB_HCA=^mlx5_3,mlx5_4 mpirun --bind-to none --mca btl '^openib' -n 2 --host ip1,ip2 -x LD_LIBRARY_PATH ./build/all_gather_perf -b 16M -e 1024M -i 16777216 -g 8 -d half -f 2
# nThread 1 nGpus 8 minBytes 16777216 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 2017630 on swat1-04 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 2017630 on swat1-04 device 1 [0x0b] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 2017630 on swat1-04 device 2 [0x48] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 2017630 on swat1-04 device 3 [0x4c] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 2017630 on swat1-04 device 4 [0x88] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 2017630 on swat1-04 device 5 [0x8b] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 2017630 on swat1-04 device 6 [0xc8] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 2017630 on swat1-04 device 7 [0xcb] NVIDIA A100-SXM4-80GB
# Rank 8 Group 0 Pid 1161920 on swat1-05 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 9 Group 0 Pid 1161920 on swat1-05 device 1 [0x0b] NVIDIA A100-SXM4-80GB
# Rank 10 Group 0 Pid 1161920 on swat1-05 device 2 [0x48] NVIDIA A100-SXM4-80GB
# Rank 11 Group 0 Pid 1161920 on swat1-05 device 3 [0x4c] NVIDIA A100-SXM4-80GB
# Rank 12 Group 0 Pid 1161920 on swat1-05 device 4 [0x88] NVIDIA A100-SXM4-80GB
# Rank 13 Group 0 Pid 1161920 on swat1-05 device 5 [0x8b] NVIDIA A100-SXM4-80GB
# Rank 14 Group 0 Pid 1161920 on swat1-05 device 6 [0xc8] NVIDIA A100-SXM4-80GB
# Rank 15 Group 0 Pid 1161920 on swat1-05 device 7 [0xcb] NVIDIA A100-SXM4-80GB
swat1-04:2017630:2017630 [0] NCCL INFO Bootstrap : Using ens21f0:75.12.36.64<0>
swat1-04:2017630:2017630 [0] NCCL INFO cudaDriverVersion 12020
swat1-04:2017630:2017630 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
swat1-04:2017630:2017672 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
swat1-04:2017630:2017672 [3] NCCL INFO NCCL_IB_HCA set to ^mlx5_3,mlx5_4
swat1-04:2017630:2017672 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_5:1/IB [RO]; OOB ens21f0:75.12.36.64<0>
swat1-04:2017630:2017675 [6] NCCL INFO Using network IB
swat1-04:2017630:2017671 [2] NCCL INFO Using network IB
swat1-04:2017630:2017672 [3] NCCL INFO Using network IB
swat1-04:2017630:2017670 [1] NCCL INFO Using network IB
swat1-04:2017630:2017669 [0] NCCL INFO Using network IB
swat1-04:2017630:2017676 [7] NCCL INFO Using network IB
swat1-04:2017630:2017673 [4] NCCL INFO Using network IB
swat1-04:2017630:2017674 [5] NCCL INFO Using network IB
swat1-04:2017630:2017673 [4] NCCL INFO ncclCommInitRank comm 0x8e48640 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 88000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017675 [6] NCCL INFO ncclCommInitRank comm 0x8ec0ac0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId c8000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017670 [1] NCCL INFO ncclCommInitRank comm 0x8d93f80 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017674 [5] NCCL INFO ncclCommInitRank comm 0x8e84880 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 8b000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017672 [3] NCCL INFO ncclCommInitRank comm 0x8e0c400 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 4c000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017676 [7] NCCL INFO ncclCommInitRank comm 0x8efcbc0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId cb000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017669 [0] NCCL INFO ncclCommInitRank comm 0x8d57d40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 7000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017671 [2] NCCL INFO ncclCommInitRank comm 0x8dd01c0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 48000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017670 [1] NCCL INFO Setting affinity for GPU 1 to ff000000,00000000,ff000000
swat1-04:2017630:2017671 [2] NCCL INFO Setting affinity for GPU 2 to ff00,00000000,0000ff00
swat1-04:2017630:2017670 [1] NCCL INFO NVLS multicast support is not available on dev 1
swat1-04:2017630:2017671 [2] NCCL INFO NVLS multicast support is not available on dev 2
swat1-04:2017630:2017673 [4] NCCL INFO Setting affinity for GPU 4 to ff000000,00000000,ff000000,00000000
swat1-04:2017630:2017673 [4] NCCL INFO NVLS multicast support is not available on dev 4
swat1-04:2017630:2017674 [5] NCCL INFO Setting affinity for GPU 5 to ff000000,00000000,ff000000,00000000
swat1-04:2017630:2017674 [5] NCCL INFO NVLS multicast support is not available on dev 5
swat1-04:2017630:2017669 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,00000000,ff000000
swat1-04:2017630:2017675 [6] NCCL INFO Setting affinity for GPU 6 to ff00,00000000,0000ff00,00000000
swat1-04:2017630:2017669 [0] NCCL INFO NVLS multicast support is not available on dev 0
swat1-04:2017630:2017675 [6] NCCL INFO NVLS multicast support is not available on dev 6
swat1-04:2017630:2017672 [3] NCCL INFO Setting affinity for GPU 3 to ff00,00000000,0000ff00
swat1-04:2017630:2017676 [7] NCCL INFO Setting affinity for GPU 7 to ff00,00000000,0000ff00,00000000
swat1-04:2017630:2017676 [7] NCCL INFO NVLS multicast support is not available on dev 7
swat1-04:2017630:2017672 [3] NCCL INFO NVLS multicast support is not available on dev 3
swat1-04:2017630:2017669 [0] NCCL INFO comm 0x8d57d40 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
swat1-04:2017630:2017670 [1] NCCL INFO comm 0x8d93f80 rank 1 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
swat1-04:2017630:2017671 [2] NCCL INFO comm 0x8dd01c0 rank 2 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0
swat1-04:2017630:2017675 [6] NCCL INFO comm 0x8ec0ac0 rank 6 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0
swat1-04:2017630:2017672 [3] NCCL INFO comm 0x8e0c400 rank 3 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0
swat1-04:2017630:2017674 [5] NCCL INFO comm 0x8e84880 rank 5 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0
swat1-04:2017630:2017676 [7] NCCL INFO comm 0x8efcbc0 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
swat1-04:2017630:2017669 [0] NCCL INFO Channel 00/08 : 0 5 4 7 6 3 2 1 8 13 12 15 14 11 10 9
swat1-04:2017630:2017669 [0] NCCL INFO Channel 01/08 : 0 5 4 7 6 3 10 9 8 13 12 15 14 11 2 1
swat1-04:2017630:2017669 [0] NCCL INFO Channel 02/08 : 0 5 4 7 14 11 10 9 8 13 12 15 6 3 2 1
swat1-04:2017630:2017670 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0
swat1-04:2017630:2017670 [1] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017669 [0] NCCL INFO Channel 03/08 : 0 5 12 15 14 11 10 9 8 13 4 7 6 3 2 1
swat1-04:2017630:2017669 [0] NCCL INFO Channel 04/08 : 0 5 4 7 6 3 2 1 8 13 12 15 14 11 10 9
swat1-04:2017630:2017671 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/10/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->10 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1
swat1-04:2017630:2017671 [2] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017673 [4] NCCL INFO comm 0x8e48640 rank 4 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0
swat1-04:2017630:2017669 [0] NCCL INFO Channel 05/08 : 0 5 4 7 6 3 10 9 8 13 12 15 14 11 2 1
swat1-04:2017630:2017669 [0] NCCL INFO Channel 06/08 : 0 5 4 7 14 11 10 9 8 13 12 15 6 3 2 1
swat1-04:2017630:2017669 [0] NCCL INFO Channel 07/08 : 0 5 12 15 14 11 10 9 8 13 4 7 6 3 2 1
swat1-04:2017630:2017669 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->5 [2] 1/-1/-1->0->5 [3] 1/-1/-1->0->5 [4] 1/-1/-1->0->8 [5] 1/-1/-1->0->5 [6] 1/-1/-1->0->5 [7] 1/-1/-1->0->5
swat1-04:2017630:2017669 [0] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017672 [3] NCCL INFO Trees [0] 6/-1/-1->3->2 [1] 6/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] 6/-1/-1->3->2 [4] 6/-1/-1->3->2 [5] 6/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] 6/-1/-1->3->2
swat1-04:2017630:2017672 [3] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017674 [5] NCCL INFO Trees [0] -1/-1/-1->5->4 [1] 0/-1/-1->5->4 [2] 0/-1/-1->5->4 [3] 0/-1/-1->5->4 [4] -1/-1/-1->5->4 [5] 0/-1/-1->5->4 [6] 0/-1/-1->5->4 [7] 0/-1/-1->5->4
swat1-04:2017630:2017674 [5] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017675 [6] NCCL INFO Trees [0] 7/-1/-1->6->3 [1] 7/-1/-1->6->3 [2] 7/14/-1->6->-1 [3] 7/-1/-1->6->3 [4] 7/-1/-1->6->3 [5] 7/-1/-1->6->3 [6] 7/-1/-1->6->14 [7] 7/-1/-1->6->3
swat1-04:2017630:2017675 [6] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017676 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6 [2] 4/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 4/-1/-1->7->6 [6] 4/-1/-1->7->6 [7] -1/-1/-1->7->6
swat1-04:2017630:2017676 [7] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017673 [4] NCCL INFO Trees [0] 5/-1/-1->4->7 [1] 5/-1/-1->4->7 [2] 5/-1/-1->4->7 [3] 5/12/-1->4->-1 [4] 5/-1/-1->4->7 [5] 5/-1/-1->4->7 [6] 5/-1/-1->4->7 [7] 5/-1/-1->4->12
swat1-04:2017630:2017673 [4] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017671 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017671 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017674 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017674 [5] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017673 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017673 [4] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017669 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017669 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017669 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
swat1-04:2017630:2017675 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017675 [6] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017672 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017672 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017670 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017670 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017676 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017676 [7] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017671 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
swat1-04:2017630:2017671 [2] NCCL INFO ncclCommInitRank comm 0x8dd01c0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 48000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017671 [2] NCCL INFO Init timings: rank 2 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.01, topo 0.28, graphs 0.04, connections 0.02, rest 0.01)
swat1-04:2017630:2017675 [6] NCCL INFO ncclCommInitRank comm 0x8ec0ac0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId c8000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017675 [6] NCCL INFO Init timings: rank 6 nranks 16 total 1.08 (kernels 0.50, bootstrap 0.22, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017673 [4] NCCL INFO ncclCommInitRank comm 0x8e48640 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 88000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017669 [0] NCCL INFO ncclCommInitRank comm 0x8d57d40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 7000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017669 [0] NCCL INFO Init timings: rank 0 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017670 [1] NCCL INFO ncclCommInitRank comm 0x8d93f80 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017674 [5] NCCL INFO ncclCommInitRank comm 0x8e84880 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 8b000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017676 [7] NCCL INFO ncclCommInitRank comm 0x8efcbc0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId cb000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017676 [7] NCCL INFO Init timings: rank 7 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017672 [3] NCCL INFO ncclCommInitRank comm 0x8e0c400 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 4c000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017672 [3] NCCL INFO Init timings: rank 3 nranks 16 total 1.08 (kernels 0.50, bootstrap 0.23, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017673 [4] NCCL INFO Init timings: rank 4 nranks 16 total 1.08 (kernels 0.50, bootstrap 0.22, allgathers 0.02, topo 0.28, graphs 0.03, connections 0.02, rest 0.00)
swat1-04:2017630:2017670 [1] NCCL INFO Init timings: rank 1 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017674 [5] NCCL INFO Init timings: rank 5 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.02, topo 0.28, graphs 0.03, connections 0.02, rest 0.00)
#
#                                                              out-of-place                          in-place
#       size         count    type   redop    root      time    algbw    busbw  #wrong      time    algbw    busbw  #wrong
#        (B)    (elements)                              (us)   (GB/s)   (GB/s)              (us)   (GB/s)   (GB/s)
swat1-04:2017630:2017700 [4] NCCL INFO Channel 00/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 00/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 01/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 02/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 02/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 03/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 03/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 04/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 04/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 05/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 05/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 06/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 06/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 07/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [send] via NET/IB/1
swat1-04:2017630:2017704 [0] NCCL INFO Channel 07/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 04/0 : 1[1] -> 8[0] [send] via NET/IB/1
swat1-04:2017630:2017701 [3] NCCL INFO Channel 01/0 : 3[3] -> 10[2] [send] via NET/IB/0
swat1-04:2017630:2017698 [6] NCCL INFO Channel 02/0 : 15[7] -> 6[6] [receive] via NET/IB/2
swat1-04:2017630:2017701 [3] NCCL INFO Channel 05/0 : 3[3] -> 10[2] [send] via NET/IB/0
swat1-04:2017630:2017698 [6] NCCL INFO Channel 06/0 : 15[7] -> 6[6] [receive] via NET/IB/2
swat1-04:2017630:2017698 [6] NCCL INFO Channel 00/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 01/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 02/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 03/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 02/0 : 7[7] -> 14[6] [send] via NET/IB/2
swat1-04:2017630:2017704 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/IB/1
swat1-04:2017630:2017697 [7] NCCL INFO Channel 06/0 : 7[7] -> 14[6] [send] via NET/IB/2
swat1-04:2017630:2017704 [0] NCCL INFO Channel 04/0 : 9[1] -> 0[0] [receive] via NET/IB/1
swat1-04:2017630:2017698 [6] NCCL INFO Channel 04/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 03/0 : 5[5] -> 12[4] [send] via NET/IB/3
swat1-04:2017630:2017700 [4] NCCL INFO Channel 03/0 : 13[5] -> 4[4] [receive] via NET/IB/3
swat1-04:2017630:2017698 [6] NCCL INFO Channel 05/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 07/0 : 5[5] -> 12[4] [send] via NET/IB/3
swat1-04:2017630:2017700 [4] NCCL INFO Channel 07/0 : 13[5] -> 4[4] [receive] via NET/IB/3
swat1-04:2017630:2017697 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 06/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 07/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 01/0 : 11[3] -> 2[2] [receive] via NET/IB/0
swat1-04:2017630:2017697 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 05/0 : 11[3] -> 2[2] [receive] via NET/IB/0
swat1-04:2017630:2017699 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Connected all rings
swat1-04:2017630:2017700 [4] NCCL INFO Connected all rings
swat1-04:2017630:2017704 [0] NCCL INFO Connected all rings
swat1-04:2017630:2017703 [1] NCCL INFO Connected all rings
swat1-04:2017630:2017697 [7] NCCL INFO Connected all rings
swat1-04:2017630:2017698 [6] NCCL INFO Connected all rings
swat1-04:2017630:2017702 [2] NCCL INFO Connected all rings
swat1-04:2017630:2017701 [3] NCCL INFO Connected all rings
    16777216        524288    half    none      -1     565.5    29.67    27.81       0     539.2    31.11    29.17       0
    33554432       1048576    half    none      -1     933.6    35.94    33.70       0     921.9    36.40    34.12       0
    67108864       2097152    half    none      -1    1783.1    37.64    35.28       0    1775.4    37.80    35.44       0
   134217728       4194304    half    none      -1    3600.1    37.28    34.95       0    3583.7    37.45    35.11       0
   268435456       8388608    half    none      -1    7231.3    37.12    34.80       0    7248.6    37.03    34.72       0
   536870912      16777216    half    none      -1     14909    36.01    33.76       0     15052    35.67    33.44       0
  1073741824      33554432    half    none      -1     31408    34.19    32.05       0     31294    34.31    32.17       0
swat1-04:2017630:2017630 [0] NCCL INFO comm 0x8d57d40 rank 0 nranks 16 cudaDev 0 busId 7000 - Destroy COMPLETE
swat1-04:2017630:2017630 [7] NCCL INFO comm 0x8efcbc0 rank 7 nranks 16 cudaDev 7 busId cb000 - Destroy COMPLETE
swat1-04:2017630:2017630 [6] NCCL INFO comm 0x8ec0ac0 rank 6 nranks 16 cudaDev 6 busId c8000 - Destroy COMPLETE
swat1-04:2017630:2017630 [5] NCCL INFO comm 0x8e84880 rank 5 nranks 16 cudaDev 5 busId 8b000 - Destroy COMPLETE
swat1-04:2017630:2017630 [4] NCCL INFO comm 0x8e48640 rank 4 nranks 16 cudaDev 4 busId 88000 - Destroy COMPLETE
swat1-04:2017630:2017630 [3] NCCL INFO comm 0x8e0c400 rank 3 nranks 16 cudaDev 3 busId 4c000 - Destroy COMPLETE
swat1-04:2017630:2017630 [2] NCCL INFO comm 0x8dd01c0 rank 2 nranks 16 cudaDev 2 busId 48000 - Destroy COMPLETE
swat1-04:2017630:2017630 [1] NCCL INFO comm 0x8d93f80 rank 1 nranks 16 cudaDev 1 busId b000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 33.3227
#
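The numbers in this table can be cross-checked by hand. `-b 16M -e 1024M -f 2` doubles the message size from 16 MiB to 1 GiB, which gives the seven rows; for all_gather the `count` column is the per-rank element count, size / (16 ranks × 2 bytes for `half`). Following the formula in the nccl-tests performance notes, algbw is size divided by time, and for all_gather busbw = algbw × (n - 1)/n with n = 16. The Python sketch below recomputes busbw from the printed times; results can differ from the printed busbw in the last digit because the printed times are themselves rounded.

```python
# Rows copied from the all_gather_perf output above:
# (size_bytes, count, oop_time_us, oop_busbw, ip_time_us, ip_busbw)
ROWS = [
    (16777216, 524288, 565.5, 27.81, 539.2, 29.17),
    (33554432, 1048576, 933.6, 33.70, 921.9, 34.12),
    (67108864, 2097152, 1783.1, 35.28, 1775.4, 35.44),
    (134217728, 4194304, 3600.1, 34.95, 3583.7, 35.11),
    (268435456, 8388608, 7231.3, 34.80, 7248.6, 34.72),
    (536870912, 16777216, 14909, 33.76, 15052, 33.44),
    (1073741824, 33554432, 31408, 32.05, 31294, 32.17),
]
NRANKS = 16    # 2 nodes x 8 GPUs
TYPE_SIZE = 2  # -d half


def algbw_gbps(size_bytes, time_us):
    # bytes per microsecond / 1e3 == GB/s with GB = 1e9 bytes
    return size_bytes / time_us / 1e3


def allgather_busbw(algbw, nranks):
    # all_gather moves (n - 1)/n of the output buffer into each rank
    return algbw * (nranks - 1) / nranks


for size, count, t_oop, bus_oop, t_ip, bus_ip in ROWS:
    # For all_gather, count is the number of elements each rank contributes.
    assert count == size // (NRANKS * TYPE_SIZE)
    est = allgather_busbw(algbw_gbps(size, t_oop), NRANKS)
    print(f"{size:>10}  busbw recomputed {est:6.2f}  reported {bus_oop:6.2f}")

busbws = [r[3] for r in ROWS] + [r[5] for r in ROWS]
print(f"mean busbw over all 14 results: {sum(busbws) / len(busbws):.4f}")
```

The final mean comes out at about 33.32 GB/s, consistent with the reported `Avg bus bandwidth : 33.3227` if that figure averages both the out-of-place and in-place results (the small difference would come from averaging rounded values).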
I don't see GDRDMA being enabled on those IB NICs. Is that expected? Have you loaded the nvidia-peermem module, or are you using a DMA-BUF-enabled GPU driver and kernel?
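A quick way to check for GDRDMA in such a log: when NCCL uses GPUDirect RDMA for a network connection, the corresponding INFO line normally ends in `/GDRDMA` (e.g. `via NET/IB/1/GDRDMA`), whereas every NET line in the log above ends in a plain `NET/IB/<n>`. The Python sketch below extracts the network connections from an `NCCL_DEBUG=INFO` log and flags whether each one uses GDRDMA. The regular expression is based on the line format shown in this thread and is an assumption, not a stable NCCL interface.

```python
import re

# Matches NCCL INFO lines such as
#   "Channel 00/0 : 1[1] -> 8[0] [send] via NET/IB/1"
#   "Channel 00/0 : 1[1] -> 8[0] [send] via NET/IB/1/GDRDMA"
NET_LINE = re.compile(
    r"Channel (\d+)/\d+ : (\d+)\[\d+\] -> (\d+)\[\d+\] \[(send|receive)\] "
    r"via NET/(\w+)/(\d+)((?:/\w+)*)"
)


def net_connections(log_text):
    """Return one dict per network connection found in an NCCL_DEBUG=INFO log."""
    conns = []
    for m in NET_LINE.finditer(log_text):
        channel, src, dst, direction, transport, dev, flags = m.groups()
        conns.append({
            "channel": int(channel),
            "src": int(src),
            "dst": int(dst),
            "direction": direction,
            "transport": transport,
            "dev": int(dev),
            "gdrdma": "/GDRDMA" in flags,  # suffix after the device index
        })
    return conns


# Two lines copied from the log in this thread.
log = """\
swat1-04:2017630:2017703 [1] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [send] via NET/IB/1
swat1-04:2017630:2017704 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/IB/1
"""
for c in net_connections(log):
    status = "GDRDMA" if c["gdrdma"] else "no GDRDMA"
    print(c["channel"], c["src"], "->", c["dst"], c["direction"], status)
```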
Sorry, I'm not familiar with GDRDMA. Can you tell me how to enable it?
But you also need to make sure PCIe ACS (Access Control Services) is disabled, unless you're running in a VM environment.
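NCCL's troubleshooting documentation suggests checking for ACS with `sudo lspci -vvv | grep ACSCtl`: a bridge whose `ACSCtl` line shows `SrcValid+` probably has ACS enabled, which can force peer-to-peer PCIe traffic up through the root complex and degrade GPUDirect performance. The Python sketch below parses `lspci -vvv` output and lists the affected devices; `acs_enabled_bridges` is a hypothetical helper, and the sample text is made up to show the format rather than taken from the reporter's machines.

```python
def acs_enabled_bridges(lspci_text):
    """Return PCI addresses whose ACSCtl line shows SrcValid+ (lspci -vvv output)."""
    enabled = []
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented lines start a new device; keep its PCI address.
            current = line.split(" ", 1)[0]
        elif "ACSCtl:" in line and "SrcValid+" in line:
            # Only ACSCtl (control) matters; ACSCap just lists capabilities.
            enabled.append(current)
    return enabled


# Illustrative sample in lspci -vvv format (not real output from this thread).
sample = """\
5d:00.0 PCI bridge: Example vendor PCIe switch port
\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
5e:00.0 PCI bridge: Example vendor PCIe switch port
\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
"""
print(acs_enabled_bridges(sample))
```

On a real node the input would be the output of `lspci -vvv` run as root (capability fields are hidden otherwise); how ACS is then disabled is platform-specific.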
After enabling GDRDMA, the IB bandwidth reaches 93 GB/s with 1 GB of communication data. I think it's normal now. Thank you very much for your help.
Hi, I have a problem similar to https://github.com/NVIDIA/nccl/issues/307: two machines in a cluster connected by 200 Gb/s InfiniBand.
ibstatus:
ib_send_bw shows:
nvidia-smi topo -m shows:
However, nccl-tests only achieves about 36 GB/s, far below the expected bandwidth. Each InfiniBand device provides 200 Gb/s and we have 4 of them, so the expected bandwidth should be 4 × 200 Gb/s ÷ 8 bits per byte = 100 GB/s.
Command: mpirun --bind-to none --mca btl '^openib' -n 2 --host ip1,ip2 -x LD_LIBRARY_PATH ./build/all_gather_perf -b 16M -e 1024M -i 16777216 -g 8 -d half -f 2
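The expected-bandwidth estimate is straightforward arithmetic. A small Python sketch that computes the aggregate line rate of the four 200 Gb/s HCAs and expresses the two bus bandwidths reported in this thread (about 36 GB/s initially, 93 GB/s after enabling GDRDMA) as a fraction of it. It ignores encoding and protocol overhead, so the achievable peak is somewhat below 100 GB/s.

```python
def peak_gbytes_per_s(link_gbits_per_s, links):
    """Aggregate line rate of `links` NICs in GB/s (8 bits per byte, no overhead)."""
    return link_gbits_per_s * links / 8


peak = peak_gbytes_per_s(200, 4)  # 4 x 200 Gb/s InfiniBand HCAs
for label, measured in [("initial report", 36.0), ("after enabling GDRDMA", 93.0)]:
    print(f"{label}: {measured} GB/s = {measured / peak:.0%} of {peak:.0f} GB/s")
```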