NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

has nvswitch, but uses 0 nvls channels #228

Closed MiyazonoKaori closed 1 month ago

MiyazonoKaori commented 1 month ago

The host has NVLink and NVSwitch, but nccl-tests reports 0 nvls channels and the bus bandwidth is only about 10 GB/s. How should I troubleshoot and fix this?

root@user:/home/nccl-tests-master# mpirun --allow-run-as-root -np 8 -x NCCL_DEBUG=INFO  ./build/all_reduce_perf -b 128M -e 4096M -f 2
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              user
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   user
  Local device: mlx5_0
--------------------------------------------------------------------------
# nThread 1 nGpus 1 minBytes 134217728 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5179 on       user device  0 [0x27] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid   5180 on       user device  1 [0x2a] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid   5181 on       user device  2 [0x51] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid   5182 on       user device  3 [0x57] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid   5183 on       user device  4 [0x9e] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid   5184 on       user device  5 [0xa4] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid   5185 on       user device  6 [0xc7] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid   5186 on       user device  7 [0xca] NVIDIA A100-SXM4-80GB
user:5179:5179 [0] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5179:5179 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5179:5179 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5179:5179 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.1+cuda12.1
user:5180:5180 [1] NCCL INFO cudaDriverVersion 12020
user:5180:5180 [1] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5180:5180 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5180:5180 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5181:5181 [2] NCCL INFO cudaDriverVersion 12020
user:5181:5181 [2] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5181:5181 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5181:5181 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5182:5182 [3] NCCL INFO cudaDriverVersion 12020
user:5182:5182 [3] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5182:5182 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5182:5182 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5185:5185 [6] NCCL INFO cudaDriverVersion 12020
user:5185:5185 [6] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5185:5185 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5185:5185 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5186:5186 [7] NCCL INFO cudaDriverVersion 12020
user:5186:5186 [7] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5186:5186 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5186:5186 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5183:5183 [4] NCCL INFO cudaDriverVersion 12020
user:5184:5184 [5] NCCL INFO cudaDriverVersion 12020
user:5183:5183 [4] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5183:5183 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5183:5183 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5184:5184 [5] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5184:5184 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5184:5184 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
[user:05146] 7 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[user:05146] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[user:05146] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
user:5179:5230 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5179:5230 [0] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5181:5232 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5181:5232 [2] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5179:5230 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5179:5230 [0] NCCL INFO Using network IB
user:5182:5233 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5182:5233 [3] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5186:5235 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5186:5235 [7] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5183:5236 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5183:5236 [4] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5181:5232 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5181:5232 [2] NCCL INFO Using network IB
user:5184:5237 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5184:5237 [5] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5182:5233 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5182:5233 [3] NCCL INFO Using network IB
user:5180:5231 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5180:5231 [1] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5186:5235 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5186:5235 [7] NCCL INFO Using network IB
user:5185:5234 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5185:5234 [6] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5183:5236 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5183:5236 [4] NCCL INFO Using network IB
user:5180:5231 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5180:5231 [1] NCCL INFO Using network IB
user:5184:5237 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5184:5237 [5] NCCL INFO Using network IB
user:5185:5234 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5185:5234 [6] NCCL INFO Using network IB
user:5184:5237 [5] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5182:5233 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5184:5237 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:5184:5237 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:5182:5233 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:5186:5235 [7] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5186:5235 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:5186:5235 [7] NCCL INFO NVLS multicast support is not available on dev 7
user:5179:5230 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5179:5230 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:5179:5230 [0] NCCL INFO NVLS multicast support is not available on dev 0
user:5181:5232 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5181:5232 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:5181:5232 [2] NCCL INFO NVLS multicast support is not available on dev 2
user:5180:5231 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5180:5231 [1] NCCL INFO NVLS multicast support is not available on dev 1
user:5185:5234 [6] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5185:5234 [6] NCCL INFO NVLS multicast support is not available on dev 6
user:5183:5236 [4] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5183:5236 [4] NCCL INFO NVLS multicast support is not available on dev 4
user:5179:5230 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
user:5179:5230 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
user:5179:5230 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
user:5179:5230 [0] NCCL INFO P2P Chunksize set to 131072
user:5184:5237 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
user:5184:5237 [5] NCCL INFO P2P Chunksize set to 131072
user:5180:5231 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
user:5180:5231 [1] NCCL INFO P2P Chunksize set to 131072
user:5181:5232 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
user:5181:5232 [2] NCCL INFO P2P Chunksize set to 131072
user:5185:5234 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
user:5185:5234 [6] NCCL INFO P2P Chunksize set to 131072
user:5186:5235 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
user:5186:5235 [7] NCCL INFO P2P Chunksize set to 131072
user:5182:5233 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
user:5182:5233 [3] NCCL INFO P2P Chunksize set to 131072
user:5183:5236 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
user:5183:5236 [4] NCCL INFO P2P Chunksize set to 131072
user:5184:5237 [5] NCCL INFO Channel 00 : 5[a4000] -> 6[c7000] via SHM/direct/direct
user:5186:5235 [7] NCCL INFO Channel 00 : 7[ca000] -> 0[27000] via SHM/direct/direct
user:5182:5233 [3] NCCL INFO Channel 00 : 3[57000] -> 4[9e000] via SHM/direct/direct
user:5186:5235 [7] NCCL INFO Channel 01 : 7[ca000] -> 0[27000] via SHM/direct/direct
user:5184:5237 [5] NCCL INFO Channel 01 : 5[a4000] -> 6[c7000] via SHM/direct/direct
user:5182:5233 [3] NCCL INFO Channel 01 : 3[57000] -> 4[9e000] via SHM/direct/direct
user:5179:5230 [0] NCCL INFO Channel 00 : 0[27000] -> 1[2a000] via SHM/direct/direct
user:5183:5236 [4] NCCL INFO Channel 00 : 4[9e000] -> 5[a4000] via SHM/direct/direct
user:5181:5232 [2] NCCL INFO Channel 00 : 2[51000] -> 3[57000] via SHM/direct/direct
user:5179:5230 [0] NCCL INFO Channel 01 : 0[27000] -> 1[2a000] via SHM/direct/direct
user:5180:5231 [1] NCCL INFO Channel 00 : 1[2a000] -> 2[51000] via SHM/direct/direct
user:5183:5236 [4] NCCL INFO Channel 01 : 4[9e000] -> 5[a4000] via SHM/direct/direct
user:5185:5234 [6] NCCL INFO Channel 00 : 6[c7000] -> 7[ca000] via SHM/direct/direct
user:5181:5232 [2] NCCL INFO Channel 01 : 2[51000] -> 3[57000] via SHM/direct/direct
user:5180:5231 [1] NCCL INFO Channel 01 : 1[2a000] -> 2[51000] via SHM/direct/direct
user:5185:5234 [6] NCCL INFO Channel 01 : 6[c7000] -> 7[ca000] via SHM/direct/direct
user:5184:5237 [5] NCCL INFO Connected all rings
user:5183:5236 [4] NCCL INFO Connected all rings
user:5186:5235 [7] NCCL INFO Connected all rings
user:5182:5233 [3] NCCL INFO Connected all rings
user:5186:5235 [7] NCCL INFO Channel 00 : 7[ca000] -> 6[c7000] via SHM/direct/direct
user:5186:5235 [7] NCCL INFO Channel 01 : 7[ca000] -> 6[c7000] via SHM/direct/direct
user:5180:5231 [1] NCCL INFO Connected all rings
user:5179:5230 [0] NCCL INFO Connected all rings
user:5181:5232 [2] NCCL INFO Connected all rings
user:5185:5234 [6] NCCL INFO Connected all rings
user:5180:5231 [1] NCCL INFO Channel 00 : 1[2a000] -> 0[27000] via SHM/direct/direct
user:5180:5231 [1] NCCL INFO Channel 01 : 1[2a000] -> 0[27000] via SHM/direct/direct
user:5183:5236 [4] NCCL INFO Channel 00 : 4[9e000] -> 3[57000] via SHM/direct/direct
user:5183:5236 [4] NCCL INFO Channel 01 : 4[9e000] -> 3[57000] via SHM/direct/direct
user:5182:5233 [3] NCCL INFO Channel 00 : 3[57000] -> 2[51000] via SHM/direct/direct
user:5184:5237 [5] NCCL INFO Channel 00 : 5[a4000] -> 4[9e000] via SHM/direct/direct
user:5184:5237 [5] NCCL INFO Channel 01 : 5[a4000] -> 4[9e000] via SHM/direct/direct
user:5182:5233 [3] NCCL INFO Channel 01 : 3[57000] -> 2[51000] via SHM/direct/direct
user:5185:5234 [6] NCCL INFO Channel 00 : 6[c7000] -> 5[a4000] via SHM/direct/direct
user:5185:5234 [6] NCCL INFO Channel 01 : 6[c7000] -> 5[a4000] via SHM/direct/direct
user:5181:5232 [2] NCCL INFO Channel 00 : 2[51000] -> 1[2a000] via SHM/direct/direct
user:5181:5232 [2] NCCL INFO Channel 01 : 2[51000] -> 1[2a000] via SHM/direct/direct
user:5179:5230 [0] NCCL INFO Connected all trees
user:5179:5230 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5179:5230 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5186:5235 [7] NCCL INFO Connected all trees
user:5186:5235 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5186:5235 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5183:5236 [4] NCCL INFO Connected all trees
user:5183:5236 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5183:5236 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5181:5232 [2] NCCL INFO Connected all trees
user:5181:5232 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5181:5232 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5180:5231 [1] NCCL INFO Connected all trees
user:5180:5231 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5180:5231 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5182:5233 [3] NCCL INFO Connected all trees
user:5182:5233 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5182:5233 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5184:5237 [5] NCCL INFO Connected all trees
user:5184:5237 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5184:5237 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5185:5234 [6] NCCL INFO Connected all trees
user:5185:5234 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5185:5234 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5184:5237 [5] NCCL INFO comm 0x564d7ea5fc30 rank 5 nranks 8 cudaDev 5 busId a4000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5186:5235 [7] NCCL INFO comm 0x55f350239b90 rank 7 nranks 8 cudaDev 7 busId ca000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5179:5230 [0] NCCL INFO comm 0x564058dcc010 rank 0 nranks 8 cudaDev 0 busId 27000 commId 0x854cf9285512ab36 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
user:5180:5231 [1] NCCL INFO comm 0x56295fb9c590 rank 1 nranks 8 cudaDev 1 busId 2a000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5183:5236 [4] NCCL INFO comm 0x5626b0ec9f90 rank 4 nranks 8 cudaDev 4 busId 9e000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5182:5233 [3] NCCL INFO comm 0x561edcb35a60 rank 3 nranks 8 cudaDev 3 busId 57000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5181:5232 [2] NCCL INFO comm 0x5586b2a03e40 rank 2 nranks 8 cudaDev 2 busId 51000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5185:5234 [6] NCCL INFO comm 0x55d06b211a20 rank 6 nranks 8 cudaDev 6 busId c7000 commId 0x854cf9285512ab36 - Init COMPLETE
   134217728      33554432     float     sum      -1    25061    5.36    9.37      0    25105    5.35    9.36      0
   268435456      67108864     float     sum      -1    50084    5.36    9.38      0    50148    5.35    9.37      0
   536870912     134217728     float     sum      -1   100124    5.36    9.38      0   100244    5.36    9.37      0

root@user:/home/nccl-tests-master# nvidia-smi
Tue Jun 25 11:06:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:27:00.0 Off |                    0 |
| N/A   29C    P0              56W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:2A:00.0 Off |                    0 |
| N/A   27C    P0              59W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:51:00.0 Off |                    0 |
| N/A   27C    P0              60W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:57:00.0 Off |                    0 |
| N/A   30C    P0              59W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:9E:00.0 Off |                    0 |
| N/A   29C    P0              56W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:A4:00.0 Off |                    0 |
| N/A   28C    P0              58W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:C7:00.0 Off |                    0 |
| N/A   26C    P0              57W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   30C    P0              59W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

root@user:/home/nccl-tests-master# nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB PXB PXB PXB SYS SYS 0-31,64-95  0       N/A
GPU1    NV12     X  NV12    NV12    NV12    NV12    NV12    NV12    PXB PXB PXB PXB SYS SYS 0-31,64-95  0       N/A
GPU2    NV12    NV12     X  NV12    NV12    NV12    NV12    NV12    SYS SYS SYS SYS SYS SYS 0-31,64-95  0       N/A
GPU3    NV12    NV12    NV12     X  NV12    NV12    NV12    NV12    SYS SYS SYS SYS SYS SYS 0-31,64-95  0       N/A
GPU4    NV12    NV12    NV12    NV12     X  NV12    NV12    NV12    SYS SYS SYS SYS SYS SYS 32-63,96-127    1       N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X  NV12    NV12    SYS SYS SYS SYS SYS SYS 32-63,96-127    1       N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X  NV12    SYS SYS SYS SYS SYS SYS 32-63,96-127    1       N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X  SYS SYS SYS SYS SYS SYS 32-63,96-127    1       N/A
NIC0    PXB PXB SYS SYS SYS SYS SYS SYS  X  PIX PXB PXB SYS SYS             
NIC1    PXB PXB SYS SYS SYS SYS SYS SYS PIX  X  PXB PXB SYS SYS             
NIC2    PXB PXB SYS SYS SYS SYS SYS SYS PXB PXB  X  PIX SYS SYS             
NIC3    PXB PXB SYS SYS SYS SYS SYS SYS PXB PXB PIX  X  SYS SYS             
NIC4    SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX             
NIC5    SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX  X              

root@user:/home/nccl-tests-master# systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2024-06-25 10:22:43 UTC; 36min ago
    Process: 3418 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
   Main PID: 3429 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 21.4M
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─3429 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Jun 25 10:22:19 user systemd[1]: Starting NVIDIA fabric manager service...
Jun 25 10:22:33 user nv-fabricmanager[3429]: Connected to 1 node.
Jun 25 10:22:43 user nv-fabricmanager[3429]: Successfully configured all the available GPUs and NVSwitches to route NVLink traffic.
Jun 25 10:22:43 user systemd[1]: Started NVIDIA fabric manager service.

root@user:/home/nccl-tests-master# nvidia-smi nvlink -s
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-ff8a40d4-abc6-08d1-f939-15848b5d4e05)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-0c512562-871f-8257-6187-b5ae1b986d5e)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-f4fb3d24-8773-b6b6-ae47-af7b37f5137d)
     Link 0: 25 GB/s
......

sjeaugey commented 1 month ago

NVLS is a feature introduced with H100; A100 GPUs do not support it.
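
A rough way to confirm this on a given machine is sketched below. It is a sketch under assumptions, not an official check: nccl.log is a hypothetical file holding saved NCCL_DEBUG=INFO output, nvls_unavailable_count is an illustrative helper name, and the compute_cap query field needs a reasonably recent driver. A100 reports compute capability 8.0 and H100 reports 9.0.

```shell
#!/usr/bin/env bash

# Count the ranks that logged the NVLS message seen in the output above.
# $1 is a file containing saved NCCL_DEBUG=INFO output.
nvls_unavailable_count() {
    grep -c 'NVLS multicast support is not available' "$1" || true
}

# Show GPU model and compute capability, if nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,compute_cap --format=csv
fi

# nccl.log is a hypothetical path; adjust to wherever the log was saved.
log=nccl.log
if [ -f "$log" ]; then
    echo "ranks without NVLS: $(nvls_unavailable_count "$log")"
fi
```

On the 8-GPU A100 host in this issue, every rank printed that message, so the count would be 8.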

kiskra-nvidia commented 1 month ago

Yes, NVLS won't be available on this platform, but NCCL should still be using regular NVLinks instead of SHM/direct... I see the following in the output:

NCCL_P2P_LEVEL set by environment to LOC

This forces P2P off. Please unset this variable and you should see a considerable speedup...
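
The suggestion above could be carried out roughly as follows. This is a minimal sketch: list_nccl_p2p_vars is an illustrative helper name, and the mpirun line simply repeats the command from the original report. If the variable reappears after unsetting it, it may be coming from a shell profile or from /etc/nccl.conf or ~/.nccl.conf, which NCCL also reads at startup.

```shell
#!/usr/bin/env bash

# Print every NCCL_P2P_* variable currently set in the environment.
list_nccl_p2p_vars() {
    env | grep '^NCCL_P2P_' || true
}

echo "NCCL P2P overrides before cleanup:"
list_nccl_p2p_vars

# Remove the override that the log reported ("set by environment to LOC").
unset NCCL_P2P_LEVEL

# Rerun the benchmark from the report, if the tools are available here.
if command -v mpirun >/dev/null 2>&1 && [ -x ./build/all_reduce_perf ]; then
    mpirun --allow-run-as-root -np 8 -x NCCL_DEBUG=INFO \
        ./build/all_reduce_perf -b 128M -e 4096M -f 2
fi
```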

MiyazonoKaori commented 1 month ago

After running export NCCL_P2P_DISABLE=0, the bandwidth reached 220 GB/s. Thanks!
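
To verify a change like this beyond the bandwidth figure, one option is to tally which transport each channel uses in the NCCL_DEBUG=INFO output. The sketch below assumes the log was saved to a hypothetical file named nccl.log; summarize_transports is an illustrative name. In the original run every channel line ends in "via SHM/direct/direct"; after the fix one would expect a P2P transport there instead, though the exact label depends on the NCCL version.

```shell
#!/usr/bin/env bash

# Tally the transport named after "via" on each NCCL channel line.
# $1 is a file containing saved NCCL_DEBUG=INFO output.
summarize_transports() {
    grep -o 'via [^ ]*' "$1" | sort | uniq -c | sort -rn
}

# nccl.log is a hypothetical path; adjust to wherever the log was saved.
log=nccl.log
if [ -f "$log" ]; then
    summarize_transports "$log"
fi
```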