NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Although it is an InfiniBand environment, it seems that the average Bandwidth is not as good as expected. #182

Open gim4moon opened 9 months ago

gim4moon commented 9 months ago

root@testgpu1:/nccl-tests# mpirun -x NCCL_DEBUG=WARN -x NCCL_IB_HCA=mlx5_0 -x NCCL_NET_GDR_LEVEL=5 -x NCCL_SHM_DISABLE=1 -x NCCL_IB_MERGE_VFS=0 -x NCCL_IB_DISABLE=0 -np 2 --allow-run-as-root -H testgpu1,testgpu2 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices
Rank 0 Group 0 Pid 3591 on testgpu1 device 0 [0x03] NVIDIA A100-PCIE-40GB
Rank 1 Group 0 Pid 3591 on testgpu1 device 1 [0x03] NVIDIA A100-PCIE-40GB
Rank 0 Group 0 Pid 2451 on testgpu2 device 0 [0x03] NVIDIA A100-PCIE-40GB
Rank 1 Group 0 Pid 2451 on testgpu2 device 1 [0x03] NVIDIA A100-PCIE-40GB
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3

                                                          out-of-place                       in-place
   size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

                                                          out-of-place                       in-place
   size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
      8             2     float     sum      -1    52.70    0.00    0.00      0    56.24    0.00    0.00      0
      16             4     float     sum      -1    57.32    0.00    0.00      0    57.40    0.00    0.00      0
      32             8     float     sum      -1    56.71    0.00    0.00      0    61.03    0.00    0.00      0
      64            16     float     sum      -1    58.40    0.00    0.00      0    58.59    0.00    0.00      0
     128            32     float     sum      -1    57.44    0.00    0.00      0    57.11    0.00    0.00      0
     256            64     float     sum      -1    58.86    0.00    0.00      0    58.87    0.00    0.00      0
     512           128     float     sum      -1    60.39    0.01    0.01      0    59.77    0.01    0.01      0
    1024           256     float     sum      -1    57.90    0.02    0.02      0    57.87    0.02    0.02      0
    2048           512     float     sum      -1    57.01    0.04    0.04      0    60.72    0.03    0.03      0
    4096          1024     float     sum      -1    57.33    0.07    0.07      0    56.84    0.07    0.07      0
       8             2     float     sum      -1    56.58    0.00    0.00      0    54.30    0.00    0.00      0
    8192          2048     float     sum      -1    57.35    0.14    0.14      0    54.83    0.15    0.15      0
      16             4     float     sum      -1    56.80    0.00    0.00      0    55.59    0.00    0.00      0
      32             8     float     sum      -1    54.66    0.00    0.00      0    58.38    0.00    0.00      0
   16384          4096     float     sum      -1    61.67    0.27    0.27      0    62.07    0.26    0.26      0
      64            16     float     sum      -1    56.51    0.00    0.00      0    55.49    0.00    0.00      0
     128            32     float     sum      -1    56.41    0.00    0.00      0    55.03    0.00    0.00      0
   32768          8192     float     sum      -1    72.36    0.45    0.45      0    72.43    0.45    0.45      0
     256            64     float     sum      -1    55.59    0.00    0.00      0    55.77    0.00    0.00      0
     512           128     float     sum      -1    54.15    0.01    0.01      0    56.85    0.01    0.01      0
    1024           256     float     sum      -1    55.30    0.02    0.02      0    55.75    0.02    0.02      0
    2048           512     float     sum      -1    53.62    0.04    0.04      0    55.89    0.04    0.04      0
    4096          1024     float     sum      -1    55.86    0.07    0.07      0    57.41    0.07    0.07      0
   65536         16384     float     sum      -1    96.49    0.68    0.68      0    95.54    0.69    0.69      0
    8192          2048     float     sum      -1    62.53    0.13    0.13      0    59.75    0.14    0.14      0
   16384          4096     float     sum      -1    70.49    0.23    0.23      0    72.05    0.23    0.23      0
   32768          8192     float     sum      -1    107.9    0.30    0.30      0    110.6    0.30    0.30      0
  131072         32768     float     sum      -1    155.8    0.84    0.84      0    131.2    1.00    1.00      0
   65536         16384     float     sum      -1    215.7    0.30    0.30      0    221.6    0.30    0.30      0
  262144         65536     float     sum      -1    140.9    1.86    1.86      0    147.7    1.78    1.78      0
  131072         32768     float     sum      -1    396.0    0.33    0.33      0    410.8    0.32    0.32      0
  524288        131072     float     sum      -1    220.5    2.38    2.38      0    215.2    2.44    2.44      0
  262144         65536     float     sum      -1    148.3    1.77    1.77      0    149.0    1.76    1.76      0
 1048576        262144     float     sum      -1    339.7    3.09    3.09      0    313.4    3.35    3.35      0
  524288        131072     float     sum      -1    222.7    2.35    2.35      0    210.1    2.50    2.50      0
 1048576        262144     float     sum      -1    348.9    3.01    3.01      0    324.2    3.23    3.23      0
 2097152        524288     float     sum      -1    562.2    3.73    3.73      0    535.0    3.92    3.92      0
 2097152        524288     float     sum      -1    589.8    3.56    3.56      0    545.8    3.84    3.84      0
 4194304       1048576     float     sum      -1   1017.1    4.12    4.12      0    941.8    4.45    4.45      0
 4194304       1048576     float     sum      -1   1051.0    3.99    3.99      0   1018.6    4.12    4.12      0
 8388608       2097152     float     sum      -1   1852.5    4.53    4.53      0   1807.3    4.64    4.64      0
 8388608       2097152     float     sum      -1   1979.3    4.24    4.24      0   1960.5    4.28    4.28      0
16777216       4194304     float     sum      -1   3543.4    4.73    4.73      0   3545.2    4.73    4.73      0
16777216       4194304     float     sum      -1   3852.2    4.36    4.36      0   3842.9    4.37    4.37      0
33554432       8388608     float     sum      -1   6984.4    4.80    4.80      0   6989.8    4.80    4.80      0
33554432       8388608     float     sum      -1   7553.5    4.44    4.44      0   7543.9    4.45    4.45      0
67108864      16777216     float     sum      -1    13842    4.85    4.85      0    13892    4.83    4.83      0
67108864      16777216     float     sum      -1    14905    4.50    4.50      0    14912    4.50    4.50      0

134217728      33554432     float     sum      -1    27691    4.85    4.85      0    27634    4.86    4.86      0
134217728      33554432     float     sum      -1    29708    4.52    4.52      0    29694    4.52    4.52      0
268435456      67108864     float     sum      -1    55225    4.86    4.86      0    55242    4.86    4.86      0
Out of bounds values : 0 OK
Avg bus bandwidth : 1.80135

268435456      67108864     float     sum      -1    59067    4.54    4.54      0    59082    4.54    4.54      0
Out of bounds values : 0 OK
Avg bus bandwidth : 1.65866


We are testing with two virtual machines, each configured with two A100 GPUs and one HDR InfiniBand adapter via pass-through. I'm asking because the bandwidth is lower than I expected.

Am I missing something?
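
For reference, the algbw and busbw columns can be reproduced from the size and time columns. The sketch below assumes the convention described in the nccl-tests performance notes, where all-reduce bus bandwidth is algbw scaled by 2(n-1)/n for n ranks; the function name is made up for illustration. With n = 2 the factor is 1, which is consistent with algbw and busbw being identical in every row above.

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    Assumes the nccl-tests convention: algbw = size / time and
    busbw = algbw * 2 * (nranks - 1) / nranks.
    """
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Last out-of-place row of the first job: 268435456 bytes in 55225 us.
algbw, busbw = allreduce_bandwidth(268435456, 55225, nranks=2)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The "Avg bus bandwidth" line appears to average busbw over every message size, so the small, latency-bound sizes pull it well below the roughly 4.86 GB/s reached at 256 MB.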

AddyLaddy commented 9 months ago

That output doesn't look correct. Did you compile nccl-tests with MPI=1?

Once you've corrected that, it would be good to see logs with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH.
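
One way to see why the output looks off: the device list shows Rank 0 and Rank 1 on both hosts, which suggests two independent 2-rank jobs rather than a single 4-rank job. Below is a minimal sketch that checks a captured header for this pattern. The regular expression and function names are illustrative and not part of nccl-tests.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Rank 0 Group 0 Pid 3591 on testgpu1 device 0 [0x03] NVIDIA A100-PCIE-40GB
RANK_LINE = re.compile(
    r"Rank\s+(\d+)\s+Group\s+\d+\s+Pid\s+\d+\s+on\s+(\S+)\s+device\s+\d+"
)


def hosts_per_rank(header_text):
    """Map each rank number to the set of hosts that reported it."""
    hosts = defaultdict(set)
    for rank, host in RANK_LINE.findall(header_text):
        hosts[int(rank)].add(host)
    return dict(hosts)


def looks_like_separate_jobs(header_text):
    """True if any rank number is reported by more than one host."""
    return any(len(h) > 1 for h in hosts_per_rank(header_text).values())


sample = (
    "Rank 0 Group 0 Pid 3591 on testgpu1 device 0 [0x03] NVIDIA A100-PCIE-40GB\n"
    "Rank 1 Group 0 Pid 3591 on testgpu1 device 1 [0x03] NVIDIA A100-PCIE-40GB\n"
    "Rank 0 Group 0 Pid 2451 on testgpu2 device 0 [0x03] NVIDIA A100-PCIE-40GB\n"
    "Rank 1 Group 0 Pid 2451 on testgpu2 device 1 [0x03] NVIDIA A100-PCIE-40GB\n"
)
print(looks_like_separate_jobs(sample))
```

In a single job spanning both hosts, each rank number would be expected to appear once, so the check would come back false.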

gim4moon commented 9 months ago


Thanks for your reply!

I entered the settings as instructed and am sharing the results below.

The results seem to be similar.

root@testgpu1:/mpi-nfs/nccl-tests# mpirun -x MPI=1 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_IB_HCA=mlx5_0 -x NCCL_NET_GDR_LEVEL=5 -x NCCL_SHM_DISABLE=1 -x NCCL_IB_MERGE_VFS=0 -x NCCL_IGNORE_DISABLED_P2P=1 -x NCCL_IB_DISABLE=0 -np 2 --allow-run-as-root -H testgpu1,testgpu2 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices
Rank 0 Group 0 Pid 3255 on testgpu2 device 0 [0x03] NVIDIA A100-PCIE-40GB
Rank 1 Group 0 Pid 3255 on testgpu2 device 1 [0x03] NVIDIA A100-PCIE-40GB
Rank 0 Group 0 Pid 2832 on testgpu1 device 0 [0x03] NVIDIA A100-PCIE-40GB
Rank 1 Group 0 Pid 2832 on testgpu1 device 1 [0x03] NVIDIA A100-PCIE-40GB
testgpu2:3255:3255 [0] NCCL INFO Bootstrap : Using ibs65:<0>
testgpu2:3255:3255 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
testgpu1:2832:2832 [0] NCCL INFO Bootstrap : Using ibs65:<0>
testgpu1:2832:2832 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
testgpu2:3255:3255 [1] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
testgpu1:2832:2832 [1] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
testgpu2:3255:3262 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
testgpu2:3255:3262 [1] NCCL INFO NCCL_IB_HCA set to mlx5_0
testgpu2:3255:3262 [1] NCCL INFO NCCL_IB_MERGE_VFS set by environment to 0.
testgpu2:3255:3262 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs65:<0>
testgpu2:3255:3262 [1] NCCL INFO Using non-device net plugin version 0
testgpu2:3255:3262 [1] NCCL INFO Using network IB
testgpu2:3255:3261 [0] NCCL INFO Using non-device net plugin version 0
testgpu2:3255:3261 [0] NCCL INFO Using network IB
testgpu2:3255:3261 [0] NCCL INFO comm 0x55bc161d6b40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0xe5eea11d31e0cffc - Init START
testgpu2:3255:3262 [1] NCCL INFO comm 0x55bc161db6b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3020 commId 0xe5eea11d31e0cffc - Init START
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_speed, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_speed, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_width, ignoring
testgpu1:2832:2839 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
testgpu1:2832:2839 [1] NCCL INFO NCCL_IB_HCA set to mlx5_0
testgpu1:2832:2839 [1] NCCL INFO NCCL_IB_MERGE_VFS set by environment to 0.
testgpu1:2832:2839 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibs65:<0>
testgpu1:2832:2839 [1] NCCL INFO Using non-device net plugin version 0
testgpu1:2832:2839 [1] NCCL INFO Using network IB
testgpu1:2832:2838 [0] NCCL INFO Using non-device net plugin version 0
testgpu1:2832:2838 [0] NCCL INFO Using network IB
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_speed, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_speed, ignoring
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_width, ignoring
testgpu1:2832:2838 [0] NCCL INFO comm 0x55c6f1c65b50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0x5c31d9a008352187 - Init START
testgpu1:2832:2839 [1] NCCL INFO comm 0x55c6f1c6a6c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3020 commId 0x5c31d9a008352187 - Init START
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_speed, ignoring
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_width, ignoring
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_speed, ignoring
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:00.0/../max_link_width, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_speed, ignoring
testgpu2:3255:3262 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_width, ignoring
testgpu2:3255:3262 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3262 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3262 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3262 [1] NCCL INFO NCCL_IGNORE_DISABLED_P2P set by environment to 1.
testgpu2:3255:3262 [1] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
testgpu2:3255:3262 [1] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
testgpu2:3255:3262 [1] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
testgpu2:3255:3262 [1] NCCL INFO CPU/0 (1/2/-1)
testgpu2:3255:3262 [1] NCCL INFO + PCI[12.0] - NIC/3010
testgpu2:3255:3262 [1] NCCL INFO + NET[25.0] - NET/0 (c2ed3f0003ebc008/1/25.000000)
testgpu2:3255:3262 [1] NCCL INFO + PCI[12.0] - GPU/3020 (1)
testgpu2:3255:3262 [1] NCCL INFO ==========================================
testgpu2:3255:3262 [1] NCCL INFO GPU/3020 :GPU/3020 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/12.000000/PHB)
testgpu2:3255:3262 [1] NCCL INFO NET/0 :GPU/3020 (3/12.000000/PHB) CPU/0 (2/12.000000/PHB) NET/0 (0/5000.000000/LOC)
testgpu2:3255:3262 [1] NCCL INFO Setting affinity for GPU 1 to 01
testgpu2:3255:3262 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu2:3255:3262 [1] NCCL INFO 0 : NET/0 GPU/1 NET/0
testgpu2:3255:3262 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu2:3255:3262 [1] NCCL INFO 0 : NET/0 GPU/1 NET/0
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_speed, ignoring
testgpu2:3255:3261 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3261 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu2:3255:3261 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_speed, ignoring
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
testgpu2:3255:3261 [0] NCCL INFO CPU/0 (1/2/-1)
testgpu2:3255:3261 [0] NCCL INFO + PCI[12.0] - GPU/3000 (0)
testgpu2:3255:3261 [0] NCCL INFO + PCI[12.0] - NIC/3010
testgpu2:3255:3261 [0] NCCL INFO + NET[25.0] - NET/0 (c2ed3f0003ebc008/1/25.000000)
testgpu2:3255:3261 [0] NCCL INFO ==========================================
testgpu2:3255:3261 [0] NCCL INFO GPU/3000 :GPU/3000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/12.000000/PHB)
testgpu2:3255:3261 [0] NCCL INFO NET/0 :GPU/3000 (3/12.000000/PHB) CPU/0 (2/12.000000/PHB) NET/0 (0/5000.000000/LOC)
testgpu2:3255:3261 [0] NCCL INFO Setting affinity for GPU 0 to 01
testgpu2:3255:3261 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu2:3255:3261 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
testgpu2:3255:3261 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu2:3255:3261 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
testgpu2:3255:3262 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
testgpu2:3255:3262 [1] NCCL INFO Tree 1 : -1 -> 1 -> 0/-1/-1
testgpu2:3255:3262 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0
testgpu2:3255:3262 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0
testgpu2:3255:3262 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
testgpu2:3255:3262 [1] NCCL INFO P2P Chunksize set to 131072
testgpu2:3255:3261 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
testgpu2:3255:3261 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
testgpu2:3255:3261 [0] NCCL INFO Channel 00/02 : 0 1
testgpu2:3255:3261 [0] NCCL INFO Channel 01/02 : 0 1
testgpu2:3255:3261 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
testgpu2:3255:3261 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
testgpu2:3255:3261 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
testgpu2:3255:3261 [0] NCCL INFO P2P Chunksize set to 131072
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_speed, ignoring
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:02.0/../max_link_width, ignoring
testgpu2:3255:3261 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
testgpu2:3255:3261 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
testgpu2:3255:3261 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3261 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO Connected all rings
testgpu2:3255:3262 [1] NCCL INFO Connected all trees
testgpu2:3255:3262 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
testgpu2:3255:3262 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
testgpu2:3255:3261 [0] NCCL INFO Connected all rings
testgpu2:3255:3261 [0] NCCL INFO Connected all trees
testgpu2:3255:3261 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
testgpu2:3255:3261 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_speed, ignoring
testgpu1:2832:2839 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_width, ignoring
testgpu1:2832:2839 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO NCCL_IGNORE_DISABLED_P2P set by environment to 1.
testgpu1:2832:2839 [1] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
testgpu1:2832:2839 [1] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
testgpu1:2832:2839 [1] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
testgpu1:2832:2839 [1] NCCL INFO CPU/0 (1/2/-1)
testgpu1:2832:2839 [1] NCCL INFO + PCI[12.0] - NIC/3010
testgpu1:2832:2839 [1] NCCL INFO + NET[25.0] - NET/0 (e6ea3f0003ebc008/1/25.000000)
testgpu1:2832:2839 [1] NCCL INFO + PCI[12.0] - GPU/3020 (1)
testgpu1:2832:2839 [1] NCCL INFO ==========================================
testgpu1:2832:2839 [1] NCCL INFO GPU/3020 :GPU/3020 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/12.000000/PHB)
testgpu1:2832:2839 [1] NCCL INFO NET/0 :GPU/3020 (3/12.000000/PHB) CPU/0 (2/12.000000/PHB) NET/0 (0/5000.000000/LOC)
testgpu1:2832:2839 [1] NCCL INFO Setting affinity for GPU 1 to 01
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_speed, ignoring
testgpu1:2832:2838 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:03/0000:03:01.0/../max_link_width, ignoring
testgpu1:2832:2839 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu1:2832:2839 [1] NCCL INFO 0 : NET/0 GPU/1 NET/0
testgpu1:2832:2839 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu1:2832:2838 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2838 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2838 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
testgpu1:2832:2839 [1] NCCL INFO 0 : NET/0 GPU/1 NET/0
testgpu1:2832:2838 [0] NCCL INFO === System : maxBw 12.0 totalBw 12.0 ===
testgpu1:2832:2838 [0] NCCL INFO CPU/0 (1/2/-1)
testgpu1:2832:2838 [0] NCCL INFO + PCI[12.0] - GPU/3000 (0)
testgpu1:2832:2838 [0] NCCL INFO + PCI[12.0] - NIC/3010
testgpu1:2832:2838 [0] NCCL INFO + NET[25.0] - NET/0 (e6ea3f0003ebc008/1/25.000000)
testgpu1:2832:2838 [0] NCCL INFO ==========================================
testgpu1:2832:2838 [0] NCCL INFO GPU/3000 :GPU/3000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/12.000000/PHB)
testgpu1:2832:2838 [0] NCCL INFO NET/0 :GPU/3000 (3/12.000000/PHB) CPU/0 (2/12.000000/PHB) NET/0 (0/5000.000000/LOC)
testgpu1:2832:2838 [0] NCCL INFO Setting affinity for GPU 0 to 01
testgpu1:2832:2838 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu1:2832:2838 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
testgpu1:2832:2838 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/PHB, sameChannels 1
testgpu1:2832:2838 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
testgpu1:2832:2838 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
testgpu1:2832:2838 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
testgpu1:2832:2839 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
testgpu1:2832:2839 [1] NCCL INFO Tree 1 : -1 -> 1 -> 0/-1/-1
testgpu1:2832:2839 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0
testgpu1:2832:2839 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0
testgpu1:2832:2839 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
testgpu1:2832:2839 [1] NCCL INFO P2P Chunksize set to 131072
testgpu1:2832:2838 [0] NCCL INFO Channel 00/02 : 0 1
testgpu1:2832:2838 [0] NCCL INFO Channel 01/02 : 0 1
testgpu1:2832:2838 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
testgpu1:2832:2838 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
testgpu1:2832:2838 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
testgpu1:2832:2838 [0] NCCL INFO P2P Chunksize set to 131072
testgpu1:2832:2838 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
testgpu1:2832:2838 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [receive] via NET/IB/0/GDRDMA
testgpu1:2832:2838 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
testgpu1:2832:2838 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [send] via NET/IB/0/GDRDMA
testgpu1:2832:2839 [1] NCCL INFO Channel 00/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
testgpu1:2832:2839 [1] NCCL INFO Channel 01/0 : 0[0] -> 1[1] [receive] via NET/IB/0/GDRDMA
testgpu1:2832:2839 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] [send] via NET/IB/0/GDRDMA
testgpu1:2832:2839 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] [send] via NET/IB/0/GDRDMA
testgpu2:3255:3262 [1] NCCL INFO comm 0x55bc161db6b0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3020 commId 0xe5eea11d31e0cffc - Init COMPLETE
testgpu2:3255:3261 [0] NCCL INFO comm 0x55bc161d6b40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0xe5eea11d31e0cffc - Init COMPLETE

                                                          out-of-place                       in-place
   size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

testgpu1:2832:2839 [1] NCCL INFO Connected all rings
testgpu1:2832:2839 [1] NCCL INFO Connected all trees
testgpu1:2832:2838 [0] NCCL INFO Connected all rings
testgpu1:2832:2838 [0] NCCL INFO Connected all trees
testgpu1:2832:2839 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
testgpu1:2832:2839 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
testgpu1:2832:2838 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
testgpu1:2832:2838 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
testgpu1:2832:2838 [0] NCCL INFO comm 0x55c6f1c65b50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0x5c31d9a008352187 - Init COMPLETE
testgpu1:2832:2839 [1] NCCL INFO comm 0x55c6f1c6a6c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 3020 commId 0x5c31d9a008352187 - Init COMPLETE

                                                          out-of-place                       in-place
   size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       8             2     float     sum      -1    40.64    0.00    0.00      0    45.02    0.00    0.00      0
      16             4     float     sum      -1    49.66    0.00    0.00      0    48.21    0.00    0.00      0
      32             8     float     sum      -1    47.28    0.00    0.00      0    48.67    0.00    0.00      0
      64            16     float     sum      -1    47.85    0.00    0.00      0    47.38    0.00    0.00      0
     128            32     float     sum      -1    47.76    0.00    0.00      0    48.64    0.00    0.00      0
     256            64     float     sum      -1    48.23    0.01    0.01      0    47.77    0.01    0.01      0
       8             2     float     sum      -1    46.80    0.00    0.00      0    56.22    0.00    0.00      0
     512           128     float     sum      -1    47.44    0.01    0.01      0    48.22    0.01    0.01      0
      16             4     float     sum      -1    59.49    0.00    0.00      0    57.71    0.00    0.00      0
    1024           256     float     sum      -1    47.44    0.02    0.02      0    53.56    0.02    0.02      0
      32             8     float     sum      -1    56.84    0.00    0.00      0    58.08    0.00    0.00      0
    2048           512     float     sum      -1    50.01    0.04    0.04      0    48.32    0.04    0.04      0
      64            16     float     sum      -1    57.87    0.00    0.00      0    57.77    0.00    0.00      0
    4096          1024     float     sum      -1    47.37    0.09    0.09      0    47.15    0.09    0.09      0
     128            32     float     sum      -1    58.60    0.00    0.00      0    58.63    0.00    0.00      0
    8192          2048     float     sum      -1    47.34    0.17    0.17      0    48.44    0.17    0.17      0
     256            64     float     sum      -1    59.60    0.00    0.00      0    58.33    0.00    0.00      0
   16384          4096     float     sum      -1    55.38    0.30    0.30      0    55.06    0.30    0.30      0
     512           128     float     sum      -1    56.52    0.01    0.01      0    59.72    0.01    0.01      0
    1024           256     float     sum      -1    56.78    0.02    0.02      0    58.03    0.02    0.02      0
   32768          8192     float     sum      -1    77.52    0.42    0.42      0    82.21    0.40    0.40      0
    2048           512     float     sum      -1    58.29    0.04    0.04      0    56.97    0.04    0.04      0
    4096          1024     float     sum      -1    57.71    0.07    0.07      0    58.00    0.07    0.07      0
    8192          2048     float     sum      -1    60.02    0.14    0.14      0    59.03    0.14    0.14      0
   16384          4096     float     sum      -1    80.93    0.20    0.20      0    81.88    0.20    0.20      0
   65536         16384     float     sum      -1    156.4    0.42    0.42      0    157.0    0.42    0.42      0
   32768          8192     float     sum      -1    113.1    0.29    0.29      0    111.1    0.29    0.29      0
  131072         32768     float     sum      -1    306.9    0.43    0.43      0    311.4    0.42    0.42      0
   65536         16384     float     sum      -1    202.6    0.32    0.32      0    206.6    0.32    0.32      0
  131072         32768     float     sum      -1    379.7    0.35    0.35      0    358.5    0.37    0.37      0
  262144         65536     float     sum      -1    139.4    1.88    1.88      0    139.7    1.88    1.88      0
  524288        131072     float     sum      -1    210.6    2.49    2.49      0    201.7    2.60    2.60      0
  262144         65536     float     sum      -1    148.8    1.76    1.76      0    153.8    1.70    1.70      0
 1048576        262144     float     sum      -1    325.3    3.22    3.22      0    338.9    3.09    3.09      0
  524288        131072     float     sum      -1    212.0    2.47    2.47      0    217.1    2.41    2.41      0
 2097152        524288     float     sum      -1    565.9    3.71    3.71      0    556.9    3.77    3.77      0
 1048576        262144     float     sum      -1    339.0    3.09    3.09      0    307.6    3.41    3.41      0
 4194304       1048576     float     sum      -1   1082.5    3.87    3.87      0   1023.0    4.10    4.10      0
 2097152        524288     float     sum      -1    557.0    3.77    3.77      0    536.7    3.91    3.91      0
 4194304       1048576     float     sum      -1   1019.8    4.11    4.11      0    942.2    4.45    4.45      0
 8388608       2097152     float     sum      -1   2001.8    4.19    4.19      0   1978.8    4.24    4.24      0
 8388608       2097152     float     sum      -1   1869.2    4.49    4.49      0   1814.3    4.62    4.62      0
16777216       4194304     float     sum      -1   3863.2    4.34    4.34      0   3865.6    4.34    4.34      0
16777216       4194304     float     sum      -1   3564.1    4.71    4.71      0   3552.0    4.72    4.72      0
33554432       8388608     float     sum      -1   7574.1    4.43    4.43      0   7581.0    4.43    4.43      0
33554432       8388608     float     sum      -1   7022.9    4.78    4.78      0   7030.6    4.77    4.77      0
67108864      16777216     float     sum      -1    14979    4.48    4.48      0    14999    4.47    4.47      0
67108864      16777216     float     sum      -1    13870    4.84    4.84      0    13872    4.84    4.84      0

134217728      33554432     float     sum      -1    29770    4.51    4.51      0    29873    4.49    4.49      0
134217728      33554432     float     sum      -1    27666    4.85    4.85      0    27662    4.85    4.85      0
268435456      67108864     float     sum      -1    55270    4.86    4.86      0    55213    4.86    4.86      0
testgpu1:2832:2832 [1] NCCL INFO comm 0x55c6f1c65b50 rank 0 nranks 2 cudaDev 0 busId 3000 - Destroy COMPLETE
testgpu1:2832:2832 [1] NCCL INFO comm 0x55c6f1c6a6c0 rank 1 nranks 2 cudaDev 1 busId 3020 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth : 1.7535

268435456      67108864     float     sum      -1    59440    4.52    4.52      0    59568    4.51    4.51      0
testgpu2:3255:3255 [1] NCCL INFO comm 0x55bc161d6b40 rank 0 nranks 2 cudaDev 0 busId 3000 - Destroy COMPLETE
testgpu2:3255:3255 [1] NCCL INFO comm 0x55bc161db6b0 rank 1 nranks 2 cudaDev 1 busId 3020 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth : 1.6796

sjeaugey commented 9 months ago

The comment said to compile the nccl-tests with MPI=1, not to run them with it. So please run make clean, then run make again with MPI=1, as instructed in the readme (you should probably read it again, it's not long).

gim4moon commented 9 months ago

For NCCL_HOME=/path/to/nccl, in which directory is NCCL usually installed?

I installed libnccl2 and libnccl-dev, but I can't find the directories.
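
A rough way to look for the install prefix is sketched below, assuming a layout where the header lands in <prefix>/include/nccl.h. The candidate prefixes and the function name are guesses for illustration and are not confirmed anywhere in this thread.

```python
from pathlib import Path

# Hypothetical candidate prefixes; adjust for the actual system.
CANDIDATE_PREFIXES = ["/usr", "/usr/local", "/usr/local/cuda"]


def find_nccl_prefix(prefixes=CANDIDATE_PREFIXES):
    """Return the first prefix that contains include/nccl.h, or None."""
    for prefix in prefixes:
        if (Path(prefix) / "include" / "nccl.h").is_file():
            return str(prefix)
    return None


print(find_nccl_prefix())
```

If this prints /usr, the header is on the compiler's default include path, so the build may work without setting NCCL_HOME at all; that is an expectation, not something verified here.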