NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL test hangs between GPU2 and GPU3 all the time #606

Open lifengli137 opened 2 years ago

lifengli137 commented 2 years ago

Hi NCCL team,

I downloaded the NCCL tests code from GitHub and ran it on a 4-GPU workstation.

The tests on the GPU pairs 0-1, 0-2, 0-3, 1-2, and 1-3 ran normally, as expected, but the test between GPU2 and GPU3 hangs every time:

CUDA_VISIBLE_DEVICES=2,3 ./all_reduce_perf -b 8 -e 16M -f 2 -g 2

The stack trace while hanging is as follows:

Thread 1 "all_reduce_perf" received signal SIGINT, Interrupt.
0x00007f637edaaef7 in sched_yield () at ../sysdeps/unix/syscall-template.S:78
78      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f637edaaef7 in sched_yield () at ../sysdeps/unix/syscall-template.S:78
#1  0x000055798622648f in testStreamSynchronize (ngpus=2, streams=0x7ffdfb224cc0, comms=0x557988d687d0) at common.cu:478
#2  0x00005579862269e2 in completeColl (args=0x7ffdfb224ab0) at common.cu:520
#3  0x000055798622770e in TimeTest (args=0x7ffdfb224ab0, type=ncclFloat32, typeName=0x55798623211f "float", op=ncclSum, opName=0x55798623212c "sum", root=-1) at common.cu:696
#4  0x000055798622468c in AllReduceRunTest (args=0x7ffdfb224ab0, root=0, type=ncclFloat32, typeName=0x55798623211f "float", op=ncclSum, opName=0x55798623212c "sum")
    at all_reduce.cu:103
#5  0x0000557986227c23 in threadRunTests (args=0x7ffdfb224ab0) at common.cu:722
#6  0x000055798622a198 in run () at common.cu:1083
#7  0x000055798622883d in main (argc=9, argv=0x7ffdfb2267b8) at common.cu:925
(gdb)

Output of nvidia-smi:

Tue Dec  7 08:18:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:81:00.0 Off |                  Off |
| 30%   37C    P8    29W / 300W |      0MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:82:00.0 Off |                  Off |
| 30%   30C    P8    21W / 300W |      0MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    Off  | 00000000:C1:00.0 Off |                  Off |
| 40%   68C    P2   117W / 300W |    366MiB / 48685MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    Off  | 00000000:C2:00.0 Off |                  Off |
| 44%   71C    P2   115W / 300W |    366MiB / 48685MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

System topology:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PHB     SYS     NV4     0-127           N/A
GPU1    PHB      X      NV4     SYS     0-127           N/A
GPU2    SYS     NV4      X      PHB     0-127           N/A
GPU3    NV4     SYS     PHB      X      0-127           N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Output of p2pBandwidthLatencyTest:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA RTX A6000, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA RTX A6000, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     1     1
     1       1     1     1     1
     2       1     1     1     1
     3       1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 651.04  16.05  19.82  17.87
     1  18.93 649.96  21.26  21.24
     2  21.36  21.29 649.96  19.03
     3  21.30  21.27  18.95 651.04
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 649.96  26.32  25.90  52.30
     1  26.33 655.41  52.30  24.57
     2  26.30  52.30 653.22  26.32
     3  52.21  24.87  26.30 653.22
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 666.03  22.35  34.59  37.60
     1  23.23 667.16  38.43  38.38
     2  38.47  38.39 669.45  23.68
     3  38.29  38.24  22.71 669.92
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 667.74  48.98  49.35  94.15
     1  51.12 667.16 100.47  49.38
     2  51.07 100.42 668.88  28.52
     3 100.39  47.47  27.81 668.25
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.64  21.76  11.43  20.17
     1  20.53   1.63  17.61  12.04
     2  15.91  14.67   1.66  11.33
     3  14.09  20.54  20.54   1.64

   CPU     0      1      2      3
     0   2.99   8.59   8.93   8.97
     1   9.93   3.29   8.90   8.92
     2   9.28   9.05   3.27   9.66
     3   9.79   9.75   9.62   3.27
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.64   1.53   1.69   1.44
     1   1.53   1.63   1.45   1.72
     2   1.73   1.43   1.61   1.59
     3   1.41   1.73   1.56   1.64

   CPU     0      1      2      3
     0   3.40   2.63   2.65   2.68
     1   2.72   3.38   2.72   2.79
     2   2.66   2.71   3.38   2.76
     3   2.78   2.70   2.74   3.37

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Is this a problem with the workstation's hardware, or is it a software error?

Thank you!

lifengli137 commented 2 years ago

The output of NCCL_DEBUG=info CUDA_VISIBLE_DEVICES=2,3 ./all_reduce_perf -b 8 -e 16M -f 2 -g 2 while hanging is as follows:

# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid    225 on nccl-tests device  0 [0xc1] NVIDIA RTX A6000
#   Rank  1 Pid    225 on nccl-tests device  1 [0xc2] NVIDIA RTX A6000
nccl-tests:225:225 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:225:225 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:225:225 [0] NCCL INFO NET/IB : No device found.
nccl-tests:225:225 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:225:225 [0] NCCL INFO Using network Socket
NCCL version 2.8.2+cuda11.1
nccl-tests:225:240 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
nccl-tests:225:239 [0] NCCL INFO Channel 00/02 :    0   1
nccl-tests:225:239 [0] NCCL INFO Channel 01/02 :    0   1
nccl-tests:225:240 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:225:239 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
nccl-tests:225:239 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:225:240 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:225:239 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:225:240 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:225:239 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:225:240 [1] NCCL INFO Connected all rings
nccl-tests:225:240 [1] NCCL INFO Connected all trees
nccl-tests:225:239 [0] NCCL INFO Connected all rings
nccl-tests:225:240 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
nccl-tests:225:239 [0] NCCL INFO Connected all trees
nccl-tests:225:240 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
nccl-tests:225:239 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
nccl-tests:225:239 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
nccl-tests:225:239 [0] NCCL INFO comm 0x7fe5cabec530 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE
nccl-tests:225:240 [1] NCCL INFO comm 0x7fe5c0000dc0 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
nccl-tests:225:225 [0] NCCL INFO Launch mode Group/CGMD

AddyLaddy commented 2 years ago

I can't think of any reason for that to hang. You're running quite an old version of NCCL though (2.8.2); have you tried the latest version, 2.11.4? I'd also check whether the hang is related to the message size, by using -b and -e to test individual message sizes such as 8, 4K, 64K, 128K, 1M, etc. You could also see whether it's related to a particular protocol by setting NCCL_PROTO=[LL|LL128|SIMPLE] in the environment.
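
One way to run the sweep suggested above is to launch one test per message size and protocol with a timeout, so that a hang is reported instead of blocking the shell. The following is a minimal Python sketch, not an NCCL tool: `build_runs` and `run_one` are hypothetical helper names, the 60-second timeout is an arbitrary assumption, and the protocol names are spelled exactly as in the comment above.

```python
import os
import subprocess

# Message sizes and protocols taken from the suggestion above.
SIZES = ["8", "4K", "64K", "128K", "1M"]
PROTOS = ["LL", "LL128", "SIMPLE"]


def build_runs(binary="./all_reduce_perf", devices="2,3",
               sizes=SIZES, protos=PROTOS):
    """Return (extra_env, argv) pairs, one per protocol and message size."""
    runs = []
    for proto in protos:
        for size in sizes:
            env = {"CUDA_VISIBLE_DEVICES": devices, "NCCL_PROTO": proto}
            # -b and -e set to the same value test a single message size.
            argv = [binary, "-b", size, "-e", size, "-g", "2"]
            runs.append((env, argv))
    return runs


def run_one(extra_env, argv, timeout_s=60):
    """Run one test; return 'ok', 'hang' (timed out) or 'error'."""
    try:
        subprocess.run(argv, env={**os.environ, **extra_env},
                       timeout=timeout_s, check=True,
                       stdout=subprocess.DEVNULL)
        return "ok"
    except subprocess.TimeoutExpired:
        return "hang"
    except (subprocess.CalledProcessError, OSError):
        return "error"
```

Looping over `build_runs()` and printing the result of `run_one` for each entry gives a small table showing which size and protocol combinations time out.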

sjeaugey commented 2 years ago

Also you may want to check whether NCCL_P2P_DISABLE=1 solves the issue. If so, please verify that ACS is disabled.

sjeaugey commented 2 years ago

Actually I'm realizing you have NVLink and the NVLinks are not between 0-1 and 2-3, but instead between 0-3 and 1-2.

So the connection 2-3 is going through PHB. You should definitely try the latest NCCL (2.8.2 isn't the last version of the 2.8 series so it has known bugs). If that doesn't fix it, then you should try NCCL_P2P_LEVEL=PXB and see if it works better. If it does, that means P2P is broken through the CPU, which could be due to a variety of factors, including VT-d.
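
To make the reasoning above concrete, the sketch below reads the topology matrix posted earlier in this issue and reports which GPU pairs would still use P2P for a given NCCL_P2P_LEVEL. It is an illustration under stated assumptions, not NCCL's own logic: the level order follows the values listed in the NCCL environment variable documentation, the mapping from nvidia-smi labels to NCCL levels (in particular NODE to PHB) is my assumption, and `parse_topo`, `to_level` and `p2p_allowed` are hypothetical names.

```python
# NCCL_P2P_LEVEL values, from most to least restrictive.
LEVELS = ["LOC", "NVL", "PIX", "PXB", "PHB", "SYS"]

# Topology matrix copied from the first comment in this issue.
TOPO = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PHB     SYS     NV4     0-127           N/A
GPU1    PHB      X      NV4     SYS     0-127           N/A
GPU2    SYS     NV4      X      PHB     0-127           N/A
GPU3    NV4     SYS     PHB      X      0-127           N/A
"""


def parse_topo(text):
    """Map (i, j) GPU index pairs to the link label in the matrix."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    n = sum(1 for tok in rows[0] if tok.startswith("GPU"))
    links = {}
    for row in rows[1:]:
        if row[0].startswith("GPU"):
            i = int(row[0][3:])
            for j in range(n):
                if i != j:
                    links[(i, j)] = row[1 + j]
    return links


def to_level(link):
    """Translate an nvidia-smi label to an NCCL level (assumed mapping)."""
    if link.startswith("NV"):
        return "NVL"
    if link == "NODE":
        return "PHB"  # assumption: treated like a host-bridge path
    return link


def p2p_allowed(link, p2p_level):
    """True if a path of this type is within the requested P2P level."""
    return LEVELS.index(to_level(link)) <= LEVELS.index(p2p_level)
```

With this matrix, `p2p_allowed(parse_topo(TOPO)[(2, 3)], "PXB")` is false while the NVLink pairs 0-3 and 1-2 remain allowed, which is the point of the NCCL_P2P_LEVEL=PXB experiment: it takes the 2-3 traffic off the host-bridge path without giving up NVLink.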

lifengli137 commented 2 years ago

Thank you @AddyLaddy @sjeaugey, I ran the following tests:

# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  26327 on nccl-tests device  0 [0xc1] NVIDIA RTX A6000
#   Rank  1 Pid  26327 on nccl-tests device  1 [0xc2] NVIDIA RTX A6000
nccl-tests:26327:26327 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:26327:26327 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:26327:26327 [0] NCCL INFO NET/IB : No device found.
nccl-tests:26327:26327 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:26327:26327 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.1
nccl-tests:26327:26342 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
nccl-tests:26327:26341 [0] NCCL INFO Channel 00/04 :    0   1
nccl-tests:26327:26341 [0] NCCL INFO Channel 01/04 :    0   1
nccl-tests:26327:26341 [0] NCCL INFO Channel 02/04 :    0   1
nccl-tests:26327:26341 [0] NCCL INFO Channel 03/04 :    0   1
nccl-tests:26327:26342 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:26327:26341 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
nccl-tests:26327:26341 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:26327:26341 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:26327:26341 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:26327:26341 [0] NCCL INFO Channel 02 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Channel 02 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:26327:26341 [0] NCCL INFO Channel 03 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Channel 03 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Connected all rings
nccl-tests:26327:26341 [0] NCCL INFO Connected all rings
nccl-tests:26327:26342 [1] NCCL INFO Connected all trees
nccl-tests:26327:26341 [0] NCCL INFO Connected all trees
nccl-tests:26327:26342 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
nccl-tests:26327:26342 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
nccl-tests:26327:26341 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
nccl-tests:26327:26341 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
nccl-tests:26327:26342 [1] NCCL INFO comm 0x7fa644000f60 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE
nccl-tests:26327:26341 [0] NCCL INFO comm 0x7fa650000f60 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
nccl-tests:26327:26327 [0] NCCL INFO Launch mode Parallel

# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid    245 on nccl-tests device  0 [0xc1] NVIDIA RTX A6000
#   Rank  1 Pid    245 on nccl-tests device  1 [0xc2] NVIDIA RTX A6000
nccl-tests:245:245 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:245:245 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:245:245 [0] NCCL INFO NET/IB : No device found.
nccl-tests:245:245 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:245:245 [0] NCCL INFO Using network Socket
NCCL version 2.8.2+cuda11.1
nccl-tests:245:259 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
nccl-tests:245:260 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
nccl-tests:245:259 [0] NCCL INFO Channel 00/02 :    0   1
nccl-tests:245:260 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:245:259 [0] NCCL INFO Channel 01/02 :    0   1
nccl-tests:245:259 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
nccl-tests:245:259 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:245:259 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via direct shared memory
nccl-tests:245:259 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via direct shared memory
nccl-tests:245:260 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via direct shared memory
nccl-tests:245:260 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via direct shared memory
nccl-tests:245:260 [1] NCCL INFO Connected all rings
nccl-tests:245:260 [1] NCCL INFO Connected all trees
nccl-tests:245:260 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
nccl-tests:245:259 [0] NCCL INFO Connected all rings
nccl-tests:245:260 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
nccl-tests:245:259 [0] NCCL INFO Connected all trees
nccl-tests:245:259 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
nccl-tests:245:259 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
nccl-tests:245:259 [0] NCCL INFO comm 0x7f2e62bec5d0 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE
nccl-tests:245:260 [1] NCCL INFO comm 0x7f2e54000dc0 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
nccl-tests:245:245 [0] NCCL INFO Launch mode Group/CGMD
           8             2     float     sum     8.17    0.00    0.00  0e+00     8.19    0.00    0.00  0e+00
          16             4     float     sum     8.22    0.00    0.00  0e+00     8.27    0.00    0.00  0e+00
          32             8     float     sum     8.17    0.00    0.00  0e+00     8.16    0.00    0.00  0e+00
          64            16     float     sum     8.67    0.01    0.01  0e+00     8.21    0.01    0.01  0e+00
         128            32     float     sum     8.15    0.02    0.02  0e+00     8.31    0.02    0.02  0e+00
         256            64     float     sum     8.20    0.03    0.03  0e+00     8.14    0.03    0.03  0e+00
         512           128     float     sum     8.14    0.06    0.06  0e+00     8.17    0.06    0.06  0e+00
        1024           256     float     sum     8.25    0.12    0.12  0e+00     8.21    0.12    0.12  0e+00
        2048           512     float     sum     8.35    0.25    0.25  0e+00     8.18    0.25    0.25  0e+00
        4096          1024     float     sum     8.71    0.47    0.47  0e+00     8.72    0.47    0.47  0e+00
        8192          2048     float     sum    10.33    0.79    0.79  0e+00    10.26    0.80    0.80  0e+00
       16384          4096     float     sum    14.42    1.14    1.14  0e+00    14.38    1.14    1.14  0e+00
       32768          8192     float     sum    20.12    1.63    1.63  0e+00    20.00    1.64    1.64  0e+00
       65536         16384     float     sum    31.52    2.08    2.08  0e+00    31.91    2.05    2.05  0e+00
      131072         32768     float     sum    46.36    2.83    2.83  0e+00    46.33    2.83    2.83  0e+00
      262144         65536     float     sum    73.91    3.55    3.55  0e+00    75.32    3.48    3.48  0e+00
      524288        131072     float     sum    143.7    3.65    3.65  0e+00    140.2    3.74    3.74  0e+00
     1048576        262144     float     sum    275.4    3.81    3.81  0e+00    280.4    3.74    3.74  0e+00
     2097152        524288     float     sum    549.0    3.82    3.82  0e+00    552.0    3.80    3.80  0e+00
     4194304       1048576     float     sum   1082.6    3.87    3.87  0e+00   1064.3    3.94    3.94  0e+00
     8388608       2097152     float     sum   2119.3    3.96    3.96  0e+00   2119.0    3.96    3.96  0e+00
    16777216       4194304     float     sum   4513.5    3.72    3.72  0e+00   4489.3    3.74    3.74  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.62784
#

# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  26383 on nccl-tests device  0 [0xc1] NVIDIA RTX A6000
#   Rank  1 Pid  26383 on nccl-tests device  1 [0xc2] NVIDIA RTX A6000
nccl-tests:26383:26383 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:26383:26383 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:26383:26383 [0] NCCL INFO NET/IB : No device found.
nccl-tests:26383:26383 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:26383:26383 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.1
nccl-tests:26383:26397 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
nccl-tests:26383:26398 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
nccl-tests:26383:26397 [0] NCCL INFO Channel 00/04 :    0   1
nccl-tests:26383:26397 [0] NCCL INFO Channel 01/04 :    0   1
nccl-tests:26383:26398 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:26383:26397 [0] NCCL INFO Channel 02/04 :    0   1
nccl-tests:26383:26397 [0] NCCL INFO Channel 03/04 :    0   1
nccl-tests:26383:26397 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
nccl-tests:26383:26397 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff

nccl-tests:26383:26397 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
nccl-tests:26383:26397 [0] NCCL INFO include/shm.h:41 -> 2

nccl-tests:26383:26398 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
nccl-tests:26383:26398 [1] NCCL INFO include/shm.h:41 -> 2

nccl-tests:26383:26397 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-20db55a2180732ae-0-1-0 (size 9637888)

nccl-tests:26383:26398 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-20db55a2180732ae-0-0-1 (size 9637888)
nccl-tests:26383:26397 [0] NCCL INFO transport/shm.cc:100 -> 2
nccl-tests:26383:26398 [1] NCCL INFO transport/shm.cc:100 -> 2
nccl-tests:26383:26397 [0] NCCL INFO transport.cc:34 -> 2
nccl-tests:26383:26398 [1] NCCL INFO transport.cc:34 -> 2
nccl-tests:26383:26397 [0] NCCL INFO transport.cc:87 -> 2
nccl-tests:26383:26398 [1] NCCL INFO transport.cc:87 -> 2
nccl-tests:26383:26397 [0] NCCL INFO init.cc:804 -> 2
nccl-tests:26383:26398 [1] NCCL INFO init.cc:804 -> 2
nccl-tests:26383:26397 [0] NCCL INFO init.cc:941 -> 2
nccl-tests:26383:26398 [1] NCCL INFO init.cc:941 -> 2
nccl-tests:26383:26397 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
nccl-tests:26383:26398 [1] NCCL INFO group.cc:72 -> 2 [Async thread]
nccl-tests:26383:26383 [1] NCCL INFO init.cc:1010 -> 2
nccl-tests: Test NCCL failure common.cu:1098 'unhandled system error'
 .. nccl-tests pid 26383: Test failure common.cu:1005
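
For reading the result table of the NCCL 2.8.2 run with NCCL_P2P_LEVEL=LOC above: as described in the nccl-tests performance notes, algbw is the message size divided by the time, and for all-reduce busbw is algbw multiplied by 2(n-1)/n, which equals 1 for two ranks. That is why the two columns are identical here. A small sketch with hypothetical function names:

```python
def algbw_gbs(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: bytes per microsecond is MB/s, so divide by 1e3."""
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbs(algbw, nranks):
    """Bus bandwidth for all-reduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (nranks - 1) / nranks


# Last row of the table: 16777216 bytes in 4513.5 us (out-of-place).
bw = algbw_gbs(16777216, 4513.5)
print(round(bw, 2), round(allreduce_busbw_gbs(bw, 2), 2))
```

The roughly 3.7 to 4 GB/s plateau is the shared-memory path through host memory, so it is not comparable with the P2P figures reported by p2pBandwidthLatencyTest.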

[Two images attached; apparently screenshots of the BIOS settings referred to in the next comment]

lifengli137 commented 2 years ago

Even with the BIOS settings mentioned above, there are still many ACSCtl: SrcValid+ entries in the output of sudo lspci -vvv | grep ACSCtl, which I ran by following this link.

Since the machine uses an AMD PCIe switch, the workaround for Broadcom PLX mentioned in the link is not applicable.

$ sudo lspci -vvv | grep ACSCtl
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

lifengli137 commented 2 years ago

I will forward this open issue to the hardware team.

sjeaugey commented 2 years ago

It seems shared memory creation failed on 2.11.4 because /dev/shm is full.

Call to posix_fallocate failed : No space left on device

Anyway, I'll assume shared memory works, but P2P does not work between GPUs 2 and 3. Did you try disabling IOMMU in the BIOS or pass iommu=pt to the linux kernel cmdline?
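
Both points above can be checked from the running system. The sketch below is only an illustration: the segment size 9637888 is taken from the NCCL WARN lines in the failing log, the number of segments NCCL needs is an assumption (two failed creations are visible in the log, more may be required), and the function names are hypothetical.

```python
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Free space in the shared-memory filesystem, in bytes."""
    return shutil.disk_usage(path).free


def fits(free_bytes, segment_bytes=9637888, segments=2):
    """True if the assumed number of segments fits in the free space."""
    return free_bytes >= segment_bytes * segments


def iommu_params(cmdline):
    """Extract IOMMU-related parameters from a kernel command line string."""
    params = {}
    for tok in cmdline.split():
        key, _, value = tok.partition("=")
        if "iommu" in key:
            params[key] = value
    return params
```

On the affected machine, `fits(shm_free_bytes())` indicates whether /dev/shm has room, and `iommu_params(open("/proc/cmdline").read())` shows whether iommu=pt (or a vendor-specific IOMMU option) is already on the kernel command line.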

AddyLaddy commented 2 years ago

You can disable ACS on all devices that support it by running a script such as this one:

disable_acs.sh.txt

lifengli137 commented 2 years ago

> It seems shared memory creation failed on 2.11.4 because /dev/shm is full.
>
> Call to posix_fallocate failed : No space left on device
>
> Anyway, I'll assume shared memory works, but P2P does not work between GPUs 2 and 3. Did you try disabling IOMMU in the BIOS or pass iommu=pt to the linux kernel cmdline?

Thank you very much @sjeaugey! I will get back to this shared memory issue after the P2P issue is resolved.

lifengli137 commented 2 years ago

> You can disable ACS on all devices that support it by running a script such as this one:
>
> disable_acs.sh.txt

Thank you @AddyLaddy!

The following is the output of the script you posted:

sudo bash ./acs.sh
0000:00:07.1 (ecap 000d @2a0) @2a6 0000
0000:00:08.1 (ecap 000d @2a0) @2a6 0000
0000:03:00.0 (ecap 000d @2a0) @2a6 0000
0000:03:00.2 (ecap 000d @2a0) @2a6 0000
0000:04:00.0 (ecap 000d @2a0) @2a6 0000
0000:04:00.2 (ecap 000d @2a0) @2a6 0000
0000:04:00.3 (ecap 000d @2a0) @2a6 0000
0000:40:07.1 (ecap 000d @2a0) @2a6 0000
0000:40:08.1 (ecap 000d @2a0) @2a6 0000
0000:40:08.2 (ecap 000d @2a0) @2a6 0000
0000:40:08.3 (ecap 000d @2a0) @2a6 0000
0000:47:00.0 (ecap 000d @2a0) @2a6 0000
0000:47:00.2 (ecap 000d @2a0) @2a6 0000
0000:48:00.0 (ecap 000d @2a0) @2a6 0000
0000:48:00.1 (ecap 000d @2a0) @2a6 0000
0000:48:00.2 (ecap 000d @2a0) @2a6 0000
0000:48:00.3 (ecap 000d @2a0) @2a6 0000
0000:49:00.0 (ecap 000d @2a0) @2a6 0000
0000:4a:00.0 (ecap 000d @2a0) @2a6 0000
0000:80:07.1 (ecap 000d @2a0) @2a6 0000
0000:80:08.1 (ecap 000d @2a0) @2a6 0000
0000:80:08.2 (ecap 000d @2a0) @2a6 0000
0000:80:08.3 (ecap 000d @2a0) @2a6 0000
0000:83:00.0 (ecap 000d @2a0) @2a6 0000
0000:83:00.2 (ecap 000d @2a0) @2a6 0000
0000:84:00.0 (ecap 000d @2a0) @2a6 0000
0000:84:00.2 (ecap 000d @2a0) @2a6 0000
0000:85:00.0 (ecap 000d @2a0) @2a6 0000
0000:86:00.0 (ecap 000d @2a0) @2a6 0000
0000:c0:07.1 (ecap 000d @2a0) @2a6 0000
0000:c0:08.1 (ecap 000d @2a0) @2a6 0000
0000:c3:00.0 (ecap 000d @2a0) @2a6 0000
0000:c3:00.2 (ecap 000d @2a0) @2a6 0000
0000:c4:00.0 (ecap 000d @2a0) @2a6 0000
0000:c4:00.2 (ecap 000d @2a0) @2a6 0000
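
Each line above appears to have the shape of verbose setpci output: the device address, the extended capability ID and its offset, the register offset, and the value involved. In the PCIe specification the ACS extended capability has ID 0x000d and its control register sits at offset 6 within the capability, which matches @2a0 and @2a6 here. The sketch below, with hypothetical names, checks that a line reports an all-zero ACS control value; it only inspects what the log says and does not read the register back.

```python
import re

# Shape of the lines printed by the script, e.g.
# "0000:00:07.1 (ecap 000d @2a0) @2a6 0000"
LINE = re.compile(
    r"^(\S+) \(ecap ([0-9a-f]{4}) @([0-9a-f]+)\) @([0-9a-f]+) ([0-9a-f]{4})$"
)
ACS_CAP_ID = 0x000D     # ACS extended capability ID
ACS_CTRL_OFFSET = 0x6   # ACS control register offset within the capability


def parse_line(line):
    """Return (bdf, cap_id, cap_offset, reg_offset, value) or None."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    bdf, cap_id, cap_off, reg_off, value = m.groups()
    return bdf, int(cap_id, 16), int(cap_off, 16), int(reg_off, 16), int(value, 16)


def acs_cleared(line):
    """True if the line reports the ACS control register with value 0."""
    parsed = parse_line(line)
    if parsed is None:
        return False
    _, cap_id, cap_off, reg_off, value = parsed
    return (cap_id == ACS_CAP_ID
            and reg_off - cap_off == ACS_CTRL_OFFSET
            and value == 0)
```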

After running it, all the SrcValid+ entries appear to have become SrcValid-; the output of sudo lspci -vvv | grep ACSCtl is as follows:

sudo lspci -vvv | grep ACSCtl
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

However, the NCCL test still hangs.
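A check like the one above can also be scripted rather than read by eye. The following is a minimal sketch, assuming the text of `sudo lspci -vvv` has already been captured into a string; the function names `enabled_acs_flags` and `acs_fully_disabled` are made up for illustration and are not part of any NVIDIA or pciutils tool.

```python
import re

# Matches the ACS control line that `lspci -vvv` prints for ACS-capable ports.
ACS_LINE = re.compile(r"ACSCtl:\s*(.*)")


def enabled_acs_flags(lspci_text):
    """Return, for each ACSCtl line, the list of flags still marked '+'."""
    result = []
    for line in lspci_text.splitlines():
        match = ACS_LINE.search(line)
        if match:
            flags = match.group(1).split()
            # A trailing '+' means enabled, a trailing '-' means disabled.
            result.append([flag[:-1] for flag in flags if flag.endswith("+")])
    return result


def acs_fully_disabled(lspci_text):
    """True when at least one ACSCtl line exists and none has an enabled flag."""
    per_line = enabled_acs_flags(lspci_text)
    return bool(per_line) and all(not flags for flags in per_line)
```

With the output pasted above, `acs_fully_disabled` would return True, which matches the observation that ACS is off everywhere and therefore is probably not what keeps the test hanging.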

lifengli137 commented 2 years ago

It seems shared memory creation failed on 2.11.4 because /dev/shm is full.

Call to posix_fallocate failed : No space left on device

Anyway, I'll assume shared memory works, but P2P does not work between GPUs 2 and 3. Did you try disabling IOMMU in the BIOS or pass iommu=pt to the linux kernel cmdline?

Hi @sjeaugey, we disabled the IOMMU in the BIOS. The error Call to posix_fallocate failed : No space left on device still persisted:

image

$ NCCL_P2P_DISABLE=1 LD_LIBRARY_PATH=/opt/nccl/build/lib/ CUDA_VISIBLE_DEVICES=2,3 NCCL_DEBUG=INFO /opt/nccl-tests/build/all_reduce_perf -b 8 -e 16M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid    286 on nccl-tests device  0 [0xc1] NVIDIA RTX A6000
#   Rank  1 Pid    286 on nccl-tests device  1 [0xc2] NVIDIA RTX A6000
nccl-tests:286:286 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:286:286 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:286:286 [0] NCCL INFO NET/IB : No device found.
nccl-tests:286:286 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:286:286 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.1
nccl-tests:286:301 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
nccl-tests:286:300 [0] NCCL INFO Channel 00/04 :    0   1
nccl-tests:286:301 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
nccl-tests:286:300 [0] NCCL INFO Channel 01/04 :    0   1
nccl-tests:286:300 [0] NCCL INFO Channel 02/04 :    0   1
nccl-tests:286:300 [0] NCCL INFO Channel 03/04 :    0   1
nccl-tests:286:300 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
nccl-tests:286:300 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:286:301 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff

nccl-tests:286:300 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
nccl-tests:286:300 [0] NCCL INFO include/shm.h:41 -> 2

nccl-tests:286:300 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-e3c757a8fd737043-3-1-0 (size 9637888)
nccl-tests:286:300 [0] NCCL INFO transport/shm.cc:100 -> 2
nccl-tests:286:300 [0] NCCL INFO transport.cc:34 -> 2
nccl-tests:286:300 [0] NCCL INFO transport.cc:87 -> 2
nccl-tests:286:300 [0] NCCL INFO init.cc:804 -> 2
nccl-tests:286:300 [0] NCCL INFO init.cc:941 -> 2
nccl-tests:286:300 [0] NCCL INFO group.cc:72 -> 2 [Async thread]

nccl-tests:286:301 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
nccl-tests:286:301 [1] NCCL INFO include/shm.h:41 -> 2

nccl-tests:286:301 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-e3c757a8fd737043-3-0-1 (size 9637888)
nccl-tests:286:301 [1] NCCL INFO transport/shm.cc:100 -> 2
nccl-tests:286:301 [1] NCCL INFO transport.cc:34 -> 2
nccl-tests:286:301 [1] NCCL INFO transport.cc:87 -> 2
nccl-tests:286:301 [1] NCCL INFO init.cc:804 -> 2
nccl-tests:286:301 [1] NCCL INFO init.cc:941 -> 2
nccl-tests:286:301 [1] NCCL INFO group.cc:72 -> 2 [Async thread]
nccl-tests:286:286 [1] NCCL INFO init.cc:1010 -> 2
nccl-tests: Test NCCL failure common.cu:1017 'unhandled system error'
 .. nccl-tests pid 286: Test failure common.cu:925

sjeaugey commented 2 years ago

Turning the IOMMU off is meant to make P2P work again, so you should try removing NCCL_P2P_DISABLE=1.

To solve the shared memory creation issue you need to make sure /dev/shm has enough space.
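One way to check this before launching the test is to compare the free space on /dev/shm with the segment size reported in the failing log (9637888 bytes). This is only a sketch: the helper name `shm_has_room` is hypothetical, and the number of segments NCCL creates in total is not stated in this thread, so the caller has to supply the total it wants to reserve.

```python
import shutil

# Size of one NCCL shared memory segment, copied from the failing log above.
SEGMENT_BYTES = 9637888


def shm_has_room(needed_bytes, path="/dev/shm"):
    """Return True if the filesystem holding `path` has `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes
```

For example, `shm_has_room(2 * SEGMENT_BYTES)` would cover the two segments named in the log, one per rank. Inside a container the size of /dev/shm is usually fixed when the container is started, so it may need to be raised there rather than on the host.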

lifengli137 commented 2 years ago

Thank you @sjeaugey. After increasing the size of /dev/shm, the NCCL test with NCCL_P2P_DISABLE=1 works, as shown below:

$ df -h

Filesystem      Size  Used Avail Use% Mounted on                      
overlay         1.8T   96G  1.6T   6% /                               
tmpfs            64M     0   64M   0% /dev                            
tmpfs           498G     0  498G   0% /sys/fs/cgroup                  
shm             1.0G     0  1.0G   0% /dev/shm                        
/dev/nvme0n1p2  1.8T   96G  1.6T   6% /etc/hosts                      
tmpfs           498G   12K  498G   1% /proc/driver/nvidia             
tmpfs           100G   32M  100G   1% /run/nvidia-persistenced/socket 
udev            498G     0  498G   0% /dev/nvidia2                    
$ LD_LIBRARY_PATH=/opt/nccl/build/lib/ CUDA_VISIBLE_DEVICES=2,3 NCCL_P2P_DISABLE=1 NCCL_DEBUG=INFO /opt/nccl-tests/build/all_reduce_perf -b 8 -e 16M -f 2 -g 2

# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1            
#                                                                                                                   
# Using devices                                                                                                     
#   Rank  0 Pid    233 on nccl-tests device  0 [0xc1] NVIDIA RTX A6000                                              
#   Rank  1 Pid    233 on nccl-tests device  1 [0xc2] NVIDIA RTX A6000                                              
nccl-tests:233:233 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.15<0>                                          
nccl-tests:233:233 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation       
nccl-tests:233:233 [0] NCCL INFO NET/IB : No device found.                                                          
nccl-tests:233:233 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.15<0>                                      
nccl-tests:233:233 [0] NCCL INFO Using network Socket                                                               
NCCL version 2.11.4+cuda11.1                                                                                        
nccl-tests:233:247 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC                                           
nccl-tests:233:248 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1  
nccl-tests:233:247 [0] NCCL INFO Channel 00/04 :    0   1                                                           
nccl-tests:233:247 [0] NCCL INFO Channel 01/04 :    0   1                                                           
nccl-tests:233:248 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff                  
nccl-tests:233:247 [0] NCCL INFO Channel 02/04 :    0   1                                                           
nccl-tests:233:247 [0] NCCL INFO Channel 03/04 :    0   1                                                           
nccl-tests:233:247 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1  
nccl-tests:233:247 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff                  
nccl-tests:233:248 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via direct shared memory                         
nccl-tests:233:248 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via direct shared memory                         
nccl-tests:233:247 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via direct shared memory                         
nccl-tests:233:248 [1] NCCL INFO Channel 02 : 1[c2000] -> 0[c1000] via direct shared memory                         
nccl-tests:233:247 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via direct shared memory                         
nccl-tests:233:248 [1] NCCL INFO Channel 03 : 1[c2000] -> 0[c1000] via direct shared memory                         
nccl-tests:233:247 [0] NCCL INFO Channel 02 : 0[c1000] -> 1[c2000] via direct shared memory                         
nccl-tests:233:247 [0] NCCL INFO Channel 03 : 0[c1000] -> 1[c2000] via direct shared memory                         
nccl-tests:233:247 [0] NCCL INFO Connected all rings                                                                
nccl-tests:233:247 [0] NCCL INFO Connected all trees                                                                
nccl-tests:233:248 [1] NCCL INFO Connected all rings                                                                
nccl-tests:233:247 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512                                        
nccl-tests:233:248 [1] NCCL INFO Connected all trees                                                                
nccl-tests:233:247 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer                           
nccl-tests:233:248 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512                                        
nccl-tests:233:248 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer                           
nccl-tests:233:247 [0] NCCL INFO comm 0x7fec50000f60 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE          
nccl-tests:233:248 [1] NCCL INFO comm 0x7fec48000f60 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE          
#                                                                                                                   
#                                                       out-of-place                       in-place                 
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error        
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)               
nccl-tests:233:233 [0] NCCL INFO Launch mode Parallel                                                               
           8             2     float     sum     7.76    0.00    0.00  0e+00     7.79    0.00    0.00  0e+00        
          16             4     float     sum     7.79    0.00    0.00  0e+00     7.75    0.00    0.00  0e+00        
          32             8     float     sum     8.09    0.00    0.00  0e+00     7.87    0.00    0.00  0e+00        
          64            16     float     sum     7.84    0.01    0.01  0e+00     8.26    0.01    0.01  0e+00        
         128            32     float     sum     7.95    0.02    0.02  0e+00     7.91    0.02    0.02  0e+00        
         256            64     float     sum     7.87    0.03    0.03  0e+00     7.85    0.03    0.03  0e+00        
         512           128     float     sum     7.83    0.07    0.07  0e+00     7.81    0.07    0.07  0e+00        
        1024           256     float     sum     8.19    0.13    0.13  0e+00     7.97    0.13    0.13  0e+00        
        2048           512     float     sum     7.98    0.26    0.26  0e+00     7.97    0.26    0.26  0e+00        
        4096          1024     float     sum     8.43    0.49    0.49  0e+00     8.49    0.48    0.48  0e+00        
        8192          2048     float     sum     9.94    0.82    0.82  0e+00    10.11    0.81    0.81  0e+00        
       16384          4096     float     sum    14.17    1.16    1.16  0e+00    13.96    1.17    1.17  0e+00        
       32768          8192     float     sum    20.96    1.56    1.56  0e+00    21.16    1.55    1.55  0e+00        
       65536         16384     float     sum    35.67    1.84    1.84  0e+00    36.83    1.78    1.78  0e+00        
      131072         32768     float     sum    48.87    2.68    2.68  0e+00    48.07    2.73    2.73  0e+00        
      262144         65536     float     sum    87.16    3.01    3.01  0e+00    87.70    2.99    2.99  0e+00        
      524288        131072     float     sum    156.3    3.35    3.35  0e+00    154.1    3.40    3.40  0e+00        
     1048576        262144     float     sum    334.1    3.14    3.14  0e+00    330.3    3.17    3.17  0e+00        
     2097152        524288     float     sum    670.8    3.13    3.13  0e+00    670.2    3.13    3.13  0e+00        
     4194304       1048576     float     sum   1361.9    3.08    3.08  0e+00   1360.8    3.08    3.08  0e+00        
     8388608       2097152     float     sum   2763.9    3.04    3.04  0e+00   2774.1    3.02    3.02  0e+00        
    16777216       4194304     float     sum   5801.7    2.89    2.89  0e+00   5808.1    2.89    2.89  0e+00        
# Out of bounds values : 0 OK                                                                                       
# Avg bus bandwidth    : 1.39588                                                                                    
#                                                                                                                   
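For readers wondering why the algbw and busbw columns are identical in this run, here is a small worked sketch. It assumes the convention documented for nccl-tests: algbw is the data size divided by the elapsed time, and for all-reduce busbw is algbw scaled by 2*(n-1)/n, where n is the number of ranks. The function names are illustrative only.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: data size divided by elapsed time."""
    # bytes per microsecond equals MB/s; dividing by 1e3 gives GB/s.
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbps(size_bytes, time_us, nranks):
    """Bus bandwidth for all-reduce: algbw scaled by 2*(n-1)/n."""
    return algbw_gbps(size_bytes, time_us) * 2 * (nranks - 1) / nranks


# Last row of the table above: 16777216 bytes in 5801.7 us on 2 ranks.
print(round(algbw_gbps(16777216, 5801.7), 2))
print(round(allreduce_busbw_gbps(16777216, 5801.7, 2), 2))
```

With two ranks the scaling factor is 2*(2-1)/2 = 1, so both columns show the same number. The roughly 3 GB/s plateau is the bandwidth of the shared memory path through the host, not of direct GPU-to-GPU P2P.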

On the other hand, P2P still did not work with both IOMMU and ACS disabled in the North Bridge settings. The BIOS configuration was shown in the previous comment.

$ LD_LIBRARY_PATH=/opt/nccl/build/lib/ CUDA_VISIBLE_DEVICES=2,3 NCCL_DEBUG=INFO /opt/nccl-tests/build/all_reduce_perf -b 8 -e 16M -f 2 -g 2

# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1         
#                                                                                                                
# Using devices                                                                                                  
#   Rank  0 Pid    253 on nccl-tests device  0 [0xc1] NVIDIA RTX A6000                                           
#   Rank  1 Pid    253 on nccl-tests device  1 [0xc2] NVIDIA RTX A6000                                           
nccl-tests:253:253 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.15<0>                                       
nccl-tests:253:253 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation    
nccl-tests:253:253 [0] NCCL INFO NET/IB : No device found.                                                       
nccl-tests:253:253 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.15<0>                                   
nccl-tests:253:253 [0] NCCL INFO Using network Socket                                                            
NCCL version 2.11.4+cuda11.1                                                                                     
nccl-tests:253:268 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-
nccl-tests:253:267 [0] NCCL INFO Channel 00/04 :    0   1                                                        
nccl-tests:253:268 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff               
nccl-tests:253:267 [0] NCCL INFO Channel 01/04 :    0   1                                                        
nccl-tests:253:267 [0] NCCL INFO Channel 02/04 :    0   1                                                        
nccl-tests:253:267 [0] NCCL INFO Channel 03/04 :    0   1                                                        
nccl-tests:253:267 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->
nccl-tests:253:267 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff               
nccl-tests:253:268 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via P2P/direct pointer                        
nccl-tests:253:268 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via P2P/direct pointer                        
nccl-tests:253:268 [1] NCCL INFO Channel 02 : 1[c2000] -> 0[c1000] via P2P/direct pointer                        
nccl-tests:253:267 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via P2P/direct pointer                        
nccl-tests:253:268 [1] NCCL INFO Channel 03 : 1[c2000] -> 0[c1000] via P2P/direct pointer                        
nccl-tests:253:267 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via P2P/direct pointer                        
nccl-tests:253:267 [0] NCCL INFO Channel 02 : 0[c1000] -> 1[c2000] via P2P/direct pointer                        
nccl-tests:253:267 [0] NCCL INFO Channel 03 : 0[c1000] -> 1[c2000] via P2P/direct pointer                        
nccl-tests:253:268 [1] NCCL INFO Connected all rings                                                             
nccl-tests:253:267 [0] NCCL INFO Connected all rings                                                             
nccl-tests:253:268 [1] NCCL INFO Connected all trees                                                             
nccl-tests:253:267 [0] NCCL INFO Connected all trees                                                             
nccl-tests:253:268 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512                                     
nccl-tests:253:268 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer                        
nccl-tests:253:267 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512                                     
nccl-tests:253:267 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer                        
nccl-tests:253:267 [0] NCCL INFO comm 0x7f26b0000f60 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE       
nccl-tests:253:268 [1] NCCL INFO comm 0x7f26a8000f60 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE       
#                                                                                                                
#                                                       out-of-place                       in-place              
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error     
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)            
nccl-tests:253:253 [0] NCCL INFO Launch mode Parallel                                                            
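When comparing runs like the two above, it helps to confirm which transport NCCL actually chose. The sketch below counts the "via ..." suffix on the per-channel connection lines of an NCCL_DEBUG=INFO log. The regular expression is an assumption based only on the line format visible in this thread (NCCL 2.11.4) and may not match other versions; `transports_used` is a made-up name.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via P2P/direct pointer
# It deliberately skips the ring listing lines of the form "Channel 00/04 : 0 1".
CHANNEL_LINE = re.compile(r"Channel \d+ : \d+\[\w+\] -> \d+\[\w+\] via (.+?)\s*$")


def transports_used(log_text):
    """Count how many channel connections use each transport."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

On the hanging run this would report eight connections via "P2P/direct pointer", and on the working run eight via "direct shared memory", which is consistent with the hang being specific to the P2P path between these two GPUs.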

sjeaugey commented 2 years ago

Ok, thanks. Aside from adding iommu=pt to the Linux kernel cmdline, I'm out of ideas. Besides, the fact that only GPUs 2 and 3 have this issue suggests that the configuration might not be the problem. Reseating or swapping cards can sometimes pinpoint the problem to a particular GPU or PCI slot.
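To verify that the iommu=pt suggestion actually took effect after a reboot, the running kernel's command line can be inspected. This is a minimal sketch assuming a Linux host where /proc/cmdline is readable; both helper names are hypothetical.

```python
def has_kernel_param(cmdline, param):
    """True if `param` appears as a whole token in a kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

Calling `has_kernel_param(read_cmdline(), "iommu=pt")` on the workstation shows whether the option is active. Matching whole tokens avoids a false positive from a different option such as amd_iommu=pt. How the parameter is added depends on the bootloader and is not covered here.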