NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL 2.6.4 makes the system hang. #340

Closed AlexWang1900 closed 4 years ago

AlexWang1900 commented 4 years ago

OS: Ubuntu 20.04 LTS. GPU driver: latest NVIDIA Linux driver. CUDA 10.1, cuDNN 7.6.5, NCCL 2.6.4. Hardware: CPU: Intel Core i5-9400F, motherboard: Z370, RAM: 64 GB dual-channel, GPUs: two RTX 2080 Ti on two PCIe 3.0 x8 slots, with an NVLink bridge between them.

I ran all the nccl-tests, and NCCL appears to be working. But while each test runs (about 30 minutes per test), the system freezes: I can't switch to the browser or do anything else. I can move the mouse, but the system doesn't respond to mouse clicks or keyboard input. When the test finishes, the system returns to normal and the log prints in the console.

The log is here:

#  ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   3795 on w-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid   3795 on w-system device  1 [0x02] GeForce RTX 2080 Ti
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2   float     sum     7.18    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
          16             4   float     sum     7.00    0.00    0.00  0e+00     7.02    0.00    0.00  0e+00
          32             8   float     sum     7.28    0.00    0.00  0e+00     7.19    0.00    0.00  0e+00
          64            16   float     sum     7.20    0.01    0.01  0e+00     7.05    0.01    0.01  0e+00
         128            32   float     sum     7.30    0.02    0.02  0e+00     7.19    0.02    0.02  0e+00
         256            64   float     sum     7.30    0.04    0.04  0e+00     7.20    0.04    0.04  0e+00
         512           128   float     sum     7.47    0.07    0.07  0e+00     7.12    0.07    0.07  0e+00
        1024           256   float     sum     8.14    0.13    0.13  0e+00     7.92    0.13    0.13  0e+00
        2048           512   float     sum     8.56    0.24    0.24  0e+00     8.43    0.24    0.24  0e+00
        4096          1024   float     sum     9.72    0.42    0.42  0e+00     9.49    0.43    0.43  0e+00
        8192          2048   float     sum    11.99    0.68    0.68  0e+00    11.92    0.69    0.69  0e+00
       16384          4096   float     sum    14.36    1.14    1.14  0e+00    14.21    1.15    1.15  0e+00
       32768          8192   float     sum    16.79    1.95    1.95  0e+00    16.64    1.97    1.97  0e+00
       65536         16384   float     sum    21.14    3.10    3.10  0e+00    20.55    3.19    3.19  0e+00
      131072         32768   float     sum    35.56    3.69    3.69  0e+00    35.43    3.70    3.70  0e+00
      262144         65536   float     sum    41.23    6.36    6.36  0e+00    41.21    6.36    6.36  0e+00
      524288        131072   float     sum    50.66   10.35   10.35  0e+00    50.82   10.32   10.32  0e+00
     1048576        262144   float     sum    72.54   14.45   14.45  0e+00    72.45   14.47   14.47  0e+00
     2097152        524288   float     sum    120.7   17.37   17.37  0e+00    118.4   17.71   17.71  0e+00
     4194304       1048576   float     sum    215.2   19.49   19.49  0e+00    214.7   19.53   19.53  0e+00
     8388608       2097152   float     sum    411.3   20.39   20.39  0e+00    399.1   21.02   21.02  0e+00
    16777216       4194304   float     sum    865.3   19.39   19.39  0e+00    779.6   21.52   21.52  0e+00
    33554432       8388608   float     sum   1547.9   21.68   21.68  0e+00   1699.3   19.75   19.75  0e+00
    67108864      16777216   float     sum   3115.1   21.54   21.54  0e+00   3007.4   22.31   22.31  0e+00
   134217728      33554432   float     sum   5994.3   22.39   22.39  0e+00   5991.9   22.40   22.40  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.43886 

./all_gather_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   9119 on w-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid   9119 on w-system device  1 [0x02] GeForce RTX 2080 Ti
#
#                                             out-of-place                       in-place          
#       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             1   float     7.14    0.00    0.00  0e+00     7.06    0.00    0.00  0e+00
          16             2   float     7.03    0.00    0.00  0e+00     7.00    0.00    0.00  0e+00
          32             4   float     6.96    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
          64             8   float     7.10    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
         128            16   float     7.10    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
         256            32   float     7.18    0.02    0.02  0e+00     7.23    0.02    0.02  0e+00
         512            64   float     7.49    0.03    0.03  0e+00     7.47    0.03    0.03  0e+00
        1024           128   float     7.03    0.07    0.07  0e+00     6.96    0.07    0.07  0e+00
        2048           256   float     6.97    0.15    0.15  0e+00     6.97    0.15    0.15  0e+00
        4096           512   float     7.41    0.28    0.28  0e+00     7.00    0.29    0.29  0e+00
        8192          1024   float     9.59    0.43    0.43  0e+00     8.80    0.47    0.47  0e+00
       16384          2048   float    11.41    0.72    0.72  0e+00    10.78    0.76    0.76  0e+00
       32768          4096   float    13.39    1.22    1.22  0e+00    11.85    1.38    1.38  0e+00
       65536          8192   float    16.57    1.98    1.98  0e+00    13.83    2.37    2.37  0e+00
      131072         16384   float    23.07    2.84    2.84  0e+00    18.39    3.56    3.56  0e+00
      262144         32768   float    31.38    4.18    4.18  0e+00    30.27    4.33    4.33  0e+00
      524288         65536   float    36.00    7.28    7.28  0e+00    35.30    7.43    7.43  0e+00
     1048576        131072   float    47.38   11.06   11.06  0e+00    46.84   11.19   11.19  0e+00
     2097152        262144   float    70.44   14.89   14.89  0e+00    69.77   15.03   15.03  0e+00
     4194304        524288   float    120.1   17.46   17.46  0e+00    115.5   18.16   18.16  0e+00
     8388608       1048576   float    212.5   19.73   19.73  0e+00    210.2   19.95   19.95  0e+00
    16777216       2097152   float    418.5   20.05   20.05  0e+00    414.0   20.26   20.26  0e+00
    33554432       4194304   float    817.8   20.51   20.51  0e+00    785.1   21.37   21.37  0e+00
    67108864       8388608   float   1568.3   21.40   21.40  0e+00   1560.9   21.50   21.50  0e+00
   134217728      16777216   float   3298.6   20.34   20.34  0e+00   3070.3   21.86   21.86  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.6972 

./broadcast_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid  26256 on w-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid  26256 on w-system device  1 [0x02] GeForce RTX 2080 Ti
#
#                                                     out-of-place                       in-place          
#       size         count    type    root     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2   float       0     7.24    0.00    0.00  0e+00     7.50    0.00    0.00  0e+00
          16             4   float       0     8.31    0.00    0.00  0e+00     7.69    0.00    0.00  0e+00
          32             8   float       0     8.15    0.00    0.00  0e+00     8.23    0.00    0.00  0e+00
          64            16   float       0     7.19    0.01    0.01  0e+00     7.13    0.01    0.01  0e+00
         128            32   float       0     7.25    0.02    0.02  0e+00     7.45    0.02    0.02  0e+00
         256            64   float       0     7.08    0.04    0.04  0e+00     7.16    0.04    0.04  0e+00
         512           128   float       0     7.47    0.07    0.07  0e+00     7.39    0.07    0.07  0e+00
        1024           256   float       0     7.19    0.14    0.14  0e+00    32.19    0.03    0.03  0e+00
        2048           512   float       0     7.36    0.28    0.28  0e+00     7.03    0.29    0.29  0e+00
        4096          1024   float       0     7.25    0.57    0.57  0e+00     7.07    0.58    0.58  0e+00
        8192          2048   float       0     9.11    0.90    0.90  0e+00     8.10    1.01    1.01  0e+00
       16384          4096   float       0    10.97    1.49    1.49  0e+00    10.52    1.56    1.56  0e+00
       32768          8192   float       0    13.36    2.45    2.45  0e+00    11.73    2.79    2.79  0e+00
       65536         16384   float       0    17.03    3.85    3.85  0e+00    14.24    4.60    4.60  0e+00
      131072         32768   float       0    22.66    5.78    5.78  0e+00    22.60    5.80    5.80  0e+00
      262144         65536   float       0    28.48    9.21    9.21  0e+00    28.45    9.21    9.21  0e+00
      524288        131072   float       0    40.26   13.02   13.02  0e+00    40.08   13.08   13.08  0e+00
     1048576        262144   float       0    63.48   16.52   16.52  0e+00    63.19   16.59   16.59  0e+00
     2097152        524288   float       0    110.1   19.04   19.04  0e+00    109.3   19.19   19.19  0e+00
     4194304       1048576   float       0    205.7   20.39   20.39  0e+00    237.1   17.69   17.69  0e+00
     8388608       2097152   float       0    425.1   19.73   19.73  0e+00    386.7   21.69   21.69  0e+00
    16777216       4194304   float       0    815.0   20.59   20.59  0e+00    824.0   20.36   20.36  0e+00
    33554432       8388608   float       0   1536.8   21.83   21.83  0e+00   1508.2   22.25   22.25  0e+00
    67108864      16777216   float       0   3139.2   21.38   21.38  0e+00   3124.3   21.48   21.48  0e+00
   134217728      33554432   float       0   6283.5   21.36   21.36  0e+00   5873.1   22.85   22.85  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.99748 

$ ./reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   4810 on w-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid   4810 on w-system device  1 [0x02] GeForce RTX 2080 Ti
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop    root     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2   float     sum       0     7.16    0.00    0.00  0e+00     7.35    0.00    0.00  0e+00
          16             4   float     sum       0     7.74    0.00    0.00  0e+00     7.67    0.00    0.00  0e+00
          32             8   float     sum       0     7.08    0.00    0.00  0e+00     7.07    0.00    0.00  0e+00
          64            16   float     sum       0     7.13    0.01    0.01  0e+00     7.14    0.01    0.01  0e+00
         128            32   float     sum       0     7.15    0.02    0.02  0e+00     7.06    0.02    0.02  0e+00
         256            64   float     sum       0     7.14    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
         512           128   float     sum       0     7.14    0.07    0.07  0e+00     7.11    0.07    0.07  0e+00
        1024           256   float     sum       0     7.09    0.14    0.14  0e+00     7.09    0.14    0.14  0e+00
        2048           512   float     sum       0     7.11    0.29    0.29  0e+00     7.12    0.29    0.29  0e+00
        4096          1024   float     sum       0     7.28    0.56    0.56  0e+00     7.20    0.57    0.57  0e+00
        8192          2048   float     sum       0     8.72    0.94    0.94  0e+00     8.59    0.95    0.95  0e+00
       16384          4096   float     sum       0    10.80    1.52    1.52  0e+00    10.78    1.52    1.52  0e+00
       32768          8192   float     sum       0    12.89    2.54    2.54  0e+00    12.64    2.59    2.59  0e+00
       65536         16384   float     sum       0    16.42    3.99    3.99  0e+00    15.88    4.13    4.13  0e+00
      131072         32768   float     sum       0    23.17    5.66    5.66  0e+00    23.27    5.63    5.63  0e+00
      262144         65536   float     sum       0    29.13    9.00    9.00  0e+00    28.88    9.08    9.08  0e+00
      524288        131072   float     sum       0    40.93   12.81   12.81  0e+00    40.93   12.81   12.81  0e+00
     1048576        262144   float     sum       0    64.30   16.31   16.31  0e+00    64.25   16.32   16.32  0e+00
     2097152        524288   float     sum       0    110.5   18.98   18.98  0e+00    110.6   18.97   18.97  0e+00
     4194304       1048576   float     sum       0    202.1   20.76   20.76  0e+00    202.1   20.76   20.76  0e+00
     8388608       2097152   float     sum       0    386.5   21.70   21.70  0e+00    386.3   21.71   21.71  0e+00
    16777216       4194304   float     sum       0    752.6   22.29   22.29  0e+00    752.5   22.30   22.30  0e+00
    33554432       8388608   float     sum       0   1485.2   22.59   22.59  0e+00   1529.3   21.94   21.94  0e+00
    67108864      16777216   float     sum       0   2947.4   22.77   22.77  0e+00   2945.2   22.79   22.79  0e+00
   134217728      33554432   float     sum       0   5873.8   22.85   22.85  0e+00   5873.8   22.85   22.85  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.22671 

$ ./reduce_scatter_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   5435 on w-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid   5435 on w-system device  1 [0x02] GeForce RTX 2080 Ti
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             1   float     sum     7.21    0.00    0.00  0e+00     7.28    0.00    0.00  0e+00
          16             2   float     sum     7.12    0.00    0.00  0e+00     7.18    0.00    0.00  0e+00
          32             4   float     sum     7.14    0.00    0.00  0e+00     7.22    0.00    0.00  0e+00
          64             8   float     sum     7.20    0.00    0.00  0e+00     7.15    0.00    0.00  0e+00
         128            16   float     sum     7.14    0.01    0.01  0e+00     7.12    0.01    0.01  0e+00
         256            32   float     sum     7.16    0.02    0.02  0e+00     7.12    0.02    0.02  0e+00
         512            64   float     sum     7.18    0.04    0.04  0e+00     7.12    0.04    0.04  0e+00
        1024           128   float     sum     7.53    0.07    0.07  0e+00     7.27    0.07    0.07  0e+00
        2048           256   float     sum     7.28    0.14    0.14  0e+00     7.23    0.14    0.14  0e+00
        4096           512   float     sum     7.64    0.27    0.27  0e+00     7.57    0.27    0.27  0e+00
        8192          1024   float     sum     9.35    0.44    0.44  0e+00     9.24    0.44    0.44  0e+00
       16384          2048   float     sum    11.33    0.72    0.72  0e+00    11.23    0.73    0.73  0e+00
       32768          4096   float     sum    12.66    1.29    1.29  0e+00    12.62    1.30    1.30  0e+00
       65536          8192   float     sum    15.39    2.13    2.13  0e+00    15.31    2.14    2.14  0e+00
      131072         16384   float     sum    21.02    3.12    3.12  0e+00    21.35    3.07    3.07  0e+00
      262144         32768   float     sum    32.36    4.05    4.05  0e+00    31.98    4.10    4.10  0e+00
      524288         65536   float     sum    39.63    6.61    6.61  0e+00    39.76    6.59    6.59  0e+00
     1048576        131072   float     sum    57.11    9.18    9.18  0e+00    56.88    9.22    9.22  0e+00
     2097152        262144   float     sum    92.96   11.28   11.28  0e+00    92.54   11.33   11.33  0e+00
     4194304        524288   float     sum    166.4   12.60   12.60  0e+00    165.9   12.64   12.64  0e+00
     8388608       1048576   float     sum    308.5   13.59   13.59  0e+00    504.4    8.32    8.32  0e+00
    16777216       2097152   float     sum   1050.1    7.99    7.99  0e+00    693.5   12.10   12.10  0e+00
    33554432       4194304   float     sum   1533.4   10.94   10.94  0e+00   1414.8   11.86   11.86  0e+00
    67108864       8388608   float     sum   2529.2   13.27   13.27  0e+00   2314.2   14.50   14.50  0e+00
   134217728      16777216   float     sum   5619.2   11.94   11.94  0e+00   4905.4   13.68   13.68  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.44552 

Originally I found this issue while training with TensorFlow. I first submitted a bug to TensorFlow; here is the link: https://github.com/tensorflow/tensorflow/issues/40027

It shows that when I remove the NVLink bridge, the TF code runs well, and when I use the NVLink bridge but not NCCL, the TF code runs well too. But when I use both NCCL and the NVLink bridge, the system halts and I have to reboot.

sjeaugey commented 4 years ago

Could it be that you are running Xorg on one of the GPUs, so that when TF/NCCL runs, Xorg struggles to get access to the GPU? If so (and if it is easy to try), could you try running Xorg on the CPU's integrated GPU and see if the problem is still there?

I don't think NCCL or TF have much control over how the GPU is shared between CUDA compute and Xorg, so you might want to look at the driver configuration. Note however that sharing a GPU between Xorg and TF might hurt performance: every time Xorg gets a time slice of one GPU, it stops TF completely on that GPU, and eventually slows down the other GPU as well, since the two GPUs are synchronized. I don't see why NVLink would make a difference in behavior, aside from the fact that it might run much faster, hence make it harder for Xorg to get GPU resources.

AlexWang1900 commented 4 years ago

@sjeaugey I don't think the problem is insufficient GPU resources, because when I use "strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice())", which according to the TF developers does not call NCCL, the code runs well: GPU utilization is around 20% on both GPUs, and memory usage is low, around 1 GB out of 11 GB on each GPU.

$ nvidia-smi
Wed Jun 10 10:31:40 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   33C    P2    66W / 260W |   1272MiB / 11016MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   33C    P2    61W / 260W |    959MiB / 11019MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1086      G   /usr/lib/xorg/Xorg                            35MiB |
|    0      1727      G   /usr/lib/xorg/Xorg                            76MiB |
|    0      1926      G   /usr/bin/gnome-shell                         191MiB |
|    0      6898      C   /usr/bin/python3                             947MiB |
|    1      6898      C   /usr/bin/python3                             947MiB |
+-----------------------------------------------------------------------------+

But with the exact same code, when I use strategy = tf.distribute.MirroredStrategy(), whose default cross-device op is NCCL all-reduce, the system hangs. A sketch contrasting the two configurations is below.
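For clarity, here is a minimal sketch of the two configurations being compared (assuming TF 2.x; the model here is a placeholder, not the tutorial's actual network):

import tensorflow as tf

# Runs fine on this machine: reductions go through a single device, no NCCL.
strategy_no_nccl = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice())

# Hangs on this machine: the default cross-device ops use NCCL all-reduce.
strategy_nccl = tf.distribute.MirroredStrategy()

with strategy_nccl.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")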

And as I mentioned, if I remove the NVLink bridge hardware, I can run the code with both strategies, with no hanging.

And I have run many other TF programs on one GPU, where GPU utilization goes up to 98% and memory to 10.8 GB; no hang ever happened.

It only hangs since I bought and installed an NVLink bridge and started using NCCL.

I don't have an integrated GPU. As you know, I bought two NVIDIA 2080 Tis; how would I know I would need an integrated GPU??? LOL~~~~

I will try running the NCCL test and the TF code remotely, to see whether the problem is that NCCL can't run properly when a GPU is connected to a display monitor.

AlexWang1900 commented 4 years ago

@sjeaugey I ran the TF code mentioned above and on the original TF bug page (the code is here: https://github.com/tensorflow/docs/blob/master/site/en/tutorials/distribute/keras.ipynb) remotely from another PC, using Jupyter as a web server, and I disconnected the display cable from the 2080 Ti.

It's still hanging:

model.fit(train_dataset, epochs=1, callbacks=callbacks)
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

Then it stops. This code should run in less than 2 seconds, but after 30 minutes it is still hanging.

The normal output when not using NCCL looks like this:

model.fit(train_dataset, epochs=10, callbacks=callbacks)
Epoch 1/10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

469/469 [==============================] - ETA: 0s - accuracy: 0.9232 - loss: 0.2688
Learning rate for epoch 1 is 0.0010000000474974513
469/469 [==============================] - 2s 4ms/step - accuracy: 0.9232 - loss: 0.2688 - lr: 0.0010
Epoch 2/10
456/469 [============================>.] - ETA: 0s - accuracy: 0.9739 - loss: 0.0879
Learning rate for epoch 2 is 0.0010000000474974513
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9741 - loss: 0.0874 - lr: 0.0010
Epoch 3/10
452/469 [===========================>..] - ETA: 0s - accuracy: 0.9832 - loss: 0.0574
Learning rate for epoch 3 is 0.0010000000474974513
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9830 - loss: 0.0577 - lr: 0.0010
Epoch 4/10
467/469 [============================>.] - ETA: 0s - accuracy: 0.9900 - loss: 0.0368
Learning rate for epoch 4 is 9.999999747378752e-05
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9900 - loss: 0.0369 - lr: 1.0000e-04
Epoch 5/10
466/469 [============================>.] - ETA: 0s - accuracy: 0.9910 - loss: 0.0340
Learning rate for epoch 5 is 9.999999747378752e-05
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9910 - loss: 0.0340 - lr: 1.0000e-04
Epoch 6/10
457/469 [============================>.] - ETA: 0s - accuracy: 0.9914 - loss: 0.0325
Learning rate for epoch 6 is 9.999999747378752e-05
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9914 - loss: 0.0324 - lr: 1.0000e-04
Epoch 7/10
459/469 [============================>.] - ETA: 0s - accuracy: 0.9920 - loss: 0.0309
Learning rate for epoch 7 is 9.999999747378752e-05
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9920 - loss: 0.0307 - lr: 1.0000e-04
Epoch 8/10
456/469 [============================>.] - ETA: 0s - accuracy: 0.9930 - loss: 0.0282
Learning rate for epoch 8 is 9.999999747378752e-06
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9929 - loss: 0.0284 - lr: 1.0000e-05
Epoch 9/10
458/469 [============================>.] - ETA: 0s - accuracy: 0.9930 - loss: 0.0282
Learning rate for epoch 9 is 9.999999747378752e-06
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9930 - loss: 0.0281 - lr: 1.0000e-05
Epoch 10/10
454/469 [============================>.] - ETA: 0s - accuracy: 0.9929 - loss: 0.0281
Learning rate for epoch 10 is 9.999999747378752e-06
469/469 [==============================] - 1s 3ms/step - accuracy: 0.9929 - loss: 0.0279 - lr: 1.0000e-05

AlexWang1900 commented 4 years ago

@sjeaugey I rebooted the system without a display cable connected to the 2080 Ti, then SSHed to it from another PC over the intranet.

Then I started the Jupyter server remotely, and ran the NCCL test from another session, also remotely, using: $ ./all_reduce_perf -b 8 -e 16M -f 2 -g 2

At the same time, before the NCCL test finished running, I ran some simple code

import numpy as np
print(2+4)

through the Jupyter notebook, remotely. It is fine; it returns 6 immediately.

This shows that at least the CPU-DRAM-DMI-PCH-USB-network path (I am using a USB Wi-Fi card) is not hanging while the NCCL test runs remotely.

But when I started another remote session and ran the nvidia-smi command, it never returned anything.

After a while, the NCCL test session returned:

# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid   1711 on w-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid   1711 on w-system device  1 [0x02] GeForce RTX 2080 Ti
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum     7.46    0.00    0.00  0e+00     7.37    0.00    0.00  0e+00
          16             4   float     sum     7.43    0.00    0.00  0e+00     7.45    0.00    0.00  0e+00
          32             8   float     sum     7.39    0.00    0.00  0e+00     7.38    0.00    0.00  0e+00
          64            16   float     sum     7.52    0.01    0.01  0e+00     7.36    0.01    0.01  0e+00
         128            32   float     sum     7.55    0.02    0.02  0e+00     7.45    0.02    0.02  0e+00
         256            64   float     sum     7.57    0.03    0.03  0e+00     7.44    0.03    0.03  0e+00
         512           128   float     sum     7.61    0.07    0.07  0e+00     7.54    0.07    0.07  0e+00
        1024           256   float     sum     8.20    0.12    0.12  0e+00     8.06    0.13    0.13  0e+00
        2048           512   float     sum     8.54    0.24    0.24  0e+00     8.44    0.24    0.24  0e+00
        4096          1024   float     sum     9.92    0.41    0.41  0e+00     9.78    0.42    0.42  0e+00
        8192          2048   float     sum    12.10    0.68    0.68  0e+00    11.92    0.69    0.69  0e+00
       16384          4096   float     sum    14.77    1.11    1.11  0e+00    14.38    1.14    1.14  0e+00
       32768          8192   float     sum    16.97    1.93    1.93  0e+00    16.64    1.97    1.97  0e+00
       65536         16384   float     sum    21.45    3.06    3.06  0e+00    20.51    3.20    3.20  0e+00
      131072         32768   float     sum    35.61    3.68    3.68  0e+00    35.85    3.66    3.66  0e+00
      262144         65536   float     sum    42.46    6.17    6.17  0e+00    41.73    6.28    6.28  0e+00
      524288        131072   float     sum    50.70   10.34   10.34  0e+00    52.69    9.95    9.95  0e+00
     1048576        262144   float     sum    72.64   14.44   14.44  0e+00    72.58   14.45   14.45  0e+00
     2097152        524288   float     sum    121.7   17.23   17.23  0e+00    118.7   17.66   17.66  0e+00
     4194304       1048576   float     sum    215.4   19.47   19.47  0e+00    215.1   19.50   19.50  0e+00
     8388608       2097152   float     sum    400.1   20.96   20.96  0e+00    399.7   20.99   20.99  0e+00
    16777216       4194304   float     sum    779.8   21.51   21.51  0e+00    780.7   21.49   21.49  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 5.53153

Then I ran nvidia-smi again; it returned:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   38C    P0    62W / 260W |      0MiB / 11016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   38C    P0    28W / 260W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

It is clear that no Xorg or any other program is using the GPUs.

Then I ran the TF code mentioned above again, using NCCL, through the Jupyter notebook remotely. It is still hanging:

Epoch 1/20
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

AlexWang1900 commented 4 years ago

From the tests above we can conclude that while the NCCL test is running, it blocks the nvidia-smi command, so NCCL may be the root cause: when TF calls NCCL and then calls some other GPU API such as CUDA, that call is blocked by NCCL, and TF waits for CUDA to return, seemingly forever.
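A minimal sketch of this diagnostic (the sleep/timeout values and the relative path to the test binary are illustrative assumptions, not from the report):

import subprocess, time

# Start the long-running NCCL test (path assumed: the nccl-tests build dir).
test = subprocess.Popen(["./all_reduce_perf", "-b", "8", "-e", "16M", "-f", "2", "-g", "2"])
time.sleep(10)  # give the test time to enter its long-running phase
try:
    # If driver queries are blocked by the running collective, this times out.
    subprocess.run(["nvidia-smi"], timeout=30, check=True)
    print("nvidia-smi returned: driver queries are not blocked")
except subprocess.TimeoutExpired:
    print("nvidia-smi hung: driver queries are blocked while NCCL runs")
test.wait()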

sjeaugey commented 4 years ago

The first thing I'd try is to turn on persistence mode (nvidia-smi -pm 1). Then run the NCCL tests again and see if you still see any hang with nvidia-smi.

Also, it's quite surprising to see the NCCL tests run fine while TF hangs. Could you try setting NCCL_MAX_NCHANNELS=1 and see if the hang still happens? This will probably make performance worse but might help us understand what's happening.
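For a TF run, a minimal sketch of applying this suggestion (assuming the variable is set before TensorFlow creates its NCCL communicators):

import os

# NCCL reads its environment variables when the communicator is initialized,
# so set them before TensorFlow triggers any NCCL work.
os.environ["NCCL_MAX_NCHANNELS"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"  # optional: verbose NCCL logging

import tensorflow as tf  # import after setting the environment
strategy = tf.distribute.MirroredStrategy()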

AlexWang1900 commented 4 years ago

I ran these test cases:

Test case 1 (locally)

I turned on persistence mode (nvidia-smi -pm 1) and set the NCCL environment variables like this: NCCL_DEBUG=INFO NCCL_MAX_NCHANNELS=1 NCCL_P2P_DISABLE=0 NCCL_ALGO=Ring NCCL_IGNORE_CPU_AFFINITY=0 NCCL_NTHREADS=256 NCCL_DEBUG_SUBSYS=ALL

Then I ran "./all_reduce_perf -b 8 -e 128M -f 2 -g 2". Result: it finished after 15 minutes; originally (with all default variables and persistence mode off) it took about 20-30 minutes to return, and I don't know whether that is too long. The console first showed these lines, after less than 1 minute:

./all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   5637 on wangying-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid   5637 on wangying-system device  1 [0x02] GeForce RTX 2080 Ti
wangying-system:5637:5637 [0] NCCL INFO Bootstrap : Using [0]eno1:192.168.0.100<0>
wangying-system:5637:5637 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

wangying-system:5637:5637 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
wangying-system:5637:5637 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.0.100<0>
wangying-system:5637:5637 [0] NCCL INFO Using network Socket
NCCL version 2.6.4+cuda10.1

Then it showed these additional lines after about 3-4 minutes:


wangying-system:5637:5670 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:5637:5671 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring

Then, after about 10 minutes:

wangying-system:5637:5670 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:5637:5670 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/max_link_speed, ignoring
wangying-system:5637:5670 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/../max_link_speed, ignoring
wangying-system:5637:5670 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/max_link_width, ignoring
wangying-system:5637:5670 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/../max_link_width, ignoring
wangying-system:5637:5670 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:5637:5670 [0] NCCL INFO Attribute coll of node net not found
wangying-system:5637:5670 [0] NCCL INFO === System : maxWidth 21.0 ===
wangying-system:5637:5670 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
wangying-system:5637:5670 [0] NCCL INFO + PCI[12.0] - NIC/1F0
wangying-system:5637:5670 [0] NCCL INFO + PCI[6.0] - GPU/1000 (0)
wangying-system:5637:5670 [0] NCCL INFO              + NVL[21.0] - GPU/2000
wangying-system:5637:5670 [0] NCCL INFO + PCI[6.0] - GPU/2000 (1)
wangying-system:5637:5670 [0] NCCL INFO              + NVL[21.0] - GPU/1000
wangying-system:5637:5670 [0] NCCL INFO ==========================================
wangying-system:5637:5670 [0] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/2000 (1/21.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/6.000000/PHB) 
wangying-system:5637:5670 [0] NCCL INFO GPU/2000 :GPU/1000 (1/21.000000/NVL) GPU/2000 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/6.000000/PHB) 
wangying-system:5637:5670 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:5637:5670 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:5637:5670 [0] NCCL INFO Pattern 2, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:5637:5670 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:5637:5670 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:5637:5670 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:5637:5671 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:5637:5671 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/max_link_speed, ignoring
wangying-system:5637:5671 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/../max_link_speed, ignoring
wangying-system:5637:5671 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/max_link_width, ignoring
wangying-system:5637:5671 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/../max_link_width, ignoring
wangying-system:5637:5671 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:5637:5671 [1] NCCL INFO Attribute coll of node net not found
wangying-system:5637:5671 [1] NCCL INFO === System : maxWidth 21.0 ===
wangying-system:5637:5671 [1] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
wangying-system:5637:5671 [1] NCCL INFO + PCI[12.0] - NIC/1F0
wangying-system:5637:5671 [1] NCCL INFO + PCI[6.0] - GPU/1000 (0)
wangying-system:5637:5671 [1] NCCL INFO              + NVL[21.0] - GPU/2000
wangying-system:5637:5671 [1] NCCL INFO + PCI[6.0] - GPU/2000 (1)
wangying-system:5637:5671 [1] NCCL INFO              + NVL[21.0] - GPU/1000
wangying-system:5637:5671 [1] NCCL INFO ==========================================
wangying-system:5637:5671 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/2000 (1/21.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/6.000000/PHB) 
wangying-system:5637:5671 [1] NCCL INFO GPU/2000 :GPU/1000 (1/21.000000/NVL) GPU/2000 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/6.000000/PHB) 
wangying-system:5637:5671 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:5637:5671 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:5637:5671 [1] NCCL INFO Pattern 2, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:5637:5671 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:5637:5671 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:5637:5671 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:5637:5671 [1] NCCL INFO NCCL_MAX_NCHANNELS set by environment to 1.
wangying-system:5637:5671 [1] NCCL INFO NCCL_NTHREADS set by environment to 256.
wangying-system:5637:5670 [0] NCCL INFO Channel 00/01 :    0   1
wangying-system:5637:5671 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
wangying-system:5637:5670 [0] NCCL INFO Latency/AlgBw |    Tree/    LL |    Tree/ LL128 |    Tree/Simple |    Ring/    LL |    Ring/ LL128 |    Ring/Simple | CollNet/    LL | CollNet/ LL128 | CollNet/Simple |
wangying-system:5637:5671 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1
wangying-system:5637:5670 [0] NCCL INFO  Max NThreads |            256 |            640 |            256 |            256 |            640 |            256 |            256 |            640 |            256 |
wangying-system:5637:5671 [1] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 0.
wangying-system:5637:5670 [0] NCCL INFO     Broadcast |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.0/   5.2 |     6.1/   0.0 |    14.1/  21.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
wangying-system:5637:5670 [0] NCCL INFO        Reduce |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.0/   5.2 |     6.1/   0.0 |    14.1/  21.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
wangying-system:5637:5670 [0] NCCL INFO     AllGather |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.0/  10.5 |     6.1/   0.0 |    14.1/  42.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
wangying-system:5637:5670 [0] NCCL INFO ReduceScatter |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.0/  10.5 |     6.1/   0.0 |    14.1/  42.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
wangying-system:5637:5670 [0] NCCL INFO     AllReduce |     5.4/   0.0 |     8.2/   0.0 |    56.0/   0.0 |     4.4/   5.2 |     8.6/   0.0 |    19.8/  21.0 |    10.4/   0.0 |    13.2/   0.0 |    18.7/   0.0 |
wangying-system:5637:5670 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
wangying-system:5637:5670 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1
wangying-system:5637:5671 [1] NCCL INFO Ring 00 : 1[2000] -> 0[1000] via P2P/direct pointer
wangying-system:5637:5670 [0] NCCL INFO Ring 00 : 0[1000] -> 1[2000] via P2P/direct pointer
wangying-system:5637:5671 [1] NCCL INFO comm 0x7fcd7c000e00 rank 1 nranks 2 cudaDev 1 busId 2000 - Init COMPLETE
wangying-system:5637:5670 [0] NCCL INFO comm 0x7fcd71196170 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
wangying-system:5637:5637 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fcdbe000000 recvbuff 0x7fcdb6000000 count 33554432 datatype 7 op 0 root 0 comm 0x7fcd71196170 [nranks=2] stream 0x558925d9ff80
wangying-system:5637:5637 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fcd96000000 recvbuff 0x7fcd8e000000 count 33554432 datatype 7 op 0 root 0 comm 0x7fcd7c000e00 [nranks=2] stream 0x558926518e00
wangying-system:5637:5637 [0] NCCL INFO Launch mode Group/CGMD
wangying-system:5637:5637 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fcdbe000000 recvbuff 0x7fcdb6000000 count 33554432 datatype 7 op 0 root 0 comm 0x7fcd71196170 [nranks=2] stream 0x558925d9ff80

[... many more AllReduce/recvbuff lines like the above ...]

The system still hangs; I can move the mouse but cannot click anything. (The bandwidth is OK: it drops about 10%, still better than without P2P communication.)

My questions are:
1) Why are there two topology detection rounds?
2) Why does NCCL detect node1 for the CPU? There is only one CPU in the system.
3) Why does the output show Tree algorithm tuning when I specified only the Ring algorithm?

Then I ran the TF code; it still hangs. After 20 minutes I rebooted, and now I am editing this post.

AlexWang1900 commented 4 years ago

Test case 2

Basically test case 1 again, run remotely, to check whether nvidia-smi is blocked remotely.

Results: 1) nvidia-smi was blocked, and this time I waited until the NCCL test finished; nvidia-smi returned its result only after the NCCL test finished:

Thu Jun 11 11:48:00 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P8    18W / 260W |    578MiB / 11016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
|  0%   35C    P8    12W / 260W |    580MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1762      C   ./all_reduce_perf                            569MiB |
|    1      1762      C   ./all_reduce_perf                            579MiB |
+-----------------------------------------------------------------------------+

This shows that ./all_reduce_perf blocks nvidia-smi from returning.

2) The simple "print(2+4)" code was still not blocked. 3) The TF code still hangs:

Epoch 1/2
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

But from the jupyter console I found something: after starting to run the TF code, the jupyter console returns this before hanging:

"""
2020-06-11 11:51:38.062792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[I 11:51:49.184 NotebookApp] Saving file at /Projects/projects_emotion/2GPUtest.ipynb
2020-06-11 11:52:10.157455: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.157946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-11 11:52:10.158035: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.158483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:02:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-11 11:52:10.158666: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 11:52:10.177880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-11 11:52:10.189116: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-11 11:52:10.192011: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-11 11:52:10.212995: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-11 11:52:10.217748: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-11 11:52:10.268347: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-11 11:52:10.268658: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.270592: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.272525: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.274385: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.276123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-06-11 11:52:10.302103: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-11 11:52:10.331471: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2899885000 Hz
2020-06-11 11:52:10.332315: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f8b0c000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-11 11:52:10.332379: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-11 11:52:10.721908: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.731219: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.731791: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5ae5ce0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-11 11:52:10.731807: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-11 11:52:10.731812: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-11 11:52:10.732502: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.732981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-11 11:52:10.733028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.733490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:02:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-11 11:52:10.733516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 11:52:10.733528: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-11 11:52:10.733540: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-11 11:52:10.733551: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-11 11:52:10.733563: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-11 11:52:10.733574: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-11 11:52:10.733585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-11 11:52:10.733621: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.734107: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.734594: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.735081: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.735539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-06-11 11:52:10.735774: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 11:52:10.736833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-11 11:52:10.736845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 1
2020-06-11 11:52:10.736852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N Y
2020-06-11 11:52:10.736856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1:   Y N
2020-06-11 11:52:10.737195: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.737690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.738180: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.738649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10201 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-06-11 11:52:10.739231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:52:10.739710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10203 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2020-06-11 11:52:12.123844: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2020-06-11 11:52:12.124277: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1363] Profiler found 2 GPUs
2020-06-11 11:52:12.128452: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
2020-06-11 11:52:12.229408: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1408] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-06-11 11:52:12.231351: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1447] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-06-11 11:52:12.231641: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-06-11 11:52:13.036926: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-11 11:52:13.649714: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
wangying-system:2441:2494 [1] NCCL INFO Bootstrap : Using [0]wlx90e6ba5a46fa:192.168.31.167<0>
wangying-system:2441:2494 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

wangying-system:2441:2494 [1] external/nccl_archive/src/misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
wangying-system:2441:2494 [1] NCCL INFO NET/Socket : Using [0]wlx90e6ba5a46fa:192.168.31.167<0>
NCCL version 2.5.7+cudaCUDA_MAJOR.CUDA_MINOR
wangying-system:2441:2569 [1] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 0.
wangying-system:2441:2569 [1] NCCL INFO Setting affinity for GPU 1 to 3f
wangying-system:2441:2568 [0] NCCL INFO Setting affinity for GPU 0 to 3f
"""

It shows TF is calling NCCL version 2.5.7+cudaCUDA_MAJOR.CUDA_MINOR. I only downloaded NCCL once, version 2.6.4, so where does this 2.5.7 come from??? And each time I run NCCL-TEST, it shows NCCL version 2.6.4.
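(For reference, the version of the system copy can be double-checked from its header with something like the command below; the /usr/include path is an assumption based on the deb install, so adjust it if nccl.h landed elsewhere.)

# Hypothetical check of the installed NCCL version; adjust the header path if needed
grep -A2 "define NCCL_MAJOR" /usr/include/nccl.h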

AlexWang1900 commented 4 years ago

Update: good news!

I was editing this post and left the system hanging; after about 30 minutes, the TF code finished running!!!!

The jupyter console returned more output after the information above:

wangying-system:2441:2568 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
wangying-system:2441:2568 [0] NCCL INFO /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12/1-12:1.0 -> 0/0/0/0
wangying-system:2441:2568 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
wangying-system:2441:2568 [0] NCCL INFO === System : maxWidth 21 maxSpeed 21 ===
wangying-system:2441:2568 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF
wangying-system:2441:2568 [0] NCCL INFO + PCI[12] - GPU/1000 (0)
wangying-system:2441:2568 [0] NCCL INFO             + NVL[21] - GPU/2000
wangying-system:2441:2568 [0] NCCL INFO + PCI[12] - GPU/2000 (1)
wangying-system:2441:2568 [0] NCCL INFO             + NVL[21] - GPU/1000
wangying-system:2441:2568 [0] NCCL INFO + PCI[12] - PCI/0
wangying-system:2441:2568 [0] NCCL INFO             + PCI[12] - NIC/0
wangying-system:2441:2568 [0] NCCL INFO ==========================================
wangying-system:2441:2568 [0] NCCL INFO GPU/1000 :GPU/1000 (0/5000/0) GPU/2000 (1/21/1) CPU/FFFFFFFFFFFFFFFF (1/12/2)
wangying-system:2441:2568 [0] NCCL INFO GPU/2000 :GPU/1000 (1/21/1) GPU/2000 (0/5000/0) CPU/FFFFFFFFFFFFFFFF (1/12/2)
wangying-system:2441:2568 [0] NCCL INFO Pattern 2, crossNic 0, nChannels 1, speed 21/21, nvlink 1, type 1, sameChannels 1
wangying-system:2441:2568 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:2441:2568 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 21/21, nvlink 1, type 1, sameChannels 1
wangying-system:2441:2568 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:2441:2569 [1] NCCL INFO /sys/devices/pci0000:00/0000:00:14.0/usb1/1-12/1-12:1.0 -> 0/0/0/0
wangying-system:2441:2569 [1] NCCL INFO === System : maxWidth 21 maxSpeed 21 ===
wangying-system:2441:2569 [1] NCCL INFO CPU/FFFFFFFFFFFFFFFF
wangying-system:2441:2569 [1] NCCL INFO + PCI[12] - GPU/1000 (0)
wangying-system:2441:2569 [1] NCCL INFO             + NVL[21] - GPU/2000
wangying-system:2441:2569 [1] NCCL INFO + PCI[12] - GPU/2000 (1)
wangying-system:2441:2569 [1] NCCL INFO             + NVL[21] - GPU/1000
wangying-system:2441:2569 [1] NCCL INFO + PCI[12] - PCI/0
wangying-system:2441:2569 [1] NCCL INFO             + PCI[12] - NIC/0
wangying-system:2441:2569 [1] NCCL INFO ==========================================
wangying-system:2441:2569 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000/0) GPU/2000 (1/21/1) CPU/FFFFFFFFFFFFFFFF (1/12/2)
wangying-system:2441:2569 [1] NCCL INFO GPU/2000 :GPU/1000 (1/21/1) GPU/2000 (0/5000/0) CPU/FFFFFFFFFFFFFFFF (1/12/2)
wangying-system:2441:2569 [1] NCCL INFO Pattern 2, crossNic 0, nChannels 1, speed 21/21, nvlink 1, type 1, sameChannels 1
wangying-system:2441:2569 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:2441:2569 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 21/21, nvlink 1, type 1, sameChannels 1
wangying-system:2441:2569 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:2441:2569 [1] NCCL INFO NCCL_MAX_NCHANNELS set by environment to 1.
wangying-system:2441:2569 [1] NCCL INFO NCCL_NTHREADS set by environment to 256.
wangying-system:2441:2569 [1] NCCL INFO Threads per block : 256/640/256
wangying-system:2441:2568 [0] NCCL INFO Channel 00/01 :    0   1
wangying-system:2441:2569 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64
wangying-system:2441:2568 [0] NCCL INFO Threads per block : 256/640/256
wangying-system:2441:2569 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1
wangying-system:2441:2568 [0] NCCL INFO Latency/AlgBw | Tree/    LL | Tree/ LL128 | Tree/Simple | Ring/    LL | Ring/ LL128 | Ring/Simple |
wangying-system:2441:2568 [0] NCCL INFO     Broadcast |    0.0/  0.0|    0.0/  0.0|    0.0/  0.0|    4.0/  5.2|    6.1/  0.0|   14.1/ 21.0|
wangying-system:2441:2568 [0] NCCL INFO        Reduce |    0.0/  0.0|    0.0/  0.0|    0.0/  0.0|    4.0/  5.2|    6.1/  0.0|   14.1/ 21.0|
wangying-system:2441:2568 [0] NCCL INFO     AllGather |    0.0/  0.0|    0.0/  0.0|    0.0/  0.0|    4.0/ 10.5|    6.1/  0.0|   14.1/ 42.0|
wangying-system:2441:2568 [0] NCCL INFO ReduceScatter |    0.0/  0.0|    0.0/  0.0|    0.0/  0.0|    4.0/ 10.5|    6.1/  0.0|   14.1/ 42.0|
wangying-system:2441:2568 [0] NCCL INFO     AllReduce |    5.4/  0.0|    8.2/  0.0|   56.0/  0.0|    4.4/  5.2|    8.6/  0.0|   19.8/ 21.0|
wangying-system:2441:2568 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
wangying-system:2441:2568 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1
wangying-system:2441:2569 [1] NCCL INFO Ring 00 : 1[2000] -> 0[1000] via P2P/direct pointer
wangying-system:2441:2568 [0] NCCL INFO Ring 00 : 0[1000] -> 1[2000] via P2P/direct pointer
wangying-system:2441:2569 [1] NCCL INFO comm 0x7f8a5c0aa760 rank 1 nranks 2 cudaDev 1 busId 2000 - Init COMPLETE
wangying-system:2441:2568 [0] NCCL INFO comm 0x7f8a582d31f0 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
wangying-system:2441:2565 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f8a80c00000 recvbuff 0x7f8a80c00000 count 347146 datatype 7 op 0 root 0 comm 0x7f8a582d31f0 [nranks=2] stream 0x7f898f832140
wangying-system:2441:2565 [0] NCCL INFO Launch mode Group/CGMD
wangying-system:2441:2566 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f8a816a4000 recvbuff 0x7f8a816a4000 count 347146 datatype 7 op 0 root 0 comm 0x7f8a5c0aa760 [nranks=2] stream 0x7f898f8443d0
2020-06-11 12:18:16.116142: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2020-06-11 12:18:16.116204: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1408] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-06-11 12:18:16.116223: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1447] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_NOT_INITIALIZED
wangying-system:2441:2565 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f8a80c00000 recvbuff 0x7f8a80c00000 count 347146 datatype 7 op 0 root 0 comm 0x7f8a582d31f0 [nranks=2] stream 0x7f898f832140
wangying-system:2441:2566 [1] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f8a816a4000 recvbuff 0x7f8a816a4000 count 347146 datatype 7 op 0 root 0 comm 0x7f8a5c0aa760 [nranks=2] stream 0x7f898f8443d0
2020-06-11 12:18:16.122365: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-06-11 12:18:16.123972: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216]  GpuTracer has collected 0 callback api events and 0 activity events.
2020-06-11 12:18:16.130630: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: ./logs/train/plugins/profile/2020_06_11_12_18_16
2020-06-11 12:18:16.134600: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to ./logs/train/plugins/profile/2020_06_11_12_18_16/wangying-system.trace.json.gz
2020-06-11 12:18:16.136308: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0 ms

2020-06-11 12:18:16.136868: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: ./logs/train/plugins/profile/2020_06_11_12_18_16Dumped tool data for overview_page.pb to ./logs/train/plugins/profile/2020_06_11_12_18_16/wangying-system.overview_page.pb
Dumped tool data for input_pipeline.pb to ./logs/train/plugins/profile/2020_06_11_12_18_16/wangying-system.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to ./logs/train/plugins/profile/2020_06_11_12_18_16/wangying-system.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to ./logs/train/plugins/profile/2020_06_11_12_18_16/wangying-system.kernel_stats.pb
....... skip many sendbuff and recvbuff lines...........

The jupyter web page returns:

Epoch 1/2
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
468/469 [============================>.] - ETA: 0s - loss: 0.2585 - accuracy: 0.9256
Learning rate for epoch 1 is 0.0010000000474974513
469/469 [==============================] - 2s 5ms/step - loss: 0.2584 - accuracy: 0.9257 - lr: 0.0010
Epoch 2/2
469/469 [==============================] - ETA: 0s - loss: 0.0865 - accuracy: 0.9742
Learning rate for epoch 2 is 0.0010000000474974513
469/469 [==============================] - 1s 3ms/step - loss: 0.0865 - accuracy: 0.9742 - lr: 0.0010

It finished successfully as expected, but it should have finished in less than 30 seconds, not about 30 minutes. Why did it take so long????

AlexWang1900 commented 4 years ago

From the two test cases above, there are these questions:

0. Why did it take 30 minutes for TF to run with NCCL, while the same code runs in less than 30 seconds without NCCL?
1. Why would NCCL do a topology discovery every time it starts? It is like Windows doing a disk scan every time I start playing Minesweeper.
2. Why does TF call NCCL 2.5.7, while NCCL-TEST calls 2.6.4?
3. Why are there two topology discovery rounds? I think nowadays all direct physical links inside computer/communication systems are full duplex, so the topology must be an undirected graph, acyclic or not; one discovery is good enough. And even if the GPUs are not connected in any way, users should be warned, not start NCCL at all.
4. Why does NCCL detect node 1 for the CPU? There is only one CPU in the system.
5. Why does the tuning output show the Tree algorithm, when I specified only the Ring algorithm?

Above all, it is not clear whether one MUST use NCCL to benefit from the NVLINK bridge. I ran the CUDA utility test and it shows great speed between the 2 GPUs, and NCCL itself uses CUDA kernels to perform inter-GPU communication. If I use the "reduce to one device" strategy, can I still get the benefit of NVLINK?
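For reference, the "reduce to one device" variant I mean looks roughly like this in TF 2.x (a minimal sketch; tf.distribute.ReductionToOneDevice is the stock cross-device op that reduces on a single device instead of calling NCCL all-reduce, and the tiny model is just a placeholder):

import tensorflow as tf

# Sketch: MirroredStrategy that copies gradients to one device and reduces there,
# instead of doing an NCCL all-reduce across the two GPUs.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.ReductionToOneDevice())

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
    model.compile(optimizer="adam", loss="mse")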

AlexWang1900 commented 4 years ago

Test Case 3

I ran the TF code again locally to capture the full TF log and show the TIME STAMPS. It took 26 minutes from NCCL being called for the first time (20:08:36) to NCCL initialization complete (20:34:39):

[I 20:07:56.656 NotebookApp] Starting buffering for 2bb107c1-a1d3-4591-afe6-8a25e7f8b1eb:b9eeb086908244df81556b35a197df74
[I 20:07:56.862 NotebookApp] Kernel restarted: 2bb107c1-a1d3-4591-afe6-8a25e7f8b1eb
[I 20:07:57.223 NotebookApp] Restoring connection for 2bb107c1-a1d3-4591-afe6-8a25e7f8b1eb:b9eeb086908244df81556b35a197df74
[I 20:07:57.223 NotebookApp] Replaying 6 buffered messages
2020-06-11 20:07:59.162870: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-11 20:08:31.226164: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.226618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-11 20:08:31.226666: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.227050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties: 
pciBusID: 0000:02:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-11 20:08:31.227193: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 20:08:31.245745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-11 20:08:31.256383: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-11 20:08:31.258705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-11 20:08:31.277503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-11 20:08:31.280465: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-11 20:08:31.326590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-11 20:08:31.326900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.328778: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.329402: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.329837: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.330253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-06-11 20:08:31.340112: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-11 20:08:31.363991: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2899885000 Hz
2020-06-11 20:08:31.364279: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f6948000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-11 20:08:31.364293: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-11 20:08:31.544719: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.553529: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.553991: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4936cb0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-11 20:08:31.554006: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-11 20:08:31.554011: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-11 20:08:31.554559: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.554944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-11 20:08:31.554990: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.555362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties: 
pciBusID: 0000:02:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-11 20:08:31.555392: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 20:08:31.555405: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-11 20:08:31.555416: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-11 20:08:31.555428: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-11 20:08:31.555440: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-11 20:08:31.555451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-11 20:08:31.555464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-11 20:08:31.555502: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.556281: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.556757: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.557163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.557532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-06-11 20:08:31.557755: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-11 20:08:31.558879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-11 20:08:31.558894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 1 
2020-06-11 20:08:31.558899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N Y 
2020-06-11 20:08:31.558903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1:   Y N 
2020-06-11 20:08:31.559222: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.559632: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.560041: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.560425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9974 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-06-11 20:08:31.561006: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 20:08:31.561396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10202 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2020-06-11 20:08:33.012283: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2020-06-11 20:08:33.012750: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1363] Profiler found 2 GPUs
2020-06-11 20:08:33.016798: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
2020-06-11 20:08:33.117544: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1408] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-06-11 20:08:33.119155: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1447] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-06-11 20:08:33.119578: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-06-11 20:08:35.945095: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
**2020-06-11 20:08:36.611558: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7**
wangying-system:7436:7497 [0] NCCL INFO Bootstrap : Using [0]eno1:192.168.0.100<0>
wangying-system:7436:7497 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

wangying-system:7436:7497 [0] external/nccl_archive/src/misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
wangying-system:7436:7497 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.0.100<0>
NCCL version 2.5.7+cudaCUDA_MAJOR.CUDA_MINOR
wangying-system:7436:7593 [0] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 1.
wangying-system:7436:7593 [0] NCCL INFO Setting affinity for GPU 0 to 3f
wangying-system:7436:7594 [1] NCCL INFO Setting affinity for GPU 1 to 3f
[I 20:09:38.772 NotebookApp] Saving file at /Projects/projects_emotion/2GPUtest.ipynb
wangying-system:7436:7593 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
wangying-system:7436:7593 [0] NCCL INFO /sys/devices/pci0000:00/0000:00:1f.6 -> 0/0/0/0
wangying-system:7436:7593 [0] NCCL INFO NCCL_P2P_DISABLE set by environment to 0.
wangying-system:7436:7593 [0] NCCL INFO === System : maxWidth 21 maxSpeed 21 ===
wangying-system:7436:7593 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF
wangying-system:7436:7593 [0] NCCL INFO + PCI[12] - GPU/1000 (0)
wangying-system:7436:7593 [0] NCCL INFO             + NVL[21] - GPU/2000
wangying-system:7436:7593 [0] NCCL INFO + PCI[12] - GPU/2000 (1)
wangying-system:7436:7593 [0] NCCL INFO             + NVL[21] - GPU/1000
wangying-system:7436:7593 [0] NCCL INFO + PCI[12] - NIC/0
wangying-system:7436:7593 [0] NCCL INFO ==========================================
wangying-system:7436:7593 [0] NCCL INFO GPU/1000 :GPU/1000 (0/5000/0) GPU/2000 (1/21/1) CPU/FFFFFFFFFFFFFFFF (1/12/2) 
wangying-system:7436:7593 [0] NCCL INFO GPU/2000 :GPU/1000 (1/21/1) GPU/2000 (0/5000/0) CPU/FFFFFFFFFFFFFFFF (1/12/2) 
wangying-system:7436:7593 [0] NCCL INFO Pattern 2, crossNic 0, nChannels 1, speed 21/21, nvlink 1, type 1, sameChannels 1
wangying-system:7436:7593 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:7436:7593 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 21/21, nvlink 1, type 1, sameChannels 1
wangying-system:7436:7593 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:7436:7594 [1] NCCL INFO /sys/devices/pci0000:00/0000:00:1f.6 -> 0/0/0/0
wangying-system:7436:7594 [1] NCCL INFO === System : maxWidth 21 maxSpeed 21 ===
wangying-system:7436:7594 [1] NCCL INFO CPU/FFFFFFFFFFFFFFFF
wangying-system:7436:7594 [1] NCCL INFO + PCI[12] - GPU/1000 (0)
wangying-system:7436:7594 [1] NCCL INFO             + NVL[21] - GPU/2000
wangying-system:7436:7594 [1] NCCL INFO + PCI[12] - GPU/2000 (1)
wangying-system:7436:7594 [1] NCCL INFO             + NVL[21] - GPU/1000
wangying-system:7436:7594 [1] NCCL INFO + PCI[12] - NIC/0
wangying-system:7436:7594 [1] NCCL INFO ==========================================
wangying-system:7436:7594 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000/0) GPU/2000 (1/21/1) CPU/FFFFFFFFFFFFFFFF (1/12/2) 
wangying-system:7436:7594 [1] NCCL INFO GPU/2000 :GPU/1000 (1/21/1) GPU/2000 (0/5000/0) CPU/FFFFFFFFFFFFFFFF (1/12/2) 
wangying-system:7436:7594 [1] NCCL INFO Pattern 2, crossNic 0, nChannels 1, speed 21/21, nvlink 1, type 1, sameChannels 1
wangying-system:7436:7594 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:7436:7594 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 21/21, nvlink 1, type 1, sameChannels 1
wangying-system:7436:7594 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:7436:7593 [0] NCCL INFO NCCL_MAX_NCHANNELS set by environment to 1.
wangying-system:7436:7593 [0] NCCL INFO Channel 00/01 :    0   1
wangying-system:7436:7593 [0] NCCL INFO NCCL_NTHREADS set by environment to 256.
wangying-system:7436:7593 [0] NCCL INFO Threads per block : 256/640/256
wangying-system:7436:7593 [0] NCCL INFO Latency/AlgBw | Tree/    LL | Tree/ LL128 | Tree/Simple | Ring/    LL | Ring/ LL128 | Ring/Simple |
wangying-system:7436:7593 [0] NCCL INFO     Broadcast |    0.0/  0.0|    0.0/  0.0|    0.0/  0.0|    4.0/  5.2|    6.1/  0.0|   14.1/  0.0|
wangying-system:7436:7593 [0] NCCL INFO        Reduce |    0.0/  0.0|    0.0/  0.0|    0.0/  0.0|    4.0/  5.2|    6.1/  0.0|   14.1/  0.0|
wangying-system:7436:7593 [0] NCCL INFO     AllGather |    0.0/  0.0|    0.0/  0.0|    0.0/  0.0|    4.0/ 10.5|    6.1/  0.0|   14.1/  0.0|
wangying-system:7436:7593 [0] NCCL INFO ReduceScatter |    0.0/  0.0|    0.0/  0.0|    0.0/  0.0|    4.0/ 10.5|    6.1/  0.0|   14.1/  0.0|
wangying-system:7436:7593 [0] NCCL INFO     AllReduce |    5.4/  0.0|    8.2/  0.0|   56.0/  0.0|    4.4/  5.2|    8.6/  0.0|   19.8/  0.0|
wangying-system:7436:7593 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
wangying-system:7436:7593 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1
wangying-system:7436:7594 [1] NCCL INFO Threads per block : 256/640/256
wangying-system:7436:7594 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64
wangying-system:7436:7594 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1
wangying-system:7436:7594 [1] NCCL INFO Ring 00 : 1[2000] -> 0[1000] via P2P/direct pointer
wangying-system:7436:7593 [0] NCCL INFO Ring 00 : 0[1000] -> 1[2000] via P2P/direct pointer
wangying-system:7436:7593 [0] NCCL INFO comm 0x7f6924f08540 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
wangying-system:7436:7594 [1] NCCL INFO comm 0x7f692000d240 rank 1 nranks 2 cudaDev 1 busId 2000 - Init COMPLETE
wangying-system:7436:7590 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f68b2c00000 recvbuff 0x7f68b2c00000 count 347146 datatype 7 op 0 root 0 comm 0x7f6924f08540 [nranks=2] stream 0x7f67ec093f90
wangying-system:7436:7591 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f68b36a4000 recvbuff 0x7f68b36a4000 count 347146 datatype 7 op 0 root 0 comm 0x7f692000d240 [nranks=2] stream 0x7f67ec0944b0
wangying-system:7436:7590 [0] NCCL INFO Launch mode Group/CGMD
**2020-06-11 20:34:39.023120: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.**
2020-06-11 20:34:39.023195: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1408] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_NOT_INITIALIZED
2020-06-11 20:34:39.023212: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1447] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_NOT_INITIALIZED
wangying-system:7436:7591 [1] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f68b36a4000 recvbuff 0x7f68b36a4000 count 347146 datatype 7 op 0 root 0 comm 0x7f692000d240 [nranks=2] stream 0x7f67ec0944b0
wangying-system:7436:7590 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f68b2c00000 recvbuff 0x7f68b2c00000 count 347146 datatype 7 op 0 root 0 comm 0x7f6924f08540 [nranks=2] stream 0x7f67ec093f90
2020-06-11 20:34:39.035513: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-06-11 20:34:39.038820: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216]  GpuTracer has collected 0 callback api events and 0 activity events.
2020-06-11 20:34:39.045918: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: ./logs/train/plugins/profile/2020_06_11_20_34_39
2020-06-11 20:34:39.051031: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to ./logs/train/plugins/profile/2020_06_11_20_34_39/wangying-system.trace.json.gz
2020-06-11 20:34:39.053236: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0 ms

2020-06-11 20:34:39.053822: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: ./logs/train/plugins/profile/2020_06_11_20_34_39Dumped tool data for overview_page.pb to ./logs/train/plugins/profile/2020_06_11_20_34_39/wangying-system.overview_page.pb
Dumped tool data for input_pipeline.pb to ./logs/train/plugins/profile/2020_06_11_20_34_39/wangying-system.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to ./logs/train/plugins/profile/2020_06_11_20_34_39/wangying-system.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to ./logs/train/plugins/profile/2020_06_11_20_34_39/wangying-system.kernel_stats.pb

wangying-system:7436:7590 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f68b2c00000 recvbuff 0x7f68b2c00000 count 347146 datatype 7 op 0 root 0 comm 0x7f6924f08540 [nranks=2] stream 0x7f67ec093f90
wangying-system:7436:7591 [1] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f68b36a4000 recvbuff 0x7f68b36a4000 count 347146 datatype 7 op 0 root 0 comm 0x7f692000d240 [nranks=2] stream 0x7f67ec0944b0
wangying-system:7436:7590 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f68b2c00000 recvbuff 0x7f68b2c00000 count 347146 datatype 7 op 0 root 0 comm 0x7f6924f08540 [nranks=2] stream 0x7f67ec093f90
....... skip many sendbuff and recvbuff lines...........
....... skip many sendbuff and recvbuff lines...........
wangying-system:7436:7590 [0] NCCL INFO AllReduce: opCount 3a9 sendbuff 0x7f6861845800 recvbuff 0x7f6861845800 count 347146 datatype 7 op 0 root 0 comm 0x7f6924f08540 [nranks=2] stream 0x7f67ec093f90
wangying-system:7436:7591 [1] NCCL INFO AllReduce: opCount 3a9 sendbuff 0x7f68b36a4000 recvbuff 0x7f68b36a4000 count 347146 datatype 7 op 0 root 0 comm 0x7f692000d240 [nranks=2] stream 0x7f67ec0944b0
2020-06-11 20:34:45.313519: W tensorflow/python/util/util.cc:329] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
[I 20:36:38.845 NotebookApp] Saving file at /Projects/projects_emotion/2GPUtest.ipynb

I ran NCCL-TEST again; after NCCL starts to discover the topology, right before the system hangs, the task monitor shows this: [Image]

It is a typical pattern of multiple processes using a blocking-queue model, playing consumer and producer, without being really parallelized: one process is waiting for the other. I think some other model, like a LinkedBlockingQueue, could be much faster and let them run concurrently.

sjeaugey commented 4 years ago

Trying to answer all of your questions:

  1. It is not normal for the NCCL init to take 10-30 minutes. It should be at most a second. Something is going wrong here, and this is what I'm trying to figure out.
  2. Each rank discovers the topology every time because it's simpler, and it is not supposed to take a significant amount of time.
  3. TF embeds a version of NCCL of its choosing, so it is not using the NCCL library on the system.
  4. There is only one topology discovery round, but both GPUs do it (in parallel). And again, it should be near instantaneous.
  5. Actually NCCL detects node "-1", which is not normal; it should be 0. That said, it should not be a big problem, especially on a single-CPU system.
  6. Trees and Rings are created regardless of what is specified in NCCL_ALGO. That setting is taken into account in the tuning phase, where we decide which one to use. I agree this could be done differently.

Let's remove the following environment variables moving forward (they should not be needed and could cause issues):

NCCL_MAX_NCHANNELS=1
NCCL_P2P_DISABLE=0
NCCL_NTHREADS=256
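For example, assuming they were exported in the shell that launches the test or the notebook, they can be cleared with:

# Clear the NCCL tuning variables before rerunning
unset NCCL_MAX_NCHANNELS NCCL_P2P_DISABLE NCCL_NTHREADS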

Given the CPU usage shown here, I'm wondering whether CUDA would be recompiling NCCL every time, considering it doesn't have support for your GPU architecture. Since you mentioned you are using NCCL 2.6.4, may I ask how you installed it? Did you get binaries from developer.nvidia.com or did you recompile NCCL yourself? If so, did you add any NVCC_GENCODE option to your make command?

If that is indeed your problem, one solution could be to recompile NCCL adding NVCC_GENCODE=-gencode=arch=compute_75,code=sm_75 to the make command line. That should solve the problem with the NCCL tests. But that might not solve your problem with tensorflow as it is embedding its own version of NCCL. The other solution would be to make sure the code we recompile is cached (in ~/.nv/ComputeCache) so that we only recompile it once. From the doc at https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars, setting CUDA_CACHE_MAXSIZE=4294967296 CUDA_CACHE_DISABLE=0 should ensure that.
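For example, in the shell that launches the tests or jupyter, that would be something like:

# Keep the JIT compute cache enabled and large enough to hold the recompiled code
export CUDA_CACHE_DISABLE=0
export CUDA_CACHE_MAXSIZE=4294967296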

AlexWang1900 commented 4 years ago

@sjeaugey Thanks a lot!!!! "How did I install NCCL?" I got the binaries from developer.nvidia.com.

I went to the download page for NCCL 2.6.4 for CUDA 10.1 (the highest version supported by TF 2.2) and downloaded the package for Ubuntu 18.04 LTS (the closest to my 20.04 LTS): "nccl-repo-ubuntu1804-2.6.4-ga-cuda10.1_1-1_amd64.deb" (48.2 MB). I ran sudo dpkg -i nccl-repo-ubuntu1804-2.6.4-ga-cuda10.1_1-1_amd64.deb, and it printed messages about decompressing, copying and setting things up. Then I ran the TF code, which froze the system.

Then I went to the same download page again and downloaded the package for universal Linux: "nccl_2.6.4-1+cuda10.1_x86_64.txz" (45.2 MB). I decompressed it and copied the lib files and include files into CUDA 10.1's lib64 and include directories. Then I ran the TF code, which froze the system again, so I submitted a bug on TF's GitHub.
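For reference, the manual install amounted to roughly this (a sketch; the extracted directory name and the /usr/local/cuda-10.1 prefix are assumptions based on the default layout):

tar xvf nccl_2.6.4-1+cuda10.1_x86_64.txz
sudo cp -P nccl_2.6.4-1+cuda10.1_x86_64/lib/libnccl* /usr/local/cuda-10.1/lib64/
sudo cp nccl_2.6.4-1+cuda10.1_x86_64/include/nccl*.h /usr/local/cuda-10.1/include/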

I checked the ~/.nv/ComputeCache directory: it has an index file and 9 subdirectories, each containing 2 subdirectories and one file. All files show a last-modified date of 2020-05-28, which I think is the day I first ran it.
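(For reference, a plain recursive listing is enough to check this:)

ls -lR ~/.nv/ComputeCache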

sjeaugey commented 4 years ago

Thanks.

20.04 is not supported yet, so I'd like to make sure it's not caused by a weird 18.04/20.04 interaction.

Could you recompile NCCL and launch the perf test again:

# Get NCCL and build it
git clone https://github.com/NVIDIA/nccl
cd nccl
# Install g++-7 if not installed already
sudo apt install g++-7
make -j 8 test.build CXX=g++-7 NVCC_GENCODE=-gencode=arch=compute_75,code=sm_75
# Use the newly built library
export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH
# Run the tests again
cd /path/to/nccl-tests/build/
./all_reduce_perf -b 8 -e 128M -f 2 -g 2
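# Optional sanity check (assuming the tests link libnccl dynamically, the default):
# confirm the binary resolves to the newly built library
ldd ./all_reduce_perf | grep libnccl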
AlexWang1900 commented 4 years ago

I deleted everything inside ~/.nv/ComputeCache, then ran nccl-tests again; nothing was generated in that directory. I then ran all the commands above to build NCCL 2.7.3 and ran all_reduce_perf; nothing was generated in the ~/.nv/ComputeCache directory either.

It still takes 25 minutes to finish. Here is the result:

$./all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid  12262 on wangying-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid  12262 on wangying-system device  1 [0x02] GeForce RTX 2080 Ti
wangying-system:12262:12262 [0] NCCL INFO Bootstrap : Using [0]eno1:192.168.0.100<0>
wangying-system:12262:12262 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

wangying-system:12262:12262 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
wangying-system:12262:12262 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.0.100<0>
wangying-system:12262:12262 [0] NCCL INFO Using network Socket
NCCL version 2.7.3+cuda10.1
wangying-system:12262:12276 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:12262:12275 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:12262:12276 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:12262:12275 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:12262:12276 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/max_link_speed, ignoring
wangying-system:12262:12276 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/../max_link_speed, ignoring
wangying-system:12262:12276 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/max_link_width, ignoring
wangying-system:12262:12276 [1] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/../max_link_width, ignoring
wangying-system:12262:12276 [1] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:12262:12276 [1] NCCL INFO Attribute coll of node net not found
wangying-system:12262:12276 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
wangying-system:12262:12276 [1] NCCL INFO === System : maxWidth 21.0 ===
wangying-system:12262:12276 [1] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
wangying-system:12262:12276 [1] NCCL INFO + PCI[12.0] - NIC/1F0
wangying-system:12262:12276 [1] NCCL INFO + PCI[6.0] - GPU/1000 (0)
wangying-system:12262:12276 [1] NCCL INFO              + NVL[21.0] - GPU/2000
wangying-system:12262:12276 [1] NCCL INFO + PCI[6.0] - GPU/2000 (1)
wangying-system:12262:12276 [1] NCCL INFO              + NVL[21.0] - GPU/1000
wangying-system:12262:12276 [1] NCCL INFO ==========================================
wangying-system:12262:12276 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/2000 (1/21.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/6.000000/PHB) 
wangying-system:12262:12276 [1] NCCL INFO GPU/2000 :GPU/1000 (1/21.000000/NVL) GPU/2000 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/6.000000/PHB) 
wangying-system:12262:12276 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:12262:12276 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:12262:12276 [1] NCCL INFO Pattern 2, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:12262:12276 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:12262:12276 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:12262:12276 [1] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:12262:12275 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/max_link_speed, ignoring
wangying-system:12262:12275 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/../max_link_speed, ignoring
wangying-system:12262:12275 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/max_link_width, ignoring
wangying-system:12262:12275 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:1f.6/../max_link_width, ignoring
wangying-system:12262:12275 [0] NCCL INFO Topology detection : could not read /sys/devices/system/node/node-1/cpumap, ignoring
wangying-system:12262:12275 [0] NCCL INFO Attribute coll of node net not found
wangying-system:12262:12275 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
wangying-system:12262:12275 [0] NCCL INFO === System : maxWidth 21.0 ===
wangying-system:12262:12275 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF (1/1/2)
wangying-system:12262:12275 [0] NCCL INFO + PCI[12.0] - NIC/1F0
wangying-system:12262:12275 [0] NCCL INFO + PCI[6.0] - GPU/1000 (0)
wangying-system:12262:12275 [0] NCCL INFO              + NVL[21.0] - GPU/2000
wangying-system:12262:12275 [0] NCCL INFO + PCI[6.0] - GPU/2000 (1)
wangying-system:12262:12275 [0] NCCL INFO              + NVL[21.0] - GPU/1000
wangying-system:12262:12275 [0] NCCL INFO ==========================================
wangying-system:12262:12275 [0] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/2000 (1/21.000000/NVL) CPU/FFFFFFFFFFFFFFFF (1/6.000000/PHB) 
wangying-system:12262:12275 [0] NCCL INFO GPU/2000 :GPU/1000 (1/21.000000/NVL) GPU/2000 (0/5000.000000/LOC) CPU/FFFFFFFFFFFFFFFF (1/6.000000/PHB) 
wangying-system:12262:12275 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:12262:12275 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:12262:12275 [0] NCCL INFO Pattern 2, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:12262:12275 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:12262:12275 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 21.000000/21.000000, type NVL/PIX, sameChannels 1
wangying-system:12262:12275 [0] NCCL INFO  0 : GPU/0 GPU/1
wangying-system:12262:12276 [1] NCCL INFO NCCL_MAX_NCHANNELS set by environment to 1.
wangying-system:12262:12276 [1] NCCL INFO NCCL_NTHREADS set by environment to 256.
wangying-system:12262:12275 [0] NCCL INFO Channel 00/01 :    0   1
wangying-system:12262:12276 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
wangying-system:12262:12275 [0] NCCL INFO Latency/AlgBw |    Tree/    LL |    Tree/ LL128 |    Tree/Simple |    Ring/    LL |    Ring/ LL128 |    Ring/Simple | CollNet/    LL | CollNet/ LL128 | CollNet/Simple |
wangying-system:12262:12275 [0] NCCL INFO  Max NThreads |            256 |            640 |            256 |            256 |            640 |            256 |            256 |            640 |            256 |
wangying-system:12262:12275 [0] NCCL INFO     Broadcast |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.1/   7.0 |    11.9/   0.0 |    11.8/  21.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
wangying-system:12262:12275 [0] NCCL INFO        Reduce |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.1/   5.2 |    11.9/   0.0 |    11.8/  21.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
wangying-system:12262:12275 [0] NCCL INFO     AllGather |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.1/  14.0 |    11.9/   0.0 |    11.8/  42.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
wangying-system:12262:12275 [0] NCCL INFO ReduceScatter |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.1/  14.0 |    11.9/   0.0 |    11.8/  42.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
wangying-system:12262:12275 [0] NCCL INFO     AllReduce |     5.4/   2.5 |     6.8/   0.0 |    56.0/   9.4 |     4.5/   5.2 |    13.8/   0.0 |    15.2/  21.0 |    10.4/   1.6 |    11.8/   0.0 |    18.7/   9.4 |
wangying-system:12262:12275 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
wangying-system:12262:12275 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1
wangying-system:12262:12275 [0] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 1.
wangying-system:12262:12276 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1
wangying-system:12262:12275 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via P2P/direct pointer
wangying-system:12262:12276 [1] NCCL INFO Channel 00 : 1[2000] -> 0[1000] via P2P/direct pointer
wangying-system:12262:12276 [1] NCCL INFO 1 coll channels, 1 p2p channels, 1 p2p channels per peer
wangying-system:12262:12275 [0] NCCL INFO 1 coll channels, 1 p2p channels, 1 p2p channels per peer
wangying-system:12262:12275 [0] NCCL INFO comm 0x7efbe5263900 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
wangying-system:12262:12276 [1] NCCL INFO comm 0x7efbe8000e00 rank 1 nranks 2 cudaDev 1 busId 2000 - Init COMPLETE
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
wangying-system:12262:12262 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7efc34000000 recvbuff 0x7efc2c000000 count 33554432 datatype 7 op 0 root 0 comm 0x7efbe5263900 [nranks=2] stream 0x563ad6fa4ee0
wangying-system:12262:12262 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7efc0c000000 recvbuff 0x7efc04000000 count 33554432 datatype 7 op 0 root 0 comm 0x7efbe8000e00 [nranks=2] stream 0x563ad771dd20
wangying-system:12262:12262 [0] NCCL INFO Launch mode Group/CGMD
wangying-system:12262:12262 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7efc34000000 recvbuff 0x7efc2c000000 count 33554432 datatype 7 op 0 root 0 comm 0x7efbe5263900 [nranks=2] stream 0x563ad6fa4ee0
wangying-system:12262:12262 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7efc0c000000 recvbuff 0x7efc04000000 count 33554432 datatype 7 op 0 root 0 comm 0x7efbe8000e00 [nranks=2] stream 0x563ad771dd20
wangying-system:12262:12262 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7efc34000000 recvbuff 0x7efc2c000000 count 33554432 datatype 7 op 0 root 0 comm 0x7efbe5263900 [nranks=2] stream 0x563ad6fa4ee0
..... many lines of sendbuff and recvbuff.......
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 5.39004 
AlexWang1900 commented 4 years ago

UPDATE: I think I found a workaround. By changing some NCCL code, NCCL-TEST now takes less than 2 s to finish; it may need some further tests.

sjeaugey commented 4 years ago

Thanks for the update, and please let us know what workaround you found!

Your experiments didn't confirm my theory about recompilation, so you can forget about JIT recompilation/compute cache -- it's probably something completely different.

AlexWang1900 commented 4 years ago

Hi~~ I have wrapped it up, here we go: I added many traces and enabled TRACE logging, and finally located the function that causes the hang: ncclTopoGetXmlFromGpu, more precisely the part that calls NVML to check NVLink.

So I changed the code inside the function ncclTopoGetXmlFromGpu in xml.cc like this: I set maxNvLinks = 1 and call only wrapNvmlDeviceGetNvLinkCapability inside the loop. This is just to speed up debugging; the phenomenon is the same with wrapNvmlDeviceGetNvLinkState and wrapNvmlDeviceGetNvLinkRemotePciInfo.


  struct ncclXmlNode* nvlNode = NULL;
  NCCLCHECK(xmlGetSub(pciNode, "nvlink", &nvlNode));
  if (nvlNode == NULL) {
    // NVML NVLink detection
    int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : 12;

    if (maxNvLinks > 0 && nvmlDev == NULL) {
      WARN("No NVML device handle. Skipping nvlink detection.\n");
      maxNvLinks = 0;
    }
    maxNvLinks = 1;//add by wangying
    TRACE(NCCL_ENV, "inside ncclTopoGetXmlFromGpu_2 %d,%d",maxNvLinks,sm);//addd by wangying

    for (int l=0; l<maxNvLinks; ++l) {
      // Check whether we can use this NVLink for P2P
      unsigned canP2P;
      if ((wrapNvmlDeviceGetNvLinkCapability(nvmlDev, l, NVML_NVLINK_CAP_P2P_SUPPORTED, &canP2P) != ncclSuccess) || !canP2P) continue;
      TRACE(NCCL_ENV, "inside ncclTopoGetXmlFromGpu_3 %d",canP2P);//addd by wangying
      // Make sure the Nvlink is up. The previous call should have trained the link.
      //nvmlEnableState_t isActive;
      //if ((wrapNvmlDeviceGetNvLinkState(nvmlDev, l, &isActive) != ncclSuccess) || (isActive != NVML_FEATURE_ENABLED)) continue;

      // Try to figure out what's on the other side of the NVLink
      //nvmlPciInfo_t remoteProc;
      //if (wrapNvmlDeviceGetNvLinkRemotePciInfo(nvmlDev, l, &remoteProc) != ncclSuccess) continue;

      // Make a lower case copy of the bus ID for calling ncclDeviceType
      // PCI system path is in lower case
      /*
      char* p = remoteProc.busId;
      char lowerId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
      for (int c=0; c<NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE; c++) {
        lowerId[c] = tolower(p[c]);
        if (p[c] == 0) break;
      }
      TRACE(NCCL_ENV, "inside ncclTopoGetXmlFromGpu_3 ");//addd by wangying
      NCCLCHECK(xmlGetSubKv(gpuNode, "nvlink", &nvlNode, "target", lowerId));
      if (nvlNode == NULL) {
        NCCLCHECK(xmlAddNode(xml, gpuNode, "nvlink", &nvlNode));
        NCCLCHECK(xmlSetAttr(nvlNode, "target", lowerId));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", 1));
      } else {
        int count;
        NCCLCHECK(xmlGetAttrInt(nvlNode, "count", &count));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", count+1));
      }
      */
    }

  }

  // Fill target classes
  TRACE(NCCL_ENV, "inside ncclTopoGetXmlFromGpu_4,%d ",gpuNode->nSubs);//addd by wangying

From the log, it took a minute for wrapNvmlDeviceGetNvLinkCapability(nvmlDev, l, NVML_NVLINK_CAP_P2P_SUPPORTED, &canP2P) to return, so I added more traces inside it, like this:

ncclResult_t wrapNvmlDeviceGetNvLinkCapability(nvmlDevice_t device, unsigned int link,
    nvmlNvLinkCapability_t capability, unsigned int *capResult) {
    TRACE(NCCL_ENV, "enter wrapNvmlDeviceGetNvLinkCapability %d",link);//addd by wangying
  if (nvmlInternalDeviceGetNvLinkCapability == NULL) {
    /* Do not warn, this symbol is optional. */
      TRACE(NCCL_ENV, "exit wrapNvmlDeviceGetNvLinkCapability_1 ");//addd by wangying
    return ncclInternalError;
  }
  nvmlReturn_t ret;
  TRACE(NCCL_ENV, "inside wrapNvmlDeviceGetNvLinkCapability_1 %d",link);//addd by wangying
  NVMLLOCKCALL(nvmlInternalDeviceGetNvLinkCapability(device, link, capability, capResult), ret);
  TRACE(NCCL_ENV, "inside wrapNvmlDeviceGetNvLinkCapability_2 %d",ret);//addd by wangying
  if (ret != NVML_SUCCESS) {
      TRACE(NCCL_ENV, "inside wrapNvmlDeviceGetNvLinkCapability_3 %d",ret);//addd by wangying
    if (ret != NVML_ERROR_NOT_SUPPORTED)
      INFO(NCCL_INIT,"nvmlDeviceGetNvLinkCapability() failed: %s ",
          nvmlInternalErrorString(ret));
    TRACE(NCCL_ENV, "inside wrapNvmlDeviceGetNvLinkCapability_4 %d",ret);//addd by wangying
    return ncclSystemError;
  }
  TRACE(NCCL_ENV, "inside wrapNvmlDeviceGetNvLinkCapability_5 %d",ret);//addd by wangying
  return ncclSuccess;
}

To rule out a mutex-lock problem, i.e. some other thread blocking nvmlInternalDeviceGetNvLinkCapability(), I changed nvmlwrap.h like this. This way the call never blocks waiting on the lock; if the lock is already held, it returns NVML_ERROR_IN_USE immediately.

// The NVML library doesn't appear to be thread safe
#include <pthread.h>
extern pthread_mutex_t nvmlLock;
#define NVMLLOCK() pthread_mutex_lock(&nvmlLock)
#define NVMLUNLOCK() pthread_mutex_unlock(&nvmlLock)
#define NVMLTRYLOCK() pthread_mutex_trylock(&nvmlLock)

#define NVMLLOCKCALL(cmd, ret) do {                      \
    ret = NVML_ERROR_IN_USE;                                \
    if(0 ==NVMLTRYLOCK()) {                             \
    ret = cmd;                                           \
    NVMLUNLOCK();                                       \
    }                                               \
} while(false)

With all the changes above, and the following environment variables: NCCL_DEBUG=TRACE NCCL_IGNORE_CPU_AFFINITY=1 NCCL_DEBUG_SUBSYS=ALL NCCL_DEBUG_FILE=TRACEINFO_2.%p

I ran NCCL-TEST again; the log looks like this:


wangying-system:7991:8021 [0] 374.669620 ncclTopoGetXmlFromGpu:583 NCCL TRACE inside ncclTopoGetXmlFromGpu_1 0

wangying-system:7991:8021 [0] 374.705588 ncclTopoGetXmlFromGpu:612 NCCL TRACE inside ncclTopoGetXmlFromGpu_2 1,75

wangying-system:7991:8021 [0] 374.717875 wrapNvmlDeviceGetNvLinkCapability:254 NCCL TRACE enter wrapNvmlDeviceGetNvLinkCapability 0

wangying-system:7991:8021 [0] 374.729553 wrapNvmlDeviceGetNvLinkCapability:261 NCCL TRACE inside wrapNvmlDeviceGetNvLinkCapability_1 0
wangying-system:7991:8021 [0] 60381.859371 wrapNvmlDeviceGetNvLinkCapability:263 NCCL TRACE inside wrapNvmlDeviceGetNvLinkCapability_2 0
wangying-system:7991:8021 [0] 60381.896030 wrapNvmlDeviceGetNvLinkCapability:272 NCCL TRACE inside wrapNvmlDeviceGetNvLinkCapability_5 0
wangying-system:7991:8021 [0] 60381.916536 ncclTopoGetXmlFromGpu:618 NCCL TRACE inside ncclTopoGetXmlFromGpu_3 1
wangying-system:7991:8021 [0] 60381.921225 ncclTopoGetXmlFromGpu:653 NCCL TRACE inside ncclTopoGetXmlFromGpu_4,0 
wangying-system:7991:8021 [0] 60381.925482 ncclTopoGetXmlFromGpu:668 NCCL TRACE finally exit ncclTopoGetXmlFromGpu,0 

It shows the call to nvmlInternalDeviceGetNvLinkCapability cost 60 seconds and, surprisingly, returned successfully, with canP2P equal to 1 according to the trace: "wangying-system:7991:8021 [0] 60381.916536 ncclTopoGetXmlFromGpu:618 NCCL TRACE inside ncclTopoGetXmlFromGpu_3 1"

So the original long runtime is caused by the NVML NVLink query functions: they are functional, but the latency is far too high. Each time ncclTopoGetXmlFromGpu is called, maxNvLinks is 6, so it issues 6 * 3 NVML NVLink query calls, and with 2 threads that is 6 * 3 * 2 = 36 calls, or about 36 minutes at 60 s each. The actual time is less than that, so maybe some calls fail and the rest of the maxNvLinks loop is skipped.

The workaround is to comment out the "for (int l=0; l<maxNvLinks; ++l) {" loop to skip the NVLink check; the test speed is not affected.
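
A minimal sketch of that change, assuming the ncclTopoGetXmlFromGpu body quoted above (forcing maxNvLinks to 0 is equivalent to commenting the loop out):

    int maxNvLinks = 0; // workaround sketch; was: (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : 12
    for (int l=0; l<maxNvLinks; ++l) {
      // never entered: the slow nvmlDeviceGetNvLink* capability/state/remote-PCI
      // queries are not issued at all
    }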

$ ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   7991 on wangying-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid   7991 on wangying-system device  1 [0x02] GeForce RTX 2080 Ti
NCCL version 2.7.3+cuda10.1
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2   float     sum    32.53    0.00    0.00  0e+00    32.53    0.00    0.00  0e+00
          16             4   float     sum    32.52    0.00    0.00  0e+00    32.43    0.00    0.00  0e+00
          32             8   float     sum    32.52    0.00    0.00  0e+00    32.55    0.00    0.00  0e+00
          64            16   float     sum    32.50    0.00    0.00  0e+00    32.46    0.00    0.00  0e+00
         128            32   float     sum    32.52    0.00    0.00  0e+00    32.62    0.00    0.00  0e+00
         256            64   float     sum    32.79    0.01    0.01  0e+00    32.60    0.01    0.01  0e+00
         512           128   float     sum    32.57    0.02    0.02  0e+00    32.72    0.02    0.02  0e+00
        1024           256   float     sum    32.86    0.03    0.03  0e+00    32.60    0.03    0.03  0e+00
        2048           512   float     sum    32.86    0.06    0.06  0e+00    32.67    0.06    0.06  0e+00
        4096          1024   float     sum    32.74    0.13    0.13  0e+00    32.87    0.12    0.12  0e+00
        8192          2048   float     sum    33.35    0.25    0.25  0e+00    33.53    0.24    0.24  0e+00
       16384          4096   float     sum    33.49    0.49    0.49  0e+00    33.37    0.49    0.49  0e+00
       32768          8192   float     sum    34.17    0.96    0.96  0e+00    33.98    0.96    0.96  0e+00
       65536         16384   float     sum    34.10    1.92    1.92  0e+00    34.15    1.92    1.92  0e+00
      131072         32768   float     sum    34.62    3.79    3.79  0e+00    34.18    3.83    3.83  0e+00
      262144         65536   float     sum    36.66    7.15    7.15  0e+00    36.37    7.21    7.21  0e+00
      524288        131072   float     sum    46.45   11.29   11.29  0e+00    46.40   11.30   11.30  0e+00
     1048576        262144   float     sum    68.89   15.22   15.22  0e+00    69.10   15.17   15.17  0e+00
     2097152        524288   float     sum    116.6   17.99   17.99  0e+00    115.8   18.11   18.11  0e+00
     4194304       1048576   float     sum    213.2   19.67   19.67  0e+00    211.7   19.82   19.82  0e+00
     8388608       2097152   float     sum    420.7   19.94   19.94  0e+00    486.4   17.25   17.25  0e+00
    16777216       4194304   float     sum    991.1   16.93   16.93  0e+00    779.2   21.53   21.53  0e+00
    33554432       8388608   float     sum   1643.9   20.41   20.41  0e+00   1540.9   21.78   21.78  0e+00
    67108864      16777216   float     sum   3023.0   22.20   22.20  0e+00   3366.7   19.93   19.93  0e+00
   134217728      33554432   float     sum   6317.1   21.25   21.25  0e+00   5971.2   22.48   22.48  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.23946 

Hope it helps.

And NVML is a library called by NCCL, so I wasn't able to trace inside the NVML NVLink functions.

So the question for today is: WHY are the NVML NVLink functions SO SLOW???

AlexWang1900 commented 4 years ago

Even more updates: I decided to test NVML under Windows, with Windows 10 2004, CUDA 10.1, and NVIDIA driver 446.14.

I used the example code that comes with the Windows version of CUDA 10.1, and I compiled it with Visual Studio 2017:

/***************************************************************************\
|*                                                                           *|
|*      Copyright 2010-2016 NVIDIA Corporation.  All rights reserved.        *|
|*                                                                           *|
|*   NOTICE TO USER:                                                         *|
|*                                                                           *|
|*   This source code is subject to NVIDIA ownership rights under U.S.       *|
|*   and international Copyright laws.  Users and possessors of this         *|
|*   source code are hereby granted a nonexclusive, royalty-free             *|
|*   license to use this code in individual and commercial software.         *|
|*                                                                           *|
|*   NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE     *|
|*   CODE FOR ANY PURPOSE. IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR         *|
|*   IMPLIED WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH      *|
|*   REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF         *|
|*   MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR          *|
|*   PURPOSE. IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL,            *|
|*   INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES          *|
|*   WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN      *|
|*   AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING     *|
|*   OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOURCE      *|
|*   CODE.                                                                   *|
|*                                                                           *|
|*   U.S. Government End Users. This source code is a "commercial item"      *|
|*   as that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting       *|
|*   of "commercial computer  software" and "commercial computer software    *|
|*   documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)   *|
|*   and is provided to the U.S. Government only as a commercial end item.   *|
|*   Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through        *|
|*   227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the       *|
|*   source code with only those rights set forth herein.                    *|
|*                                                                           *|
|*   Any use of this source code in individual and commercial software must  *|
|*   include, in the user documentation and internal comments to the code,   *|
|*   the above Disclaimer and U.S. Government End Users Notice.              *|
|*                                                                           *|
|*                                                                           *|
 \***************************************************************************/

//#include "cuda_kernels.h"
#include <stdio.h>
#include <nvml.h>
#include "windows.h"
#include <stdio.h>  
//#include <windows.h>  
//#include <tlhelp32.h>  
//#include <psapi.h>   
//#pragma comment(lib,"kernel32.lib")  
//#pragma comment(lib,"advapi32.lib")  

#pragma  comment(lib,"nvml.lib")

static const char * convertToComputeModeString(nvmlComputeMode_t mode)
{
    switch (mode)
    {
    case NVML_COMPUTEMODE_DEFAULT:
        return "Default";
    case NVML_COMPUTEMODE_EXCLUSIVE_THREAD:
        return "Exclusive_Thread";
    case NVML_COMPUTEMODE_PROHIBITED:
        return "Prohibited";
    case NVML_COMPUTEMODE_EXCLUSIVE_PROCESS:
        return "Exclusive Process";
    default:
        return "Unknown";
    }
}

int main(void)
{
    nvmlReturn_t result;
    unsigned int device_count, i;

    nvmlNvLinkCapability_t nvlinkCapa;

    // First initialize NVML library
    result = nvmlInit();
    if (NVML_SUCCESS != result)
    {
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(result));

        printf("Press ENTER to continue...\n");
        getchar();
        return 1;
    }

    result = nvmlDeviceGetCount(&device_count);
    if (NVML_SUCCESS != result)
    {
        printf("Failed to query device count: %s\n", nvmlErrorString(result));
        goto Error;
    }
    printf("Found %u device%s\n\n", device_count, device_count != 1 ? "s" : "");

    printf("Listing devices:\n");
    for (i = 0; i < device_count; i++)
    {
        nvmlDevice_t device;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        nvmlPciInfo_t pci;
        nvmlComputeMode_t compute_mode;

        // Query for device handle to perform operations on a device
        // You can also query device handle by other features like:
        // nvmlDeviceGetHandleBySerial
        // nvmlDeviceGetHandleByPciBusId
        result = nvmlDeviceGetHandleByIndex(i, &device);
        if (NVML_SUCCESS != result)
        {
            printf("Failed to get handle for device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }

        result = nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE);
        if (NVML_SUCCESS != result)
        {
            printf("Failed to get name of device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }

        // pci.busId is very useful to know which device physically you're talking to
        // Using PCI identifier you can also match nvmlDevice handle to CUDA device.
        result = nvmlDeviceGetPciInfo(device, &pci);
        if (NVML_SUCCESS != result)
        {
            printf("Failed to get pci info for device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }

        printf("%u. %s [%s]\n", i, name, pci.busId);

        //add by wangying
        nvlinkCapa = NVML_NVLINK_CAP_P2P_SUPPORTED;
        long curTime = GetTickCount();
        unsigned int canP2p;
        printf("Start calling NVLINK query:%d\n", curTime);
        result = nvmlDeviceGetNvLinkCapability(device, 0, nvlinkCapa, &canP2p);

        if (NVML_SUCCESS != result)
        {
            printf("Failed to get NVLINK info for device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }
        printf("SUPPORT P2P:%d\n", canP2p);
        long curTime1 = GetTickCount();
        printf("End calling NVLINK query:%d\n", curTime1);
        printf("Elapsed time:%d ms\n", curTime1 - curTime);
        //add by wangying end
        /*
        // This is a simple example on how you can modify GPU's state
        result = nvmlDeviceGetComputeMode(device, &compute_mode);
        if (NVML_ERROR_NOT_SUPPORTED == result)
            printf("\t This is not CUDA capable device\n");
        else if (NVML_SUCCESS != result)
        {
            printf("Failed to get compute mode for device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }
        else
        {
            // try to change compute mode
            printf("\t Changing device's compute mode from '%s' to '%s'\n",
                convertToComputeModeString(compute_mode),
                convertToComputeModeString(NVML_COMPUTEMODE_PROHIBITED));

            result = nvmlDeviceSetComputeMode(device, NVML_COMPUTEMODE_PROHIBITED);
            if (NVML_ERROR_NO_PERMISSION == result)
                printf("\t\t Need root privileges to do that: %s\n", nvmlErrorString(result));
            else if (NVML_ERROR_NOT_SUPPORTED == result)
                printf("\t\t Compute mode prohibited not supported. You might be running on\n"
                    "\t\t windows in WDDM driver model or on non-CUDA capable GPU.\n");
            else if (NVML_SUCCESS != result)
            {
                printf("\t\t Failed to set compute mode for device %u: %s\n", i, nvmlErrorString(result));
                goto Error;
            }
            else
            {
                printf("\t Restoring device's compute mode back to '%s'\n",
                    convertToComputeModeString(compute_mode));
                result = nvmlDeviceSetComputeMode(device, compute_mode);
                if (NVML_SUCCESS != result)
                {
                    printf("\t\t Failed to restore compute mode for device %u: %s\n", i, nvmlErrorString(result));
                    goto Error;
                }
            }
        }*/
    }

    result = nvmlShutdown();
    if (NVML_SUCCESS != result)
        printf("Failed to shutdown NVML: %s\n", nvmlErrorString(result));

    printf("All done.\n");

    printf("Press ENTER to continue...\n");
    getchar();
    return 0;

Error:
    result = nvmlShutdown();
    if (NVML_SUCCESS != result)
        printf("Failed to shutdown NVML: %s\n", nvmlErrorString(result));

    printf("Press ENTER to continue...\n");
    getchar();
    return 1;
}

The run results are here:

Found 2 devices

Listing devices:
0. GeForce RTX 2080 Ti [00000000:01:00.0]
Start calling NVLINK query:9166453
SUPPORT P2P:1
End calling NVLINK query:9170109
Elapsed time:3656 ms
1. GeForce RTX 2080 Ti [00000000:02:00.0]
Start calling NVLINK query:9170109
SUPPORT P2P:1
End calling NVLINK query:9170109
Elapsed time:0 ms
All done.
Press ENTER to continue...

This shows there is no issue with my hardware: under Windows, nvmlDeviceGetNvLinkCapability returns in 3.6 seconds, while under Linux it returns in exactly 60 seconds, which makes me think a timer is unintentionally firing to let the call return.

sjeaugey commented 4 years ago

Thank you for this detailed report.

So if I understand correctly, the simple test above takes 3.6s on Windows, and 60s on Ubuntu 20.04 (kernel 5.4.0, driver 440.64).

Please let me know if any version is incorrect.

AlexWang1900 commented 4 years ago

Both NVIDIA drivers are the newest released on the NVIDIA website:
1) 3.6 s on Windows, with NVIDIA driver 446.14
2) 60 s on Ubuntu 20.04 (Linux version 5.4.0-31-generic (buildd@lgw01-amd64-059)), with NVIDIA driver 440.82

sjeaugey commented 4 years ago

Actually I had to modify the test to make it run on Linux... could you confirm you ran that test on Linux? (Or some variation that replaces GetTickCount with something else.)

I'm trying to make sure it's a problem that's easy to reproduce with a simple test outside of NCCL, and that it's not caused by something else in NCCL.

AlexWang1900 commented 4 years ago

I can try to do it tomorrow; it's late, I have to sleep now :)

sjeaugey commented 4 years ago

No worries. Thanks a lot.

AlexWang1900 commented 4 years ago

Hi~~ I updated the driver to 440.100 and the problem still exists. Any good news???

sjeaugey commented 4 years ago

Hi, thanks for checking that. Can you confirm whether the reproducer you made for Windows also exhibits the issue on Linux? That would help reproduce the problem internally.

AlexWang1900 commented 4 years ago

Hi~~~~ Very bad news: I finally got it running!!!!

Here is the final result: it took 8 seconds to query the NVLink status, using the same code that took 3.6 s on Windows, with driver 440.100. (The code, example.c, is in the CUDA 10.1 folder; it is exactly the same as the code shipped with CUDA 10.1 on Windows, and comes with a makefile that is way out of date.)

wangying@wangying-system:/usr/local/cuda-10.1/nvml/example$ sudo ./example
Found 2 devices

Listing devices:
0. GeForce RTX 2080 Ti [00000000:01:00.0]
Start calling NVLINK query:676945073
SUPPORT P2P:1
End calling NVLINK query:676953080
Elapsed time:8007 ms
1. GeForce RTX 2080 Ti [00000000:02:00.0]
Start calling NVLINK query:676953087
SUPPORT P2P:1
End calling NVLINK query:676953087
Elapsed time:0 ms
All done.
Press ENTER to continue...

Here is the code. It is almost the same as the Windows 3.6 s version above, except that getTick() is the Linux replacement for GetTickCount().


#include <stdio.h>
#include <nvml.h>
#include <unistd.h>
#include <time.h>

unsigned getTick() {
    struct timespec ts;
    unsigned theTick = 0U;
    clock_gettime( CLOCK_REALTIME, &ts );
    theTick  = ts.tv_nsec / 1000000;
    theTick += ts.tv_sec * 1000;
    return theTick;
}
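
/* (Aside, a hypothetical variant: CLOCK_MONOTONIC is immune to wall-clock
   adjustments, so it is usually the safer choice for interval timing; it is
   not called anywhere in this example.) */
unsigned getTickMonotonic() {
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC, &ts );
    return ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}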

static const char * convertToComputeModeString(nvmlComputeMode_t mode)
{
    switch (mode)
    {
        case NVML_COMPUTEMODE_DEFAULT:
            return "Default";
        case NVML_COMPUTEMODE_EXCLUSIVE_THREAD:
            return "Exclusive_Thread";
        case NVML_COMPUTEMODE_PROHIBITED:
            return "Prohibited";
        case NVML_COMPUTEMODE_EXCLUSIVE_PROCESS:
            return "Exclusive Process";
        default:
            return "Unknown";
    }
}

int main(void)
{
    nvmlReturn_t result;
    unsigned int device_count, i;
    nvmlNvLinkCapability_t nvlinkCapa;
    // First initialize NVML library
    result = nvmlInit();
    if (NVML_SUCCESS != result)
    { 
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(result));

        printf("Press ENTER to continue...\n");
        getchar();
        return 1;
    }

    result = nvmlDeviceGetCount(&device_count);
    if (NVML_SUCCESS != result)
    { 
        printf("Failed to query device count: %s\n", nvmlErrorString(result));
        goto Error;
    }
    printf("Found %u device%s\n\n", device_count, device_count != 1 ? "s" : "");

    printf("Listing devices:\n");    
    for (i = 0; i < device_count; i++)
    {
        nvmlDevice_t device;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        nvmlPciInfo_t pci;
        nvmlComputeMode_t compute_mode;

        // Query for device handle to perform operations on a device
        // You can also query device handle by other features like:
        // nvmlDeviceGetHandleBySerial
        // nvmlDeviceGetHandleByPciBusId
        result = nvmlDeviceGetHandleByIndex(i, &device);
        if (NVML_SUCCESS != result)
        { 
            printf("Failed to get handle for device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }

        result = nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE);
        if (NVML_SUCCESS != result)
        { 
            printf("Failed to get name of device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }

        // pci.busId is very useful to know which device physically you're talking to
        // Using PCI identifier you can also match nvmlDevice handle to CUDA device.
        result = nvmlDeviceGetPciInfo(device, &pci);
        if (NVML_SUCCESS != result)
        { 
            printf("Failed to get pci info for device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }

        printf("%u. %s [%s]\n", i, name, pci.busId);

    //add by wangying
    nvlinkCapa = NVML_NVLINK_CAP_P2P_SUPPORTED;
    long curTime = getTick();//GetTickCount();
    unsigned int canP2p;
    printf("Start calling NVLINK query:%ld\n", curTime);
    result = nvmlDeviceGetNvLinkCapability(device, 0, nvlinkCapa, &canP2p);

    if (NVML_SUCCESS != result)
    {
        printf("Failed to get NVLINK info for device %u: %s\n", i, nvmlErrorString(result));
        goto Error;
    }
    printf("SUPPORT P2P:%d\n", canP2p);
    long curTime1 = getTick();//GetTickCount();
    printf("End calling NVLINK query:%ld\n", curTime1);
    printf("Elapsed time:%ld ms\n", (curTime1 - curTime));
    //add by wangying end

        // This is a simple example on how you can modify GPU's state
        /*
        result = nvmlDeviceGetComputeMode(device, &compute_mode);
        if (NVML_ERROR_NOT_SUPPORTED == result)
            printf("\t This is not CUDA capable device\n");
        else if (NVML_SUCCESS != result)
        { 
            printf("Failed to get compute mode for device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }
        else
        {
            // try to change compute mode
            printf("\t Changing device's compute mode from '%s' to '%s'\n", 
                    convertToComputeModeString(compute_mode), 
                    convertToComputeModeString(NVML_COMPUTEMODE_PROHIBITED));

            result = nvmlDeviceSetComputeMode(device, NVML_COMPUTEMODE_PROHIBITED);
            if (NVML_ERROR_NO_PERMISSION == result)
                printf("\t\t Need root privileges to do that: %s\n", nvmlErrorString(result));
            else if (NVML_ERROR_NOT_SUPPORTED == result)
                printf("\t\t Compute mode prohibited not supported. You might be running on\n"
                       "\t\t windows in WDDM driver model or on non-CUDA capable GPU.\n");
            else if (NVML_SUCCESS != result)
            {
                printf("\t\t Failed to set compute mode for device %u: %s\n", i, nvmlErrorString(result));
                goto Error;
            } 
            else
            {
                printf("\t Restoring device's compute mode back to '%s'\n", 
                        convertToComputeModeString(compute_mode));
                result = nvmlDeviceSetComputeMode(device, compute_mode);
                if (NVML_SUCCESS != result)
                { 
                    printf("\t\t Failed to restore compute mode for device %u: %s\n", i, nvmlErrorString(result));
                    goto Error;
                }
            }
        }
        */
    }

    result = nvmlShutdown();
    if (NVML_SUCCESS != result)
        printf("Failed to shutdown NVML: %s\n", nvmlErrorString(result));

    printf("All done.\n");

    printf("Press ENTER to continue...\n");
    getchar();
    return 0;

Error:
    result = nvmlShutdown();
    if (NVML_SUCCESS != result)
        printf("Failed to shutdown NVML: %s\n", nvmlErrorString(result));

    printf("Press ENTER to continue...\n");
    getchar();
    return 1;
}

And the most interesting part is here. I changed the makefile to this:

ARCH       := $(shell getconf LONG_BIT)
OS         := $(shell cat /etc/issue)
#RHEL_OS    := $(shell cat /etc/redhat-release)

# Gets Driver Branch
DRIVER_BRANCH := $(shell nvidia-smi | grep Driver | cut -f 3 -d' ' | cut -f 1 -d '.')

# Location of the CUDA Toolkit
CUDA_PATH ?= "/usr/local/cuda-10.1"

ifeq (${ARCH},$(filter ${ARCH},32 64))
    # If correct architecture and libnvidia-ml library is not found 
    # within the environment, build using the stub library

    ifneq (,$(findstring Ubuntu,$(OS)))
        DEB := $(shell dpkg -l | grep cuda)
        ifneq (,$(findstring cuda, $(DEB)))
            NVML_LIB := /usr/lib/x86_64-linux-gnu
            #/usr/lib/nvidia-$(DRIVER_BRANCH)
        else 
            NVML_LIB := /usr/lib/x86_64-linux-gnu
            #/lib${ARCH}
        endif
    endif

    ifneq (,$(findstring SUSE,$(OS)))
        RPM := $(shell rpm -qa cuda*)
        ifneq (,$(findstring cuda, $(RPM)))
            NVML_LIB := /usr/lib${ARCH}
        else
            NVML_LIB := /lib${ARCH}
        endif
    endif

    ifneq (,$(findstring CentOS,$(RHEL_OS)))
        RPM := $(shell rpm -qa cuda*)
        ifneq (,$(findstring cuda, $(RPM)))
            NVML_LIB := /usr/lib${ARCH}/nvidia
        else
            NVML_LIB := /lib${ARCH}
        endif
    endif

    ifneq (,$(findstring Red Hat,$(RHEL_OS)))
        RPM := $(shell rpm -qa cuda*)
        ifneq (,$(findstring cuda, $(RPM)))
            NVML_LIB := /usr/lib${ARCH}/nvidia
        else
            NVML_LIB := /lib${ARCH}
        endif
    endif

    ifneq (,$(findstring Fedora,$(RHEL_OS)))
        RPM := $(shell rpm -qa cuda*)
        ifneq (,$(findstring cuda, $(RPM)))
            NVML_LIB := /usr/lib${ARCH}/nvidia
        else
            NVML_LIB := /lib${ARCH}
        endif
    endif

else
    NVML_LIB := ../../lib${ARCH}/stubs/
    $(info "libnvidia-ml.so.1" not found, using stub library.)
endif

ifneq (${ARCH},$(filter ${ARCH},32 64))
    $(error Unknown architecture!)
endif

#NVML_LIB += ../lib/
NVML_LIB_L := $(addprefix -L , $(NVML_LIB))

CFLAGS  := -I ../../include -I ../include
LDFLAGS := -lnvidia-ml $(NVML_LIB_L)

all: example supportedVgpus
example: example.o
    $(CC) $< $(CFLAGS) $(LDFLAGS) -o $@
supportedVgpus: supportedVgpus.o
    $(CC) $< $(CFLAGS) $(LDFLAGS) -o $@
clean:
    -@rm -f example.o
    -@rm -f example
    -@rm -f supportedVgpus.o
    -@rm -f supportedVgpus

The makefile runs like this:

wangying@wangying-system:/usr/local/cuda-10.1/nvml/example$ sudo make example
cc -I ../../include -I ../include   -c -o example.o example.c
cc example.o -I ../../include -I ../include -lnvidia-ml -L /usr/lib/x86_64-linux-gnu -o example

The main problem is the libnvidia-ml.so file. I changed it in the makefile to /usr/lib/x86_64-linux-gnu/libnvidia-ml.so, and created a soft link libnvidia-ml.so pointing to libnvidia-ml.so.440.100 (1.5 MB); then it runs well.
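
(Something like the following, assuming the 440.100 driver file lives in that directory:)

$ cd /usr/lib/x86_64-linux-gnu
$ sudo ln -s libnvidia-ml.so.440.100 libnvidia-ml.so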

The original and only libnvidia-ml.so on my system is in /usr/lib/i386-linux-gnu, linked to libnvidia-ml.so.1, which in turn links to libnvidia-ml.so.440.100 (1.6 MB) in that folder; I guess that is the 32-bit version of libnvidia-ml.so.

It seems the problem might be: when a program makes an NVLink query, it may end up calling the 32-bit version of libnvidia-ml.so?

But when I compile and link example.c against the 32-bit version of libnvidia-ml.so, the Linux linker skips it, saying it is not compatible.

And I compiled NCCL 2.7.3 again and ran NCCL-TESTS; it still took 60 s for nvmlDeviceGetNvLinkCapability to return.

So the problem stays in NCCL, but I can't explain why nvmlDeviceGetNvLinkCapability takes 60 s in NCCL, 8 seconds in example.c, and 3.6 s on Windows.

sjeaugey commented 4 years ago

Interesting, although I'm still unsure what's going on here.

Can you run ldd example in both cases to see which NVML library your test is using? I would really be surprised if we're using the 32-bit version in the slow case.

The fact that NCCL is still slow could be because NCCL does not link against NVML; it loads the library dynamically, so it might end up using the wrong one.
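
For context, the dynamic loading looks roughly like the sketch below (not NCCL's actual code; nvmlInit_v2 is just one symbol known to be exported by libnvidia-ml), and dladdr can report which file a symbol was actually resolved from:

// check_nvml.c (hypothetical) -- build with: gcc check_nvml.c -ldl
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    // Open the NVML shared library by soname, the way a dynamic loader would.
    void *handle = dlopen("libnvidia-ml.so.1", RTLD_NOW);
    if (!handle) { fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

    // Resolve one exported symbol.
    void *sym = dlsym(handle, "nvmlInit_v2");
    if (!sym) { fprintf(stderr, "dlsym failed: %s\n", dlerror()); return 1; }

    // Print which file the symbol actually came from.
    Dl_info info;
    if (dladdr(sym, &info) && info.dli_fname)
        printf("nvmlInit_v2 resolved from: %s\n", info.dli_fname);

    dlclose(handle);
    return 0;
}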

AlexWang1900 commented 4 years ago

For which two cases should I run ldd example? For the Windows case I copied nvml.dll to the source folder.


sjeaugey commented 4 years ago

You mentioned that on Linux, first you could reproduce the issue (the test took 8 seconds to complete), then by changing the Makefile you could get it to run normally.

So in each case, it would be good to run ldd to understand which library it was using and why there is a difference.

AlexWang1900 commented 4 years ago

I may have confused you. I can only get example.c to run on Linux in one configuration: the case that took 8 seconds, with the makefile changed. I thought 8 seconds was normal compared to 60 seconds, and on the same level as 3.6 s on Windows.

Any other change, like switching the library version or location, does not get past gcc compilation and linking.

I think I should try to find out which version NCCL uses when the 60 s case happens. But how?


sjeaugey commented 4 years ago

Right, I got confused indeed.

8 seconds still seems high, but it is only a problem if it's 8 seconds per call. And it seems the call on GPU 1 did not take any time, so I'm wondering whether each NVLink test would take 8 seconds. Could you replace:

  // get time...
  result = nvmlDeviceGetNvLinkCapability(device, 0, nvlinkCapa, &canP2p);
  // print time...

by

        for (int link=0; link<NVML_NVLINK_MAX_LINKS; link++) {
             // get time ...
             result = nvmlDeviceGetNvLinkCapability(device, link, nvlinkCapa, &canP2p);
             // print time ...
        }

And see if it also takes 8 s for the other NVLinks?

All that said, I do not see why it would take 8 seconds inside the test and 60 seconds in NCCL, but the fact that it's almost a round number of seconds suggests we're hitting a timeout in a special error case.

Also we could not reproduce the problem on a similar setup (2x RTX 2080Ti, Ubuntu 20.04, Driver 440.100), so it might be an issue specific to your platform / device.

Note: on 20.04 you can compile with nvcc -ccbin gcc-7 to simplify the path detection (you need to install gcc-7 with apt if not done already), e.g.:

/usr/local/cuda/bin/nvcc -ccbin gcc-7 example.c -lnvidia-ml

That avoids the complex Makefile.

AlexWang1900 commented 4 years ago

Thanks a lot!!! 1. ldd result:

wangying@wangying-system:/usr/local/cuda/nvml/example$ ldd example
    linux-vdso.so.1 (0x00007ffdb25f0000)
    libnvidia-ml.so.1 => /lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007effb26ce000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007effb24dc000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007effb24b9000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007effb24b3000)
wangying@wangying-system:~/nccl-tests-master/build$ ldd all_reduce_perf
    linux-vdso.so.1 (0x00007ffe98b9a000)
    libcudart.so.10.1 => /usr/local/cuda/lib64/libcudart.so.10.1 (0x00007f02eb6c9000)
    libnccl.so.2 => /usr/local/cuda/lib64/libnccl.so.2 (0x00007f02e6841000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f02e6803000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f02e6622000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f02e6430000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f02eba47000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f02e642a000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f02e641d000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f02e62ce000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f02e62b3000)
wangying@wangying-system:/usr/local/cuda/lib64$  ldd libnccl.so.2.6.4
    linux-vdso.so.1 (0x00007ffd828fd000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fec666f3000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fec666e8000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fec666e2000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fec66501000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fec663b2000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fec66397000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fec661a3000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fec6b5bb000)

It seems NCCL has no link to libnvidia-ml.

2 , test "example" for other nvlink id greater than 0 I test 4 cases: case 1: test 6 times for nvlink id 0:

//add by wangying
    nvlinkCapa = NVML_NVLINK_CAP_P2P_SUPPORTED;
    unsigned int canP2p;
    for (int link=0; link<6; link++) {
        long curTime = getTick();//GetTickCount();

        printf("Start calling NVLINK query:%ld\n", curTime);
        result = nvmlDeviceGetNvLinkCapability(device, 0, nvlinkCapa, &canP2p);

        if (NVML_SUCCESS != result)
        {
            printf("Failed to get NVLINK info for device %u: %s\n", i, nvmlErrorString(result));
            //goto Error;
        }
        printf("SUPPORT P2P:%d\n", canP2p);
        long curTime1 = getTick();//GetTickCount();
        printf("End calling NVLINK query:%ld\n", curTime1);
        printf("Elapsed time:%ld ms\n", (curTime1 - curTime));
    }
    //add by wangying end

Result: each query cost 8 s for card 0, and 0 ms for card 1.

$ sudo ./example
Found 2 devices

Listing devices:
0. GeForce RTX 2080 Ti [00000000:01:00.0]
Start calling NVLINK query:830722690
SUPPORT P2P:1
End calling NVLINK query:830730697
Elapsed time:8007 ms
Start calling NVLINK query:830730697
SUPPORT P2P:1
End calling NVLINK query:830738706
Elapsed time:8009 ms
Start calling NVLINK query:830738706
SUPPORT P2P:1
End calling NVLINK query:830746713
Elapsed time:8007 ms
Start calling NVLINK query:830746714
SUPPORT P2P:1
End calling NVLINK query:830754722
Elapsed time:8008 ms
Start calling NVLINK query:830754722
SUPPORT P2P:1
End calling NVLINK query:830762729
Elapsed time:8007 ms
Start calling NVLINK query:830762729
SUPPORT P2P:1
End calling NVLINK query:830770737
Elapsed time:8008 ms
1. GeForce RTX 2080 Ti [00000000:02:00.0]
Start calling NVLINK query:830770746
SUPPORT P2P:1
End calling NVLINK query:830770747
Elapsed time:1 ms
Start calling NVLINK query:830770747
SUPPORT P2P:1
End calling NVLINK query:830770747
Elapsed time:0 ms
Start calling NVLINK query:830770747
SUPPORT P2P:1
End calling NVLINK query:830770747
Elapsed time:0 ms
Start calling NVLINK query:830770747
SUPPORT P2P:1
End calling NVLINK query:830770747
Elapsed time:0 ms
Start calling NVLINK query:830770747
SUPPORT P2P:1
End calling NVLINK query:830770747
Elapsed time:0 ms
Start calling NVLINK query:830770747
SUPPORT P2P:1
End calling NVLINK query:830770748
Elapsed time:1 ms
All done.
Press ENTER to continue...

Case 2: query NVLink IDs 0 to 5. The code changes to: result = nvmlDeviceGetNvLinkCapability(device, link, nvlinkCapa, &canP2p); Result: same as case 1:

$ sudo ./example
Found 2 devices

Listing devices:
0. GeForce RTX 2080 Ti [00000000:01:00.0]
Start calling NVLINK query:831109742
SUPPORT P2P:1
End calling NVLINK query:831117749
Elapsed time:8007 ms
Start calling NVLINK query:831117749
SUPPORT P2P:1
End calling NVLINK query:831125757
Elapsed time:8008 ms
Start calling NVLINK query:831125757
SUPPORT P2P:1
End calling NVLINK query:831133765
Elapsed time:8008 ms
Start calling NVLINK query:831133765
SUPPORT P2P:1
End calling NVLINK query:831141773
Elapsed time:8008 ms
Start calling NVLINK query:831141773
SUPPORT P2P:1
End calling NVLINK query:831149781
Elapsed time:8008 ms
Start calling NVLINK query:831149781
SUPPORT P2P:1
End calling NVLINK query:831157789
Elapsed time:8008 ms
1. GeForce RTX 2080 Ti [00000000:02:00.0]
Start calling NVLINK query:831157796
SUPPORT P2P:1
End calling NVLINK query:831157796
Elapsed time:0 ms
Start calling NVLINK query:831157796
SUPPORT P2P:1
End calling NVLINK query:831157796
Elapsed time:0 ms
Start calling NVLINK query:831157796
SUPPORT P2P:1
End calling NVLINK query:831157796
Elapsed time:0 ms
Start calling NVLINK query:831157796
SUPPORT P2P:1
End calling NVLINK query:831157796
Elapsed time:0 ms
Start calling NVLINK query:831157796
SUPPORT P2P:1
End calling NVLINK query:831157796
Elapsed time:0 ms
Start calling NVLINK query:831157796
SUPPORT P2P:1
End calling NVLINK query:831157796
Elapsed time:0 ms
All done.
Press ENTER to continue...

Case 3: like case 2, but query card 1 first, then card 0. Result: card 1 cost 0 ms, card 0 cost 8 s.

$ sudo ./example
Found 2 devices

Listing devices:
1. GeForce RTX 2080 Ti [00000000:02:00.0]
Start calling NVLINK query:831569359
SUPPORT P2P:1
End calling NVLINK query:831569359
Elapsed time:0 ms
Start calling NVLINK query:831569359
SUPPORT P2P:1
End calling NVLINK query:831569359
Elapsed time:0 ms
Start calling NVLINK query:831569359
SUPPORT P2P:1
End calling NVLINK query:831569359
Elapsed time:0 ms
Start calling NVLINK query:831569359
SUPPORT P2P:1
End calling NVLINK query:831569359
Elapsed time:0 ms
Start calling NVLINK query:831569359
SUPPORT P2P:1
End calling NVLINK query:831569359
Elapsed time:0 ms
Start calling NVLINK query:831569359
SUPPORT P2P:1
End calling NVLINK query:831569359
Elapsed time:0 ms
0. GeForce RTX 2080 Ti [00000000:01:00.0]
Start calling NVLINK query:831569365
SUPPORT P2P:1
End calling NVLINK query:831577373
Elapsed time:8008 ms
Start calling NVLINK query:831577373
SUPPORT P2P:1
End calling NVLINK query:831585381
Elapsed time:8008 ms
Start calling NVLINK query:831585381
SUPPORT P2P:1
End calling NVLINK query:831593389
Elapsed time:8008 ms
Start calling NVLINK query:831593389
SUPPORT P2P:1
End calling NVLINK query:831601397
Elapsed time:8008 ms
Start calling NVLINK query:831601397
SUPPORT P2P:1
End calling NVLINK query:831609405
Elapsed time:8008 ms
Start calling NVLINK query:831609405
SUPPORT P2P:1
End calling NVLINK query:831617413
Elapsed time:8008 ms
Press ENTER to continue...

Case 4: query sequence 0 -> 1 -> 0. Result: for card 0, 8 seconds; for card 1, 0 ms.

AlexWang1900 commented 4 years ago

Finally!! At last!!!!! Thanks!!! OMG!!!! LOL!!!!!!!!

For the test above, I decided to switch the PCI slots of the two cards. After I unplugged the NVLink bridge and card 0, I found out the NVLink bridge could not be plugged in fully on card 0: its plastic case is 3-4 mm too thick, so it got blocked by the water-cooling kit on card 0. On card 1 it could be fully plugged in.

I took the NVLink bridge apart, used only the PCB and the outer case, and plugged it in fully.

Now everything runs smoothly!!

example.c: 0ms

/usr/local/cuda/nvml/example$ sudo ./example
Found 2 devices

Listing devices:
0. GeForce RTX 2080 Ti [00000000:01:00.0]
Start calling NVLINK query:839863977
SUPPORT P2P:1
End calling NVLINK query:839863977
Elapsed time:0 ms
Start calling NVLINK query:839863977
SUPPORT P2P:1
End calling NVLINK query:839863977
Elapsed time:0 ms
Start calling NVLINK query:839863977
SUPPORT P2P:1
End calling NVLINK query:839863977
Elapsed time:0 ms
Start calling NVLINK query:839863977
SUPPORT P2P:1
End calling NVLINK query:839863977
Elapsed time:0 ms
Start calling NVLINK query:839863977
SUPPORT P2P:1
End calling NVLINK query:839863977
Elapsed time:0 ms
Start calling NVLINK query:839863977
SUPPORT P2P:1
End calling NVLINK query:839863977
Elapsed time:0 ms
1. GeForce RTX 2080 Ti [00000000:02:00.0]
Start calling NVLINK query:839863982
SUPPORT P2P:1
End calling NVLINK query:839863983
Elapsed time:1 ms
Start calling NVLINK query:839863983
SUPPORT P2P:1
End calling NVLINK query:839863983
Elapsed time:0 ms
Start calling NVLINK query:839863983
SUPPORT P2P:1
End calling NVLINK query:839863983
Elapsed time:0 ms
Start calling NVLINK query:839863983
SUPPORT P2P:1
End calling NVLINK query:839863983
Elapsed time:0 ms
Start calling NVLINK query:839863983
SUPPORT P2P:1
End calling NVLINK query:839863983
Elapsed time:0 ms
Start calling NVLINK query:839863983
SUPPORT P2P:1
End calling NVLINK query:839863983
Elapsed time:0 ms
0. GeForce RTX 2080 Ti [00000000:01:00.0]
Start calling NVLINK query:839863983
SUPPORT P2P:1
End calling NVLINK query:839863983
Elapsed time:0 ms
Start calling NVLINK query:839863983
SUPPORT P2P:1
End calling NVLINK query:839863984
Elapsed time:1 ms
Start calling NVLINK query:839863984
SUPPORT P2P:1
End calling NVLINK query:839863984
Elapsed time:0 ms
Start calling NVLINK query:839863984
SUPPORT P2P:1
End calling NVLINK query:839863984
Elapsed time:0 ms
Start calling NVLINK query:839863984
SUPPORT P2P:1
End calling NVLINK query:839863984
Elapsed time:0 ms
Start calling NVLINK query:839863984
SUPPORT P2P:1
End calling NVLINK query:839863984
Elapsed time:0 ms
All done.
Press ENTER to continue...

nccl-tests now runs in less than 1 second, and the bandwidth doubles.


/nccl-tests-master/build$ ./all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   3576 on wangying-system device  0 [0x01] GeForce RTX 2080 Ti
#   Rank  1 Pid   3576 on wangying-system device  1 [0x02] GeForce RTX 2080 Ti
NCCL version 2.6.4+cuda10.1
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2   float     sum    14.30    0.00    0.00  0e+00    13.73    0.00    0.00  0e+00
          16             4   float     sum    23.49    0.00    0.00  0e+00    13.39    0.00    0.00  0e+00
          32             8   float     sum    13.96    0.00    0.00  0e+00    13.29    0.00    0.00  0e+00
          64            16   float     sum    14.00    0.00    0.00  0e+00    13.37    0.00    0.00  0e+00
         128            32   float     sum    14.68    0.01    0.01  0e+00    13.87    0.01    0.01  0e+00
         256            64   float     sum    14.00    0.02    0.02  0e+00    13.42    0.02    0.02  0e+00
         512           128   float     sum    14.01    0.04    0.04  0e+00    13.59    0.04    0.04  0e+00
        1024           256   float     sum    14.14    0.07    0.07  0e+00    13.44    0.08    0.08  0e+00
        2048           512   float     sum    13.75    0.15    0.15  0e+00    35.19    0.06    0.06  0e+00
        4096          1024   float     sum    13.65    0.30    0.30  0e+00    13.85    0.30    0.30  0e+00
        8192          2048   float     sum    74.19    0.11    0.11  0e+00    13.41    0.61    0.61  0e+00
       16384          4096   float     sum    15.60    1.05    1.05  0e+00    14.74    1.11    1.11  0e+00
       32768          8192   float     sum    17.93    1.83    1.83  0e+00    19.26    1.70    1.70  0e+00
       65536         16384   float     sum    20.00    3.28    3.28  0e+00    19.46    3.37    3.37  0e+00
      131072         32768   float     sum    24.55    5.34    5.34  0e+00    23.82    5.50    5.50  0e+00
      262144         65536   float     sum    41.89    6.26    6.26  0e+00    55.21    4.75    4.75  0e+00
      524288        131072   float     sum    46.52   11.27   11.27  0e+00    46.73   11.22   11.22  0e+00
     1048576        262144   float     sum    55.81   18.79   18.79  0e+00    55.87   18.77   18.77  0e+00
     2097152        524288   float     sum    80.77   25.96   25.96  0e+00    76.93   27.26   27.26  0e+00
     4194304       1048576   float     sum    127.4   32.92   32.92  0e+00    190.7   22.00   22.00  0e+00
     8388608       2097152   float     sum    259.8   32.29   32.29  0e+00    220.5   38.04   38.04  0e+00
    16777216       4194304   float     sum    403.9   41.53   41.53  0e+00    404.2   41.51   41.51  0e+00
    33554432       8388608   float     sum    869.4   38.60   38.60  0e+00    853.9   39.30   39.30  0e+00
    67108864      16777216   float     sum   1700.6   39.46   39.46  0e+00   1505.6   44.57   44.57  0e+00
   134217728      33554432   float     sum   3159.4   42.48   42.48  0e+00   2975.7   45.11   45.11  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 12.1413 
sjeaugey commented 4 years ago

Thanks for the feedback. Happy to see it was solved.

Could you confirm whether there was any message in dmesg or /var/log/syslog about NVLink detection timing out? I'm realizing I never asked you to check the kernel logs, and it might be good to document here what the potential error messages are.

Let me know, thanks!

AlexWang1900 commented 4 years ago

I checked /var/log/syslog for July 7th, which is the most recent day an NCCL-TEST 60 s case happened. I can't find anything suspicious regarding NVIDIA, NVLink, PCI, PCIe, NVML, 60 seconds, 6000 ms, timer, or timeout; and dmesg has 3 records that only appear at PC power-on, showing something successfully started.

sjeaugey commented 4 years ago

Thanks. Also, feel free to close the bug if all is understood.