NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL hang before training loop #1160

Open TeddLi opened 8 months ago

TeddLi commented 8 months ago

Hi there, I am trying to run this test (https://github.com/pytorch/examples/tree/main/distributed/FSDP) to check whether my CUDA installation and GPUs work correctly. I have disabled both ACS and IOMMU, but the process always hangs just before the training loop, at the point shown in the log below. Ctrl+C won't kill it, so I have to restart the server every time.

r0 Training Epoch:   0%|          | 0/188 [00:00<?, ?it/s]
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=_ALLGATHER_BASE, TensorShape=[1], TensorDtypes=Int, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
g3-xlarge-x86-dal-1:3183:3183 [0] NCCL INFO Bootstrap : Using eno1:160.202.129.119<0>
g3-xlarge-x86-dal-1:3183:3183 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
g3-xlarge-x86-dal-1:3183:3183 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
g3-xlarge-x86-dal-1:3183:3183 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.18.1+cuda12.1
g3-xlarge-x86-dal-1:3342:3342 [1] NCCL INFO cudaDriverVersion 12010
g3-xlarge-x86-dal-1:3342:3342 [1] NCCL INFO Bootstrap : Using eno1:160.202.129.119<0>
g3-xlarge-x86-dal-1:3342:3342 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
g3-xlarge-x86-dal-1:3342:3342 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
g3-xlarge-x86-dal-1:3342:3734 [1] NCCL INFO Failed to open libibverbs.so[.1]
g3-xlarge-x86-dal-1:3342:3734 [1] NCCL INFO NET/Socket : Using [0]eno1:160.202.129.119<0> [1]eno2:10.87.1.117<0>
g3-xlarge-x86-dal-1:3342:3734 [1] NCCL INFO Using network Socket
g3-xlarge-x86-dal-1:3183:3733 [0] NCCL INFO Failed to open libibverbs.so[.1]
g3-xlarge-x86-dal-1:3183:3733 [0] NCCL INFO NET/Socket : Using [0]eno1:160.202.129.119<0> [1]eno2:10.87.1.117<0>
g3-xlarge-x86-dal-1:3183:3733 [0] NCCL INFO Using network Socket
TeddLi commented 8 months ago

I also passed the local NCCL test:

root@g3-xlarge-x86-dal-1:/home/ubuntu/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3096 on g3-xlarge-x86-dal-1 device  0 [0x01] NVIDIA H100 PCIe
#  Rank  1 Group  0 Pid   3096 on g3-xlarge-x86-dal-1 device  1 [0x21] NVIDIA H100 PCIe
#  Rank  2 Group  0 Pid   3096 on g3-xlarge-x86-dal-1 device  2 [0x41] NVIDIA H100 PCIe
#  Rank  3 Group  0 Pid   3096 on g3-xlarge-x86-dal-1 device  3 [0x61] NVIDIA H100 PCIe
#  Rank  4 Group  0 Pid   3096 on g3-xlarge-x86-dal-1 device  4 [0x81] NVIDIA H100 PCIe
#  Rank  5 Group  0 Pid   3096 on g3-xlarge-x86-dal-1 device  5 [0xa1] NVIDIA H100 PCIe
#  Rank  6 Group  0 Pid   3096 on g3-xlarge-x86-dal-1 device  6 [0xc1] NVIDIA H100 PCIe
#  Rank  7 Group  0 Pid   3096 on g3-xlarge-x86-dal-1 device  7 [0xe1] NVIDIA H100 PCIe
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    33.98    0.00    0.00      0    23.26    0.00    0.00      0
          16             4     float     sum      -1    23.09    0.00    0.00      0    23.29    0.00    0.00      0
          32             8     float     sum      -1    23.10    0.00    0.00      0    23.26    0.00    0.00      0
          64            16     float     sum      -1    23.21    0.00    0.00      0    23.53    0.00    0.00      0
         128            32     float     sum      -1    23.43    0.01    0.01      0    23.08    0.01    0.01      0
         256            64     float     sum      -1    23.44    0.01    0.02      0    23.28    0.01    0.02      0
         512           128     float     sum      -1    26.47    0.02    0.03      0    24.22    0.02    0.04      0
        1024           256     float     sum      -1    27.07    0.04    0.07      0    23.27    0.04    0.08      0
        2048           512     float     sum      -1    23.40    0.09    0.15      0    23.62    0.09    0.15      0
        4096          1024     float     sum      -1    23.64    0.17    0.30      0    23.31    0.18    0.31      0
        8192          2048     float     sum      -1    24.08    0.34    0.60      0    23.68    0.35    0.61      0
       16384          4096     float     sum      -1    23.66    0.69    1.21      0    23.90    0.69    1.20      0
       32768          8192     float     sum      -1    24.50    1.34    2.34      0    23.69    1.38    2.42      0
       65536         16384     float     sum      -1    24.93    2.63    4.60      0    25.00    2.62    4.59      0
      131072         32768     float     sum      -1    31.99    4.10    7.17      0    29.86    4.39    7.68      0
      262144         65536     float     sum      -1    88.35    2.97    5.19      0    87.11    3.01    5.27      0
      524288        131072     float     sum      -1    105.4    4.97    8.70      0    110.2    4.76    8.32      0
     1048576        262144     float     sum      -1    108.9    9.63   16.85      0    109.7    9.56   16.72      0
     2097152        524288     float     sum      -1    155.5   13.49   23.61      0    170.1   12.33   21.57      0
     4194304       1048576     float     sum      -1    301.8   13.90   24.32      0    300.8   13.94   24.40      0
     8388608       2097152     float     sum      -1    617.3   13.59   23.78      0    605.7   13.85   24.24      0
    16777216       4194304     float     sum      -1   1246.0   13.46   23.56      0   1225.7   13.69   23.95      0
    33554432       8388608     float     sum      -1   2538.0   13.22   23.14      0   2544.2   13.19   23.08      0
    67108864      16777216     float     sum      -1   5056.2   13.27   23.23      0   5056.2   13.27   23.23      0
   134217728      33554432     float     sum      -1    10089   13.30   23.28      0    10108   13.28   23.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.46603 
#
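
For readers checking the numbers in the table above, here is a minimal sketch of how the `algbw` and `busbw` columns relate for all-reduce. It assumes the convention described in the nccl-tests performance notes (algbw = size / time, busbw = algbw × 2(n−1)/n for n ranks); the function name `allreduce_bandwidth` is made up for illustration and is not part of nccl-tests.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce measurement.

    Assumes the nccl-tests convention: algbw = size / time and
    busbw = algbw * 2 * (n - 1) / n for all-reduce across n ranks.
    """
    # bytes per microsecond equals MB/s; divide by 1e3 to get GB/s
    algbw = size_bytes / time_us / 1e3
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Last out-of-place row of the table above: 134217728 B in 10089 us on 8 GPUs
algbw, busbw = allreduce_bandwidth(134217728, 10089, 8)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The reported "Avg bus bandwidth" is lower than the large-message figure, most likely because it averages over every message size in the sweep, including the small latency-bound ones.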
TeddLi commented 8 months ago

@peterhj @Flamefire @aaronp24 @chr1sj0nes

sjeaugey commented 8 months ago

Are you sure your NCCL environment is the same on both runs? Perhaps compare the log of the two runs? In particular, it seems NCCL_SOCKET_IFNAME is set to eno in the PyTorch run, leading to NCCL using both eno1 and eno2:

g3-xlarge-x86-dal-1:3342:3734 [1] NCCL INFO NET/Socket : Using [0]eno1:160.202.129.119<0> [1]eno2:10.87.1.117<0>

Is that what you were using with the NCCL perf tests to get 23 GB/s?
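
One way to compare the two runs, as suggested above, is to pull the `NET/Socket : Using ...` lines out of each log and diff the interface lists. The following is a rough sketch only: the helper names `socket_interfaces` and `compare_runs` are hypothetical, the regular expression is inferred from the single log line quoted in this thread, and it assumes IPv4 addresses.

```python
import re

# Pattern inferred from the log line quoted in this thread (assumption:
# other NCCL versions may format this line differently).
_SOCKET_LINE = re.compile(r"NCCL INFO NET/Socket : Using (.*)")
_ENTRY = re.compile(r"\[(\d+)\]([^\s:]+):([0-9.]+)<\d+>")


def socket_interfaces(log_text):
    """Return the set of (interface, IPv4 address) pairs NCCL reports using."""
    found = set()
    for line in log_text.splitlines():
        match = _SOCKET_LINE.search(line)
        if match:
            for _index, name, addr in _ENTRY.findall(match.group(1)):
                found.add((name, addr))
    return found


def compare_runs(log_a, log_b):
    """Show which interfaces appear in only one of the two logs."""
    a, b = socket_interfaces(log_a), socket_interfaces(log_b)
    return {"only_a": a - b, "only_b": b - a, "both": a & b}


# Log line taken from the PyTorch run in this issue
pytorch_log = (
    "g3-xlarge-x86-dal-1:3342:3734 [1] NCCL INFO NET/Socket : "
    "Using [0]eno1:160.202.129.119<0> [1]eno2:10.87.1.117<0>"
)
print(sorted(socket_interfaces(pytorch_log)))
```

To use it, capture both runs with `NCCL_DEBUG=INFO`, read each log file into a string, and pass the two strings to `compare_runs`; a non-empty `only_a` or `only_b` would suggest the two runs did not select the same interfaces.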