NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Test NCCL failure common.cu:954 'unhandled cuda error' when testing on >2 GPUs #183

Closed caopulan closed 8 months ago

caopulan commented 8 months ago

On a single node with 7 H800 GPUs, this error is raised when testing on more than 2 GPUs: ./build/broadcast_perf -b 8 -e 256M -f 2 -g 3

When using 2 GPUs:

NCCL_DEBUG=WARN ./build/broadcast_perf -b 8 -e 256M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5950 on dx-ai-node66 device  0 [0x34] NVIDIA H800
#  Rank  1 Group  0 Pid   5950 on dx-ai-node66 device  1 [0x48] NVIDIA H800
NCCL version 2.19.4+cuda12.2
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float    none       0    144.3    0.00    0.00      0    21.60    0.00    0.00      0
          16             4     float    none       0    21.10    0.00    0.00      0    127.7    0.00    0.00      0
          32             8     float    none       0    127.2    0.00    0.00      0    128.7    0.00    0.00      0
          64            16     float    none       0    126.3    0.00    0.00      0    127.4    0.00    0.00      0
         128            32     float    none       0    127.7    0.00    0.00      0    127.8    0.00    0.00      0
         256            64     float    none       0    127.9    0.00    0.00      0    127.6    0.00    0.00      0
         512           128     float    none       0    128.4    0.00    0.00      0    128.4    0.00    0.00      0
        1024           256     float    none       0    130.0    0.01    0.01      0    129.4    0.01    0.01      0
        2048           512     float    none       0    128.7    0.02    0.02      0    128.2    0.02    0.02      0
        4096          1024     float    none       0    128.5    0.03    0.03      0    126.9    0.03    0.03      0
        8192          2048     float    none       0    127.3    0.06    0.06      0    127.1    0.06    0.06      0
       16384          4096     float    none       0    127.4    0.13    0.13      0    127.4    0.13    0.13      0
       32768          8192     float    none       0    127.9    0.26    0.26      0    127.3    0.26    0.26      0
       65536         16384     float    none       0    127.5    0.51    0.51      0    127.3    0.51    0.51      0
      131072         32768     float    none       0    127.7    1.03    1.03      0    127.6    1.03    1.03      0
      262144         65536     float    none       0    129.0    2.03    2.03      0    128.7    2.04    2.04      0
      524288        131072     float    none       0    137.3    3.82    3.82      0    136.5    3.84    3.84      0
     1048576        262144     float    none       0    140.3    7.47    7.47      0    140.0    7.49    7.49      0
     2097152        524288     float    none       0    147.1   14.26   14.26      0    146.8   14.29   14.29      0
     4194304       1048576     float    none       0    160.4   26.15   26.15      0    159.5   26.29   26.29      0
     8388608       2097152     float    none       0    186.4   45.01   45.01      0    185.5   45.21   45.21      0
    16777216       4194304     float    none       0    360.4   46.55   46.55      0    355.8   47.16   47.16      0
    33554432       8388608     float    none       0    584.6   57.40   57.40      0    581.5   57.71   57.71      0
    67108864      16777216     float    none       0   1032.5   64.99   64.99      0   1027.6   65.31   65.31      0
   134217728      33554432     float    none       0   1799.6   74.58   74.58      0   1793.8   74.83   74.83      0
   268435456      67108864     float    none       0   3818.9   70.29   70.29      0   7625.1   35.20   35.20      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 15.3082 

When using 3 GPUs:

NCCL_DEBUG=WARN ./build/broadcast_perf -b 8 -e 256M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   6037 on dx-ai-node66 device  0 [0x34] NVIDIA H800
#  Rank  1 Group  0 Pid   6037 on dx-ai-node66 device  1 [0x48] NVIDIA H800
#  Rank  2 Group  0 Pid   6037 on dx-ai-node66 device  2 [0x5a] NVIDIA H800
NCCL version 2.19.4+cuda12.2

dx-ai-node66:6037:6047 [1] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'

dx-ai-node66:6037:6046 [0] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'

dx-ai-node66:6037:6048 [2] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'
dx-ai-node66: Test NCCL failure common.cu:954 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. dx-ai-node66 pid 6037: Test failure common.cu:844

I found that transport/nvls.cc:169 corresponds to this call: https://github.com/NVIDIA/nccl/blob/b6d7438d3145a619f924dbbca6c96db21fab716e/src/transport/nvls.cc#L169 — CUCHECK(cuMulticastBindMem(resources->mcHandle, 0/*mcOffset*/, resources->ucHandle, 0/*memOffset*/, size, 0/*flags*/));

sjeaugey commented 8 months ago

It seems we're trying to use NVLS when we shouldn't. Setting NCCL_NVLS_ENABLE=0 should work around the problem.

Edit: it may be that the fabric manager was restarted but the GPUs weren't reset. You may want to reset your GPUs with nvidia-smi -r and try again.
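The environment-variable workaround amounts to disabling the NVLS (NVLink SHARP) transport when launching the failing test, e.g.:

```shell
# Disable the NVLS transport so NCCL falls back to its other
# intra-node paths, then re-run the same 3-GPU broadcast test.
NCCL_NVLS_ENABLE=0 ./build/broadcast_perf -b 8 -e 256M -f 2 -g 3
```

The variable only needs to be set for the run; it does not persist.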

AddyLaddy commented 8 months ago

Also see Section 2.2 of this document: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
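As a sketch of the suggested recovery steps (assuming a systemd-managed host where the fabric manager runs as the nvidia-fabricmanager service):

```shell
# Check that the fabric manager service is up; NVLS multicast
# setup fails if the GPUs lost their registration with it.
systemctl status nvidia-fabricmanager

# Reset the GPUs so they re-register with the fabric manager.
# This requires that no processes are currently using the GPUs.
sudo nvidia-smi -r
```

After the reset, re-running the 3-GPU test without NCCL_NVLS_ENABLE=0 should confirm whether NVLS works again.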

caopulan commented 8 months ago

> It seems we're trying to use NVLS when we shouldn't. Setting NCCL_NVLS_ENABLE=0 should work around the problem.
>
> Edit: it may be that the fabric manager was restarted but the GPUs weren't reset. You may want to reset your GPUs with nvidia-smi -r and try again.

It works! THANKS A LOT

caopulan commented 8 months ago

> Also see Section 2.2 of this document: https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

ok thanks