NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

all_reduce_perf fails on 2 nodes #150

Closed scvance closed 1 year ago

scvance commented 1 year ago

I am trying to run the all_reduce_perf test on two nodes, each of which has 8 A100 80GB GPUs. The nodes are supposed to be connected by an InfiniBand HDR network.

I am using the Slurm job manager and running the following command:

```
NCCL_DEBUG=INFO UCX_NET_DEVICES=mlx5_1:1,mlx5_11:1,mlx5_13:1,mlx5_3:1,mlx5_5:1,mlx5_7:1,mlx5_10:1,mlx5_12:1,mlx5_16:1,mlx5_18:1,mlx5_4:1,mlx5_6:1,mlx5_0:1,mlx5_2:1 UCX_TLS=ud NCCL_IB_HCA='^=mlx5_8,^=mlx5_14' srun --mpi=pmix all_reduce_perf -b 8 -e 128M -f 2 -g 1 -t 1
```

With only two GPUs, one per node, I get the following output:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    19.90    0.00    0.00      0    18.96    0.00    0.00      0
          16             4     float     sum      -1    20.67    0.00    0.00      0    19.10    0.00    0.00      0
          32             8     float     sum      -1    19.16    0.00    0.00      0    18.98    0.00    0.00      0
          64            16     float     sum      -1    19.36    0.00    0.00      0    19.07    0.00    0.00      0
         128            32     float     sum      -1    19.56    0.01    0.01      0    19.09    0.01    0.01      0
         256            64     float     sum      -1    19.66    0.01    0.01      0    19.82    0.01    0.01      0
         512           128     float     sum      -1    19.96    0.03    0.03      0    20.32    0.03    0.03      0
        1024           256     float     sum      -1    20.61    0.05    0.05      0    20.38    0.05    0.05      0
        2048           512     float     sum      -1    20.91    0.10    0.10      0    20.87    0.10    0.10      0
        4096          1024     float     sum      -1    22.06    0.19    0.19      0    21.93    0.19    0.19      0
        8192          2048     float     sum      -1    24.35    0.34    0.34      0    24.14    0.34    0.34      0
       16384          4096     float     sum      -1    46.40    0.35    0.35      0    27.47    0.60    0.60      0
       32768          8192     float     sum      -1    36.53    0.90    0.90      0    36.59    0.90    0.90      0
       65536         16384     float     sum      -1    52.98    1.24    1.24      0    53.92    1.22    1.22      0
      131072         32768     float     sum      -1    86.13    1.52    1.52      0    88.34    1.48    1.48      0
      262144         65536     float     sum      -1    146.3    1.79    1.79      0    148.9    1.76    1.76      0
      524288        131072     float     sum      -1    162.4    3.23    3.23      0    145.5    3.60    3.60      0
     1048576        262144     float     sum      -1    278.1    3.77    3.77      0    261.0    4.02    4.02      0
     2097152        524288     float     sum      -1    483.6    4.34    4.34      0    468.3    4.48    4.48      0
     4194304       1048576     float     sum      -1    854.6    4.91    4.91      0    853.1    4.92    4.92      0
     8388608       2097152     float     sum      -1   1618.7    5.18    5.18      0   1646.5    5.09    5.09      0
    16777216       4194304     float     sum      -1   3234.6    5.19    5.19      0   3240.5    5.18    5.18      0
    33554432       8388608     float     sum      -1   6459.4    5.19    5.19      0   6465.8    5.19    5.19      0
    67108864      16777216     float     sum      -1    12859    5.22    5.22      0    12916    5.20    5.20      0
   134217728      33554432     float     sum      -1    26012    5.16    5.16      0    26006    5.16    5.16      0
dw-2-2:34783:34783 [0] NCCL INFO comm 0x28eb470 rank 1 nranks 2 cudaDev 0 busId 46000 - Destroy COMPLETE
dw-1-1:85083:85083 [0] NCCL INFO comm 0x3ad72e0 rank 0 nranks 2 cudaDev 0 busId 46000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.96441 
#

I am able to increase the number of GPUs per node until I try to use 5 or more, at which point I get this error:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

dw-1-1:86614:86807 [4] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 192.168.205.50<58762> with error 12, opcode 0, len 0, vendor err 129 (Recv)
dw-1-1:86614:86807 [4] NCCL INFO transport/net.cc:1134 -> 6
dw-1-1:86614:86807 [4] NCCL INFO proxy.cc:679 -> 6
dw-1-1:86614:86807 [4] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

dw-2-2:35593:35650 [4] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 192.168.205.2<48334> with error 12, opcode 0, len 0, vendor err 129 (Recv)
dw-2-2:35593:35650 [4] NCCL INFO transport/net.cc:1134 -> 6
dw-2-2:35593:35650 [4] NCCL INFO proxy.cc:679 -> 6
dw-2-2:35593:35650 [4] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

dw-2-2:35593:35649 [0] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 192.168.205.2<34566> with error 12, opcode 0, len 0, vendor err 129 (Recv)
dw-2-2:35593:35649 [0] NCCL INFO transport/net.cc:1134 -> 6
dw-2-2:35593:35649 [0] NCCL INFO proxy.cc:679 -> 6
dw-2-2:35593:35649 [0] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

dw-1-1:86614:86805 [0] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer 192.168.205.50<55438> with error 12, opcode 0, len 0, vendor err 129 (Recv)
dw-1-1:86614:86805 [0] NCCL INFO transport/net.cc:1134 -> 6
dw-1-1:86614:86805 [0] NCCL INFO proxy.cc:679 -> 6
dw-1-1:86614:86805 [0] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

The program then hangs at this point without terminating. If I open nvidia-smi, I see that all GPUs report 100% utilization.

Has anyone run into this problem before?

sjeaugey commented 1 year ago

Regarding:

NCCL_IB_HCA='^=mlx5_8,^=mlx5_14' 

You should use ^ and = only once each, at the start of the list: ^ marks the list as an exclude list, and = makes name matching exact (so that, e.g., mlx5_1 does not also match mlx5_10 and mlx5_11).

So, I'd set NCCL_IB_HCA=^=mlx5_8,mlx5_14.
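A minimal sketch of the corrected setting (the srun line is commented out since it needs the original cluster; the device names are the ones from the command in the question):

```shell
# Exclude list: a single leading '^' (exclude these HCAs) and a single
# '=' (match device names exactly), then a comma-separated device list.
export NCCL_IB_HCA='^=mlx5_8,mlx5_14'
echo "$NCCL_IB_HCA"

# The full test run from above then becomes (other variables unchanged):
# NCCL_DEBUG=INFO UCX_TLS=ud ... srun --mpi=pmix all_reduce_perf -b 8 -e 128M -f 2 -g 1 -t 1
```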

scvance commented 1 year ago

Thanks, this actually helped a lot!!