NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

No communication between two nodes #136

Closed GongZhengLi closed 1 year ago

GongZhengLi commented 1 year ago

I am running nccl-tests on two A100 nodes:

/usr/local/mpi/bin/mpirun -x NCCL_DEBUG=INFO -x NCCL_ALGO=RING --allow-run-as-root -n 2 -np 2 -host local:1,peer:1 ./build/all_reduce_perf -b 128M -e 3G -f 2 -t 1 -g 1

The log is:

nThread 1 nGpus 1 minBytes 134217728 maxBytes 3221225472 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
nThread 1 nGpus 1 minBytes 134217728 maxBytes 3221225472 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank  0 Group  0 Pid 524323 on nlp-gpu-05 device  0 [0x21] NVIDIA A100-SXM4-80GB
Rank  0 Group  0 Pid  18325 on nlp-gpu-07 device  0 [0x51] NVIDIA A100-SXM4-80GB
nlp-gpu-05:524323:524323 [0] NCCL INFO Bootstrap : Using bond0:10.25.193.57<0>
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
nlp-gpu-05:524323:524323 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
nlp-gpu-05:524323:524323 [0] NCCL INFO P2P plugin IBext
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/IB : No device found.
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/IB : No device found.
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/Socket : Using [0]bond0:10.25.193.57<0>
nlp-gpu-05:524323:524323 [0] NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
nlp-gpu-07:18325:18325 [0] NCCL INFO Bootstrap : Using bond0:10.25.193.75<0>
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
nlp-gpu-07:18325:18325 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
nlp-gpu-07:18325:18325 [0] NCCL INFO P2P plugin IBext
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/IB : No device found.
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/IB : No device found.
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/Socket : Using [0]bond0:10.25.193.75<0>
nlp-gpu-07:18325:18325 [0] NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
nlp-gpu-05:524323:524328 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 00/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 01/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 02/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 03/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 04/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 05/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 06/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 07/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 08/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 09/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 10/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 11/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 12/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 13/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 14/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 15/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 16/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 17/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 18/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 19/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 20/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 21/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 22/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 23/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 24/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 25/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 26/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 27/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 28/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 29/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 30/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 31/32 :    0
nlp-gpu-05:524323:524328 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
nlp-gpu-05:524323:524328 [0] NCCL INFO Connected all rings
nlp-gpu-05:524323:524328 [0] NCCL INFO Connected all trees
nlp-gpu-05:524323:524328 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
nlp-gpu-05:524323:524328 [0] NCCL INFO comm 0x7f5324001000 rank 0 nranks 1 cudaDev 0 busId 21000 - Init COMPLETE

                                                              out-of-place                       in-place          
       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
   134217728      33554432     float     sum      -1    174.6  768.84    0.00      0     0.53  251038.49    0.00      0
nlp-gpu-07:18325:18330 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 00/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 01/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 02/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 03/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 04/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 05/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 06/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 07/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 08/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 09/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 10/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 11/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 12/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 13/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 14/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 15/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 16/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 17/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 18/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 19/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 20/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 21/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 22/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 23/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 24/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 25/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 26/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 27/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 28/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 29/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 30/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 31/32 :    0
nlp-gpu-07:18325:18330 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
   268435456      67108864     float     sum      -1    343.9  780.63    0.00      0     0.53  506673.19    0.00      0
nlp-gpu-07:18325:18330 [0] NCCL INFO Connected all rings
nlp-gpu-07:18325:18330 [0] NCCL INFO Connected all trees
nlp-gpu-07:18325:18330 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
nlp-gpu-07:18325:18330 [0] NCCL INFO comm 0x7f9ab4001000 rank 0 nranks 1 cudaDev 0 busId 51000 - Init COMPLETE

                                                              out-of-place                       in-place          
       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
   536870912     134217728     float     sum      -1    685.9  782.67    0.00      0     0.53  1015262.69    0.00      0
   134217728      33554432     float     sum      -1    174.9  767.45    0.00      0     0.43  310905.09    0.00      0
  1073741824     268435456     float     sum      -1   1363.0  787.80    0.00      0     0.53  2034950.87    0.00      0
   268435456      67108864     float     sum      -1    344.4  779.51    0.00      0     0.43  624050.81    0.00      0
   536870912     134217728     float     sum      -1    671.8  799.20    0.00      0     0.43  1257308.93    0.00      0
  2147483648     536870912     float     sum      -1   2655.4  808.73    0.00      0     0.53  4074148.45    0.00      0
nlp-gpu-05:524323:524323 [0] NCCL INFO comm 0x7f5324001000 rank 0 nranks 1 cudaDev 0 busId 21000 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth    : 0 

  1073741824     268435456     float     sum      -1   1334.3  804.72    0.00      0     0.43  2522000.76    0.00      0
  2147483648     536870912     float     sum      -1   2660.6  807.13    0.00      0     0.43  5035722.00    0.00      0
nlp-gpu-07:18325:18325 [0] NCCL INFO comm 0x7f9ab4001000 rank 0 nranks 1 cudaDev 0 busId 51000 - Destroy COMPLETE
 Out of bounds values : 0 OK
 Avg bus bandwidth    : 0 

It seems that no ring was constructed between these two nodes; instead, a separate ring runs on each node (both processes report rank 0, nranks 1, and the bus bandwidth is 0). I expected the rings to span both nodes. Can anyone tell me what's wrong?

sjeaugey commented 1 year ago

You need to recompile the NCCL perf tests with MPI support by adding MPI=1 to your make command.
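
Without MPI support, each process launched by mpirun runs as an independent single-rank job, which is why both hosts in the log above report "rank 0 nranks 1". The nccl-tests README documents the rebuild as `make MPI=1 MPI_HOME=/path/to/mpi`, with the path set for your installation. As a rough way to check a log before and after the rebuild, here is a minimal Python sketch that scans NCCL_DEBUG=INFO output for "Init COMPLETE" lines. The line format is assumed from the NCCL 2.12 log above and may differ in other versions. The helper names `summarize_init` and `looks_like_separate_jobs` are hypothetical, not part of NCCL or nccl-tests.

```python
import re

# Assumed format, taken from the NCCL 2.12 log in this thread:
#   host:pid:tid [dev] NCCL INFO comm 0x... rank R nranks N ... - Init COMPLETE
INIT_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO comm \S+ "
    r"rank (?P<rank>\d+) nranks (?P<nranks>\d+) .*Init COMPLETE"
)


def summarize_init(log_text):
    """Return (host, rank, nranks) for every 'Init COMPLETE' line found."""
    found = []
    for line in log_text.splitlines():
        m = INIT_RE.match(line.strip())
        if m:
            found.append((m["host"], int(m["rank"]), int(m["nranks"])))
    return found


def looks_like_separate_jobs(log_text):
    """True when several processes initialized but each one saw nranks == 1."""
    comms = summarize_init(log_text)
    return len(comms) > 1 and all(nranks == 1 for _, _, nranks in comms)


# Two lines copied from the log in this thread.
SAMPLE = """\
nlp-gpu-05:524323:524328 [0] NCCL INFO comm 0x7f5324001000 rank 0 nranks 1 cudaDev 0 busId 21000 - Init COMPLETE
nlp-gpu-07:18325:18330 [0] NCCL INFO comm 0x7f9ab4001000 rank 0 nranks 1 cudaDev 0 busId 51000 - Init COMPLETE
"""

if __name__ == "__main__":
    for host, rank, nranks in summarize_init(SAMPLE):
        print(f"{host}: rank {rank} of {nranks}")
    if looks_like_separate_jobs(SAMPLE):
        print("Each process formed its own communicator; the ranks are not connected.")
```

Once the tests are built with MPI=1, the same two-process run should report nranks 2 with ranks 0 and 1, and the check above would no longer flag it.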

GongZhengLi commented 1 year ago

It works for me, thank you!