nThread 1 nGpus 1 minBytes 134217728 maxBytes 3221225472 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
nThread 1 nGpus 1 minBytes 134217728 maxBytes 3221225472 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 524323 on nlp-gpu-05 device 0 [0x21] NVIDIA A100-SXM4-80GB
Rank 0 Group 0 Pid 18325 on nlp-gpu-07 device 0 [0x51] NVIDIA A100-SXM4-80GB
nlp-gpu-05:524323:524323 [0] NCCL INFO Bootstrap : Using bond0:10.25.193.57<0>
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
nlp-gpu-05:524323:524323 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
nlp-gpu-05:524323:524323 [0] NCCL INFO P2P plugin IBext
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/IB : No device found.
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/IB : No device found.
nlp-gpu-05:524323:524323 [0] NCCL INFO NET/Socket : Using [0]bond0:10.25.193.57<0>
nlp-gpu-05:524323:524323 [0] NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
nlp-gpu-07:18325:18325 [0] NCCL INFO Bootstrap : Using bond0:10.25.193.75<0>
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
nlp-gpu-07:18325:18325 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
nlp-gpu-07:18325:18325 [0] NCCL INFO P2P plugin IBext
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/IB : No device found.
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/IB : No device found.
nlp-gpu-07:18325:18325 [0] NCCL INFO NET/Socket : Using [0]bond0:10.25.193.75<0>
nlp-gpu-07:18325:18325 [0] NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
nlp-gpu-05:524323:524328 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 00/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 01/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 02/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 03/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 04/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 05/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 06/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 07/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 08/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 09/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 10/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 11/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 12/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 13/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 14/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 15/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 16/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 17/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 18/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 19/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 20/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 21/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 22/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 23/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 24/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 25/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 26/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 27/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 28/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 29/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 30/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Channel 31/32 : 0
nlp-gpu-05:524323:524328 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
nlp-gpu-05:524323:524328 [0] NCCL INFO Connected all rings
nlp-gpu-05:524323:524328 [0] NCCL INFO Connected all trees
nlp-gpu-05:524323:524328 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
nlp-gpu-05:524323:524328 [0] NCCL INFO comm 0x7f5324001000 rank 0 nranks 1 cudaDev 0 busId 21000 - Init COMPLETE
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
134217728 33554432 float sum -1 174.6 768.84 0.00 0 0.53 251038.49 0.00 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 00/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 01/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 02/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 03/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 04/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 05/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 06/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 07/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 08/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 09/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 10/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 11/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 12/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 13/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 14/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 15/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 16/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 17/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 18/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 19/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 20/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 21/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 22/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 23/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 24/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 25/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 26/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 27/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 28/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 29/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 30/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Channel 31/32 : 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
268435456 67108864 float sum -1 343.9 780.63 0.00 0 0.53 506673.19 0.00 0
nlp-gpu-07:18325:18330 [0] NCCL INFO Connected all rings
nlp-gpu-07:18325:18330 [0] NCCL INFO Connected all trees
nlp-gpu-07:18325:18330 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
nlp-gpu-07:18325:18330 [0] NCCL INFO comm 0x7f9ab4001000 rank 0 nranks 1 cudaDev 0 busId 51000 - Init COMPLETE
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
536870912 134217728 float sum -1 685.9 782.67 0.00 0 0.53 1015262.69 0.00 0
134217728 33554432 float sum -1 174.9 767.45 0.00 0 0.43 310905.09 0.00 0
1073741824 268435456 float sum -1 1363.0 787.80 0.00 0 0.53 2034950.87 0.00 0
268435456 67108864 float sum -1 344.4 779.51 0.00 0 0.43 624050.81 0.00 0
536870912 134217728 float sum -1 671.8 799.20 0.00 0 0.43 1257308.93 0.00 0
2147483648 536870912 float sum -1 2655.4 808.73 0.00 0 0.53 4074148.45 0.00 0
nlp-gpu-05:524323:524323 [0] NCCL INFO comm 0x7f5324001000 rank 0 nranks 1 cudaDev 0 busId 21000 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth : 0
1073741824 268435456 float sum -1 1334.3 804.72 0.00 0 0.43 2522000.76 0.00 0
2147483648 536870912 float sum -1 2660.6 807.13 0.00 0 0.43 5035722.00 0.00 0
nlp-gpu-07:18325:18325 [0] NCCL INFO comm 0x7f9ab4001000 rank 0 nranks 1 cudaDev 0 busId 51000 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth : 0
I ran nccl-tests on two A100 nodes; the log above is the result. It looks like no ring was constructed between the two nodes — each node runs its own separate ring (both `Init COMPLETE` lines report `nranks 1`) — but I expected the rings to span both nodes. Can anyone tell me what's wrong?
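For context: a communicator with `nranks 1` on each node usually means the two test processes were started independently, so each one formed its own single-rank communicator instead of joining one job. A common way to run nccl-tests as a single two-rank job across both nodes is through an MPI launcher — a sketch, assuming the tests were built with `MPI=1` and Open MPI is available (the hostnames match this log; the exact paths are illustrative):

```shell
# Launch one all_reduce_perf process per node as a single 2-rank MPI job.
# -b/-e set min/max message sizes (matching the original run's
# minBytes/maxBytes), -f 2 doubles the size each step, and -g 1 uses
# one GPU per process. -x exports NCCL_DEBUG so both ranks log ring setup.
mpirun -np 2 -H nlp-gpu-05,nlp-gpu-07 \
    -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 134217728 -e 3221225472 -f 2 -g 1
```

If the job is launched this way, the `Init COMPLETE` lines should report `nranks 2`, and the rings will be built across both nodes over the socket (or IB) transport.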