cvg / pixloc

Back to the Feature: Learning Robust Camera Localization from Pixels to Pose (CVPR 2021)
Apache License 2.0

Dead lock after torch.distributed.all_reduce #29

Open loocy3 opened 2 years ago

loocy3 commented 2 years ago

Hi Skydes,

Training deadlocks in 'loss.backward' after 'torch.distributed.all_reduce', about 3 steps into training. I do not understand this part of the source code; could you help me understand it? I think 'loss.requires_grad' is always 'True'. Can it turn to 0 after the all_reduce? Thanks in advance.


do_backward = loss.requires_grad
if args.distributed:
    do_backward = torch.tensor(do_backward).float().to(device)
    torch.distributed.all_reduce(
        do_backward, torch.distributed.ReduceOp.PRODUCT)
    do_backward = do_backward > 0
if do_backward:
    loss.backward()
    optimizer.step()
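
For readers following the question about whether the flag can become 0: below is a minimal sketch, not taken from the pixloc repository, of what a PRODUCT reduction does to one flag per rank. It is pure Python and only simulates the collective; `all_reduce_product` is a hypothetical helper, and the assumption is that each rank contributes 1.0 when its 'loss.requires_grad' is True and 0.0 otherwise.

```python
import math


def all_reduce_product(flags):
    """Simulate all_reduce(..., ReduceOp.PRODUCT) on one float flag per rank.

    Hypothetical helper for illustration only: `flags` holds each rank's
    loss.requires_grad value. Every rank receives the same product, so
    every rank makes the same do_backward decision.
    """
    product = math.prod(1.0 if flag else 0.0 for flag in flags)
    # Mirrors `do_backward = do_backward > 0` on each rank.
    return [product > 0 for _ in flags]


# Both ranks have a differentiable loss: both would call loss.backward().
both_ok = all_reduce_product([True, True])

# One rank has a detached loss: the product is 0, so both ranks skip.
one_detached = all_reduce_product([True, False])

print(both_ok, one_detached)
```

Under these assumptions the reduced flag is 0 only when at least one rank reports False. The apparent purpose of such a guard is to keep all ranks on the same branch, since the gradient synchronization in a distributed backward pass is itself a collective that every rank has to enter.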


Here is the NCCL INFO log, in case it helps.


[01/18/2022 14:14:57 pixloc INFO] [E 15 | it 0] loss {total 2.150E+01, reprojection_error/0 3.094E+01, reprojection_error/1 3.357E+01, reprojection_error/2 3.350E+01, reprojection_error 3.350E+01, reprojection_error/init 3.094E+01} mlcv3:30429:30429 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fe9a31ff000 recvbuff 0x7fe9a31ff000 count 3584 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851f14400 mlcv3:30429:30429 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fe9a31ffe00 recvbuff 0x7fe9a31ffe00 count 64 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851fdc000 mlcv3:30428:30428 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f3467dfde00 recvbuff 0x7f3467dfde00 count 3584 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5619dddaba90 mlcv3:30428:30428 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f32379fe200 recvbuff 0x7f32379fe200 count 64 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a3f3ce0 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7febdbdffa00 recvbuff 0x7fe9adbffa00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f856413860 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3467dffa00 recvbuff 0x7f331be63000 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194ddec910 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a31ff000 recvbuff 0x7fe9a31ff400 count 516 datatype 0 op 0 root 0 
comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f8566aee90 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f331be63600 recvbuff 0x7f3467dfde00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194ec0abe0 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a31ffa00 recvbuff 0x7febdbdfde00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f8565f0d90 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3467dfe400 recvbuff 0x7f34475fb400 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a3e2520 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7febdbdfe400 recvbuff 0x7febdbdfe600 count 260 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851dd5710 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f347d5ee400 recvbuff 0x7f3467dfe800 count 260 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194e3e1c60 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a31ff000 recvbuff 0x7fe9a31ff400 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f856413860 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f331be63200 recvbuff 0x7f3467dfde00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO 
group.cc:306 Mem Alloc Size 8 pointer 0x56194ddebcd0 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9adbffc00 recvbuff 0x7febdbdfde00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851b283a0 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3467dfe400 recvbuff 0x7f34475fb400 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5619ddaf0f90 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7febdbdfe600 recvbuff 0x7fe9a1ffe200 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f855a2b2a0 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3467dfe800 recvbuff 0x7f34475fba00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194ed2aa70 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7febdbdfec00 recvbuff 0x7fe9a1ffe800 count 260 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851df4e80 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f34475fc200 recvbuff 0x7f34475fc400 count 260 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a5fc430 mlcv3:30429:30429 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9a9dff000 recvbuff 0x7fe9a9dff000 count 1 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f855e46980 mlcv3:30428:30428 [0] NCCL INFO AllReduce: 
opCount 0 sendbuff 0x7f327b9fee00 recvbuff 0x7f327b9fee00 count 1 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a8a2f20 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a1bffc00 recvbuff 0x7fe9a21fdc00 count 260 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd0e2e530 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f32317ffc00 recvbuff 0x7f32399fde00 count 260 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34615040e0 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f32399fe000 recvbuff 0x7f32399fe000 count 64 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460c10240 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9a1bffa00 recvbuff 0x7fe9a1bffa00 count 64 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd0481840 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f32395ffc00 recvbuff 0x7f32395ff400 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34612dade0 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a3ffec00 recvbuff 0x7fe9a1bff800 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd047feb0 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9a3ffec00 recvbuff 0x7fe9a3ffec00 count 128 datatype 7 op 0 root 0 comm 
0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd04aba40 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f323b3ff600 recvbuff 0x7f323b3ff600 count 128 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f346152e760 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9adbffa00 recvbuff 0x7fe9a3ffec00 count 260 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd0e2dfe0 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f327b9ff000 recvbuff 0x7f322dffee00 count 260 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460c014c0 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9abfffc00 recvbuff 0x7fe9abfffc00 count 64 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd1054aa0 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f347d5ed000 recvbuff 0x7f347d5ed000 count 64 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34613b65e0 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9abfffc00 recvbuff 0x7fe9a4b5de00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd02e3f00 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3237ffe400 recvbuff 0x7f32807ffa00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem 
Alloc Size 8 pointer 0x7f34601d0e80 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f347d5ed000 recvbuff 0x7f347d5ed000 count 128 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34612dadc0 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7febdbdfe400 recvbuff 0x7febdbdfe400 count 128 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd12004c0 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3237ffe400 recvbuff 0x7f32807ffa00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34614e3310 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a39fde00 recvbuff 0x7fe9a39fec00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd1128e50 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f347d5dc600 recvbuff 0x7f347d5dc600 count 128 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fea8fe5dc00 recvbuff 0x7fea8fe5dc00 count 128 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd05b7b60 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34613b1c90 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a39fde00 recvbuff 0x7fe9a39fec00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd05e4770 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 
sendbuff 0x7f32309fe000 recvbuff 0x7f32807ffa00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f346060d110 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f327ebfea00 recvbuff 0x7f327ebfea00 count 128 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34612f51c0 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9a39fde00 recvbuff 0x7fe9a39fde00 count 128 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd05b7a60 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a4bffc00 recvbuff 0x7fe9adbfec00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd06a8820 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3467dfe200 recvbuff 0x7f32807ffa00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34614e3d50 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f347d502800 recvbuff 0x7f347d502800 count 128 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9a4bffc00 recvbuff 0x7fe9a4bffc00 count 128 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd04aba40 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34613b1c70 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fea8fe22a00 recvbuff 0x7fe9adbfec00 count 516 datatype 0 op 0 root 0 comm 
0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd042c4c0 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f322cdfcc00 recvbuff 0x7f32807ffa00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460675680 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f347d502800 recvbuff 0x7f347d502800 count 128 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3461505340 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7febf15ed000 recvbuff 0x7febf15ed000 count 128 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd0617930 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f3474d80000 recvbuff 0x7f3474d80000 count 998229 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3461505650 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f3466000000 recvbuff 0x7f3466000000 count 7079424 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460710430 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7febe8d80000 recvbuff 0x7febe8d80000 count 998229 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd1129e00 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7febda000000 recvbuff 0x7febda000000 count 7079424 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO 
group.cc:306 Mem Alloc Size 8 pointer 0x7febd02e3d60 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f3452000000 recvbuff 0x7f3452000000 count 8632352 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460468ac0 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f3454200000 recvbuff 0x7f3454200000 count 7079424 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34601d0ea0 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f3450000000 recvbuff 0x7f3450000000 count 7079680 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3461560000 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f347514ee00 recvbuff 0x7f347514ee00 count 555072 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460468ac0 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7febc6000000 recvbuff 0x7febc6000000 count 8632352 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd00e1130 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7febc8200000 recvbuff 0x7febc8200000 count 7079424 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd04abab0 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7febc4000000 recvbuff 0x7febc4000000 count 7079680 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd120bfc0 
mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7febe914ee00 recvbuff 0x7febe914ee00 count 555072 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd00e1130 mlcv3:30428:30428 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f347d5dc600 recvbuff 0x7f347d5dc600 count 392 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194e3ca620 mlcv3:30428:30428 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f347d5ec600 recvbuff 0x7f347d5ec600 count 24 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194e3fa1b0 mlcv3:30428:30428 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f3467dfde00 recvbuff 0x7f3467dfde00 count 3584 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a6aef20 mlcv3:30428:30428 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f322cbffe00 recvbuff 0x7f322cbffe00 count 64 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194e621160 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f322f7ffc00 recvbuff 0x7f32379fde00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a3a9b70 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3467dffa00 recvbuff 0x7f3231bff400 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x5619dda9cd50 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f32311ff400 recvbuff 
0x7f32311ff800 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194e59ff20 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f347d5eda00 recvbuff 0x7f331be63000 count 260 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194e3f37f0 mlcv3:30429:30429 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7febf15b7c00 recvbuff 0x7febf15b7c00 count 392 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f852247ac0 mlcv3:30429:30429 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7febf15dc600 recvbuff 0x7febf15dc600 count 24 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f85666f230 mlcv3:30429:30429 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fe9a11ff200 recvbuff 0x7fe9a11ff200 count 3584 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f85559cd30 mlcv3:30429:30429 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fe9a0dffc00 recvbuff 0x7fe9a0dffc00 count 64 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851db9f00 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9ab9fe000 recvbuff 0x7fe9adbfec00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851d538a0 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9adbffc00 recvbuff 0x7fe9a11ff200 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 
0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851f7f9b0 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7febdbdffa00 recvbuff 0x7fe9a11ff800 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f855f419e0 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a31ff000 recvbuff 0x7fe9a31ff200 count 260 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851db5960 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7febdbdffa00 recvbuff 0x7fe9adbfec00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f8557a3610 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f3467dffa00 recvbuff 0x7f3231bff400 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194e648e00 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a11ff400 recvbuff 0x7fe9a11ff800 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f855cc0650 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f32379fe000 recvbuff 0x7f32311ff400 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a3a9b70 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a31ff200 recvbuff 0x7fe9a31ff600 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 
0x55f855b28480 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f331be63000 recvbuff 0x7f331be63400 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a5e05e0 mlcv3:30429:30429 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9adbffe00 recvbuff 0x7febdbdfde00 count 260 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f851eadca0 mlcv3:30428:30428 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f32311ffc00 recvbuff 0x7f3467dfde00 count 260 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194e0bc460 mlcv3:30429:30429 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9acbff600 recvbuff 0x7fe9acbff600 count 1 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30429 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x55f85617bc60 mlcv3:30428:30428 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f322cf75a00 recvbuff 0x7f322cf75a00 count 1 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30428 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x56194a848100 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a6dffa00 recvbuff 0x7fea8fffa000 count 260 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd0592b70 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f322e5ff800 recvbuff 0x7f32303ffc00 count 260 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460112b80 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f32303ffc00 
recvbuff 0x7f32303ffc00 count 64 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3461315800 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9a6dffa00 recvbuff 0x7fe9a6dffa00 count 64 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd0b29dd0 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f322cf75c00 recvbuff 0x7f32309ff000 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460617ba0 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a59ff200 recvbuff 0x7fe9a1904c00 count 516 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd00e1130 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9a5ffee00 recvbuff 0x7fe9a5ffee00 count 128 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd120bfa0 mlcv3:30428:30504 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f32309fea00 recvbuff 0x7f32309fea00 count 128 datatype 7 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f34601d0ea0 mlcv3:30429:30503 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe9a45fde00 recvbuff 0x7fe9a53fda00 count 260 datatype 0 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd12000c0 mlcv3:30428:30504 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7f322cbfe400 recvbuff 0x7f3230fffa00 count 516 datatype 0 op 0 root 0 comm 0x7f3460001200 [nranks=2] stream 
0x56194a313850 mlcv3:30428:30504 [0] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7f3460675280 mlcv3:30429:30503 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fe9a57fe800 recvbuff 0x7fe9a57fe800 count 64 datatype 7 op 0 root 0 comm 0x7febdc001200 [nranks=2] stream 0x55f851b1a060 mlcv3:30429:30503 [1] NCCL INFO group.cc:306 Mem Alloc Size 8 pointer 0x7febd1318b80