facebookresearch / moco

PyTorch implementation of MoCo: https://arxiv.org/abs/1911.05722
MIT License

AssertionError: Default process group is not initialized #61

Open upupbo opened 4 years ago

upupbo commented 4 years ago

Traceback (most recent call last):
  File "/apdcephfs/private_finechen/cbcode/moco/moco_test/run_moco.py", line 353, in <module>
    main()
  File "/apdcephfs/private_finechen/cbcode/moco/moco_test/run_moco.py", line 134, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "/apdcephfs/private_finechen/cbcode/moco/moco_test/run_moco.py", line 226, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/apdcephfs/private_finechen/cbcode/moco/moco_test/run_moco.py", line 250, in train
    output, target = model(im_q=images1, im_k=images2)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/apdcephfs/private_finechen/cbcode/moco/moco_test/moco/builder.py", line 133, in forward
    im_k, idx_unshuffle = self._batch_shuffle_ddp(im_k)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/apdcephfs/private_finechen/cbcode/moco/moco_test/moco/builder.py", line 76, in _batch_shuffle_ddp
    x_gather = concat_all_gather(x)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/apdcephfs/private_finechen/cbcode/moco/moco_test/moco/builder.py", line 170, in concat_all_gather
    tensors_gather = [torch.ones_like(tensor) for _ in range(torch.distributed.get_world_size())]
  File "/usr/local/lib64/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 586, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib64/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 202, in _get_group_size
    _check_default_pg()
  File "/usr/local/lib64/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 193, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
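The traceback ends in torch.distributed.get_world_size(), which fails when no default process group has been created (that is, torch.distributed.init_process_group() has not run in that process). The sketch below is a diagnostic aid only, not the repository's code: the names process_group_ready, safe_world_size and concat_all_gather_guarded are hypothetical, and the single-process fallback is an assumption made for debugging. It does not reproduce multi-GPU training behaviour.

```python
import torch
import torch.distributed as dist


def process_group_ready() -> bool:
    """Return True only if torch.distributed is usable and initialized."""
    return dist.is_available() and dist.is_initialized()


def safe_world_size() -> int:
    """World size if a process group exists, otherwise 1 (debug fallback)."""
    return dist.get_world_size() if process_group_ready() else 1


@torch.no_grad()
def concat_all_gather_guarded(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical guarded variant of an all-gather helper.

    Without a process group there is nothing to gather, so the input is
    returned unchanged (assumption: acceptable for debugging only).
    """
    if not process_group_ready():
        return tensor
    gathered = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor, async_op=False)
    return torch.cat(gathered, dim=0)


if __name__ == "__main__":
    print("process group initialized:", process_group_ready())
    print("world size:", safe_world_size())
```

If process_group_ready() reports False just before the model's forward pass, the script was most likely started without the distributed launch options it expects, so checking the launch command is a reasonable first step.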

ppwwyyxx commented 4 years ago

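Please provide:

  • what you did: do not modify the code and provide the exact command you run
  • the full logs you observed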

xuChenSJTU commented 3 years ago

Please provide:

  • what you did: do not modify the code and provide the exact command you run
  • the full logs you observed

I have the same problem. I did not revise any code, and my running command is:

CUDA_VISIBLE_DEVICES=4,5,6,7 python train_net.py --config-file configs/pascal_voc_R_50_C4_24k_moco.yaml MODEL.WEIGHTS ./output.pkl

Do you have any advice? I would really appreciate it.

fengxin619 commented 3 years ago

I have the same problem. How can it be fixed?

frank-xwang commented 3 years ago

The default configuration in the code assumes 8 GPUs. To run on 4 GPUs, you may need to change the number of GPUs and adjust the batch size, the number of training iterations, and the learning rate to match.
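As a rough illustration of that adjustment, here is a minimal sketch that rescales a schedule from a reference GPU count to a smaller one. It assumes the commonly used linear scaling heuristic (batch size and learning rate scale with the GPU count, iteration counts scale inversely), which the comment above does not spell out. The Schedule class, the rescale function and the reference numbers are hypothetical placeholders, not values read from the repository's config.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Schedule:
    num_gpus: int
    ims_per_batch: int        # total batch size across all GPUs
    base_lr: float
    max_iter: int
    steps: Tuple[int, ...]    # iterations at which the lr is decayed


def rescale(ref: Schedule, num_gpus: int) -> Schedule:
    """Rescale a reference schedule to a different GPU count.

    Assumption: linear scaling heuristic, keeping the per-GPU batch size
    and the total number of images seen roughly constant.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    scale = num_gpus / ref.num_gpus
    ims_per_batch = round(ref.ims_per_batch * scale)
    if ims_per_batch < num_gpus:
        raise ValueError("batch size would drop below one image per GPU")
    return Schedule(
        num_gpus=num_gpus,
        ims_per_batch=ims_per_batch,
        base_lr=ref.base_lr * scale,
        max_iter=round(ref.max_iter / scale),
        steps=tuple(round(s / scale) for s in ref.steps),
    )


if __name__ == "__main__":
    # Placeholder reference values for illustration only.
    reference = Schedule(num_gpus=8, ims_per_batch=16, base_lr=0.02,
                         max_iter=24000, steps=(18000, 22000))
    print(rescale(reference, num_gpus=4))
```

The rescaled values would then be supplied through whatever override mechanism the training script offers. Results under a rescaled schedule may still differ from the 8-GPU setting, so treat this as a starting point rather than an equivalent configuration.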