facebookresearch / moco-v3

PyTorch implementation of MoCo v3 https://arxiv.org/abs/2104.02057

Does this implementation support non-distributed training? #15

Closed Euphoria16 closed 3 years ago

Euphoria16 commented 3 years ago

I found that if I don't use distributed training, i.e. set --multiprocessing-distributed=False and use a single GPU, there seems to be no problem in main_moco.py with

   torch.cuda.set_device(args.gpu)
   model = model.cuda(args.gpu)

However, this error occurred when training started

AssertionError: Default process group is not initialized

This error can be traced back to

File "~/moco-v3/moco/builder.py", line 68, in contrastive_loss k = concat_all_gather(k)

and

File "~/moco-v3/moco/builder.py", line 178, in concat_allgather for in range(torch.distributed.get_world_size())]

This error is caused by the computation of contrastive_loss, which still relies on distributed training. So I wonder whether non-distributed training is unsupported even when multiprocessing-distributed=False is set.

endernewton commented 3 years ago

Yep, I think an easy way to enable it is to just skip that line: it performs an all-gather operation across all GPUs, and since non-distributed training only has one GPU, that line is not needed.
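
For example, one way to do this is to make the gather a no-op when no process group has been initialized, so the same code works on a single GPU. This is an untested sketch, not the exact code in the repo; the is_initialized() guard is the only change:

   import torch

   @torch.no_grad()
   def concat_all_gather(tensor):
       # Single-GPU / non-distributed run: no process group exists,
       # so there is nothing to gather and we return the tensor as-is.
       if not (torch.distributed.is_available() and torch.distributed.is_initialized()):
           return tensor
       # Distributed run: gather the tensor from every rank and concatenate.
       tensors_gather = [torch.ones_like(tensor)
                         for _ in range(torch.distributed.get_world_size())]
       torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
       return torch.cat(tensors_gather, dim=0)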