PatrickHua / SimSiam

A PyTorch implementation of the paper 'Exploring Simple Siamese Representation Learning'
MIT License

A solved problem during parallel training. #5

Closed RuoyuFeng closed 3 years ago

RuoyuFeng commented 3 years ago

When I use multiple GPUs for training, the following error occurs:

Traceback (most recent call last):
  File "main.py", line 73, in <module>
    main(args=get_args())
  File "main.py", line 51, in main
    loss = model.forward(images1.to(args.device), images2.to(args.device))
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/fengry/vcm/comfea/MySimSiam-0.1.0/model.py", line 94, in forward
    z1, z2 = f(x1), f(x2)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torchvision/models/resnet.py", line 220, in forward
    return self._forward_impl(x)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torchvision/models/resnet.py", line 204, in _forward_impl
    x = self.bn1(x)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized

Then, in main.py, I added
torch.distributed.init_process_group('gloo', init_method='file:///tmp/somefile', rank=0, world_size=1)
before the line
if torch.cuda.device_count() > 1: model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
and added loss = loss.mean() before loss.backward().

Everything goes well now.
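
For anyone hitting this later, here is a minimal sketch of where the two changes described above sit in a single-node DataParallel training loop. The names SimSiam, backbone, train_loader, optimizer, and device are illustrative placeholders, not the repository's exact main.py.

import torch

device = 'cuda'

# SyncBatchNorm calls torch.distributed.get_world_size() in its forward pass,
# so a default process group must exist even when training with nn.DataParallel
# on a single machine. A one-process 'gloo' group is enough for that.
torch.distributed.init_process_group(
    'gloo', init_method='file:///tmp/somefile', rank=0, world_size=1)

model = SimSiam(backbone)  # placeholder for the repository's model construction
if torch.cuda.device_count() > 1:
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = torch.nn.DataParallel(model)
model = model.to(device)

for images1, images2 in train_loader:  # placeholder data loader
    loss = model(images1.to(device), images2.to(device))
    # nn.DataParallel returns one loss value per GPU, so reduce to a scalar
    # before calling backward().
    loss = loss.mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

The alternatives are to keep plain BatchNorm, which works under nn.DataParallel without any process group, or to switch to DistributedDataParallel, where a process group has to be initialized anyway and SyncBatchNorm works as intended.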

A-suozhang commented 3 years ago

I ran into the same problem, thanks a lot for providing your solution!

mmw0909 commented 2 years ago

@qlz58793 Hi Ruoyu, I ran into a similar issue with parallel training, but the current version doesn't seem to have if torch.cuda.device_count() > 1: model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model). Could you share your code if possible? Thanks a lot!

RuoyuFeng commented 2 years ago

@qlz58793 Hi Ruoyu, I ran into a similar issue with parallel training, but the current version doesn't seem to have if torch.cuda.device_count() > 1: model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model). Could you share your code if possible? Thanks a lot!

I haven't run this code in a long time, and the solution hasn't been updated since then. I'm sorry I can't help you.