Closed RuoyuFeng closed 3 years ago
I ran into the same problem, thx a lot for providing your solution!
@qlz58793 Hi Ruoyu, I ran into a similar issue with parallel training, but the current version doesn't seem to have
if torch.cuda.device_count() > 1: model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
. Could you share your code if possible? Thanks a lot!
I haven't run this code for a long time, and the solution hasn't been updated since then, so I'm afraid I can't help you. Sorry.
When I used multiple GPUs for training, an error occurred:
Traceback (most recent call last):
  File "main.py", line 73, in <module>
    main(args=get_args())
  File "main.py", line 51, in main
    loss = model.forward(images1.to(args.device), images2.to(args.device))
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/fengry/vcm/comfea/MySimSiam-0.1.0/model.py", line 94, in forward
    z1, z2 = f(x1), f(x2)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torchvision/models/resnet.py", line 220, in forward
    return self._forward_impl(x)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torchvision/models/resnet.py", line 204, in _forward_impl
    x = self.bn1(x)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/data/fengry/anaconda3/envs/pytorch17/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized
Then in main.py, I added
torch.distributed.init_process_group('gloo', init_method='file:///tmp/somefile', rank=0, world_size=1)
before
if torch.cuda.device_count() > 1: model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
and added
loss = loss.mean()
before
loss.backward()
(DataParallel gathers one loss per replica, so it must be reduced to a scalar before backward()). Everything works now.
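Putting the pieces above together, the workaround looks roughly like this. This is a minimal sketch, not the thread's exact main.py: the small Sequential model stands in for the real SimSiam backbone, and a fresh temp file is used for the rendezvous instead of the fixed /tmp/somefile path (a stale file from an earlier run can break initialization).

```python
import os
import tempfile

import torch
import torch.nn as nn

# SyncBatchNorm.forward() calls torch.distributed.get_world_size(), so a
# default process group must exist even in single-process DataParallel mode.
# A one-member 'gloo' group is enough to satisfy the assertion.
init_file = os.path.join(tempfile.mkdtemp(), "shared_init")
torch.distributed.init_process_group(
    "gloo", init_method=f"file://{init_file}", rank=0, world_size=1
)

# Stand-in model (assumption); any module containing BatchNorm layers works.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

if torch.cuda.device_count() > 1:
    # Replace every BatchNorm layer with SyncBatchNorm, then wrap the model
    # for multi-GPU data parallelism.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = nn.DataParallel(model)

# In the training loop, DataParallel returns one loss per replica, so the
# result must be reduced to a scalar before backpropagation:
#   loss = model(images1, images2)
#   loss = loss.mean()
#   loss.backward()
```

Note that this keeps DataParallel; the PyTorch documentation generally recommends DistributedDataParallel (one process per GPU) instead, which also makes SyncBatchNorm work without this dummy process group.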