Open imdoublecats opened 4 years ago
Bro, thanks for your help! I got the same error, "Default process group is not initialized", and my torch version is also 1.6.
In the same boat here, however, I have a bit more information for anyone that wants to do more digging.
I've been using PyTorch 1.6 with detectron2 (9eb4831 as recommended) in order to train the MS_DLA_34_4x_syncbn_shared_towers_bn_head.yaml model for months now without issue.
However, as of last week, when my cloud computing provider forced a mandatory kernel update (see below), I am now getting the exact same error:
[12/08 20:47:06 adet.trainer]: Starting training from iteration 0
Traceback (most recent call last):
File "tools/train_net.py", line 237, in <module>
launch(
File "/root/code/detectron2/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "tools/train_net.py", line 231, in main
return trainer.train()
File "tools/train_net.py", line 113, in train
self.train_loop(self.start_iter, self.max_iter)
File "tools/train_net.py", line 102, in train_loop
self.run_step()
File "/root/code/detectron2/detectron2/engine/train_loop.py", line 216, in run_step
loss_dict = self.model(data)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/code/adet/adet/modeling/one_stage_detector.py", line 46, in forward
return super().forward(batched_inputs)
File "/root/code/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 274, in forward
features = self.backbone(images.tensor)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/code/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
bottom_up_features = self.bottom_up(x)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/code/adet/adet/modeling/backbone/dla.py", line 302, in forward
x = self.base_layer(x)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
return _get_group_size(group)
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
_check_default_pg()
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
assert _default_pg is not None, \
AssertionError: Default process group is not initialized
Running cat /var/log/dpkg.log returns:
ubuntu@host:~$ cat /var/log/dpkg.log
2020-12-02 06:47:25 startup archives unpack
2020-12-02 06:47:25 install linux-modules-5.4.0-1030-aws:amd64 <none> 5.4.0-1030.31~18.04.1
2020-12-02 06:47:25 status half-installed linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status unpacked linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status unpacked linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 install linux-image-5.4.0-1030-aws:amd64 <none> 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status half-installed linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status unpacked linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status unpacked linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 upgrade linux-aws:amd64 5.4.0.1029.14 5.4.0.1030.15
2020-12-02 06:47:27 status half-configured linux-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status unpacked linux-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status half-installed linux-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status half-installed linux-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status unpacked linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:27 status unpacked linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:27 upgrade linux-image-aws:amd64 5.4.0.1029.14 5.4.0.1030.15
2020-12-02 06:47:27 status half-configured linux-image-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status unpacked linux-image-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status half-installed linux-image-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status half-installed linux-image-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:27 status unpacked linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:27 status unpacked linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:27 install linux-aws-5.4-headers-5.4.0-1030:all <none> 5.4.0-1030.31~18.04.1
2020-12-02 06:47:27 status half-installed linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:29 status unpacked linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:29 status unpacked linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 install linux-headers-5.4.0-1030-aws:amd64 <none> 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 status half-installed linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 status unpacked linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 status unpacked linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:30 upgrade linux-headers-aws:amd64 5.4.0.1029.14 5.4.0.1030.15
2020-12-02 06:47:30 status half-configured linux-headers-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:30 status unpacked linux-headers-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:30 status half-installed linux-headers-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:30 status half-installed linux-headers-aws:amd64 5.4.0.1029.14
2020-12-02 06:47:30 status unpacked linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:30 status unpacked linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:47:31 startup packages configure
2020-12-02 06:47:31 configure linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:47:31 status unpacked linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status half-configured linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status installed linux-aws-5.4-headers-5.4.0-1030:all 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 configure linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:47:31 status unpacked linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status half-configured linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status installed linux-modules-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 configure linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:47:31 status unpacked linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:47:31 status half-configured linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 status installed linux-headers-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 configure linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:48:24 status unpacked linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 status half-configured linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 status installed linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 status triggers-pending linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:24 configure linux-headers-aws:amd64 5.4.0.1030.15 <none>
2020-12-02 06:48:24 status unpacked linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status half-configured linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status installed linux-headers-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 configure linux-image-aws:amd64 5.4.0.1030.15 <none>
2020-12-02 06:48:24 status unpacked linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status half-configured linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status installed linux-image-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 configure linux-aws:amd64 5.4.0.1030.15 <none>
2020-12-02 06:48:24 status unpacked linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status half-configured linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 status installed linux-aws:amd64 5.4.0.1030.15
2020-12-02 06:48:24 trigproc linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1 <none>
2020-12-02 06:48:24 status half-configured linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-02 06:48:36 status installed linux-image-5.4.0-1030-aws:amd64 5.4.0-1030.31~18.04.1
2020-12-03 06:25:37 startup packages remove
2020-12-03 06:25:37 status installed linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:38 remove linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1 <none>
2020-12-03 06:25:38 status half-configured linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:40 status half-installed linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status config-files linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status config-files linux-image-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status installed linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 remove linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1 <none>
2020-12-03 06:25:42 status half-configured linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status half-installed linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status config-files linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 status config-files linux-modules-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:42 startup packages configure
2020-12-03 06:25:45 startup packages remove
2020-12-03 06:25:45 status installed linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:45 remove linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1 <none>
2020-12-03 06:25:45 status half-configured linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:45 status half-installed linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:47 status config-files linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:47 status config-files linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:47 status config-files linux-headers-5.4.0-1028-aws:amd64 5.4.0-1028.29~18.04.1
2020-12-03 06:25:47 status not-installed linux-headers-5.4.0-1028-aws:amd64 <none>
2020-12-03 06:25:47 startup packages configure
Reverting to PyTorch 1.5 of course fixes the issue (for the same reason described in the original question), since it simply changes the nn.SyncBatchNorm implementation to NaiveSyncBatchNorm (you can also set this explicitly in the configuration file by setting NORM: naiveSyncBN). However, it would be great to get to the root cause.
For a minimum working example, simply install AdelaiDet as described (you can even use the pre-built docker container provided -- just make sure to upgrade PyTorch to 1.6 and rebuild detectron2 + adet), download the COCO 2017 dataset to the datasets/ directory, and run the demo training example:
python3 tools/train_net.py --config-file configs/FCOS-Detection/FCOS_RT/MS_DLA_34_4x_syncbn_shared_towers_bn_head.yaml OUTPUT_DIR /tmp
NOTE: I tried using the Dockerfile to create the minimum working example; however, it doesn't work due to its use of the latest detectron2 version (ignoring the recommended version hash 9eb4831). Out of the box, you will first get a cv2 not installed error, followed by a libGL.so import error. Fixing these, however, will only take you back to the original issue.
NOTE2: The same error persists with PyTorch 1.7 as well.
Thanks in advance.
UPDATE: This might be solved in light of this thread https://github.com/facebookresearch/detectron2/issues/2174 ... I too was only using num_gpus==1 for training originally. Still not sure why it was working for so long though.
As @ashariati mentioned, if gpu_num == 1 and num_machines == 1, then there is no point in using SyncBatchNorm.
I'm not using AdelaiDet, so I'm not very sure, but I guess you need to set the cfg properly: for example, set cfg.MODEL.CONDINST.MASK_BRANCH.NORM and cfg.MODEL.BASIS_MODULE.NORM to "BN", which makes get_norm return BatchNorm2d instead of nn.SyncBatchNorm.
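As a rough illustration of that suggestion, the norm could be picked from the world size before building the model. This is only a sketch: pick_norm is a hypothetical helper (it is not part of detectron2 or AdelaiDet), and the cfg keys in the comments are the ones named above.

```python
def pick_norm(requested_norm, num_gpus, num_machines=1):
    """Return a norm name that does not need a process group when
    training runs in a single process (hypothetical helper)."""
    world_size = num_gpus * num_machines
    if requested_norm == "SyncBN" and world_size == 1:
        # Nothing to synchronize across, so plain BatchNorm is enough.
        return "BN"
    return requested_norm


# Hypothetical usage with a config object (not executed here):
#   norm = pick_norm(cfg.MODEL.BASIS_MODULE.NORM, num_gpus=1)
#   cfg.MODEL.BASIS_MODULE.NORM = norm
#   cfg.MODEL.CONDINST.MASK_BRANCH.NORM = norm
print(pick_norm("SyncBN", num_gpus=1))
print(pick_norm("SyncBN", num_gpus=4))
```

With more than one GPU the requested norm is returned unchanged, so multi-GPU runs keep SyncBN.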
When training BlendMask with gpu_num=1 and torch version 1.6:
In adet/layers/conv_with_kaiming_uniform.py, line 44 calls get_norm(norm, out_channels). In this function, when env.TORCH_VERSION > (1, 5), as in my case with torch 1.6, nn.SyncBatchNorm is used.
However, when gpu_num == 1 and num_machines == 1, world_size == 1 in detectron2/engine/launch.py (line 41). As a result, mp.spawn() (line 55) is not called, the function _distributed_worker() is not executed, and neither is dist.init_process_group() (line 71).
Looking back at nn.SyncBatchNorm: when it is used, it runs _check_default_pg() to check whether the default ProcessGroup has been initialized, and without dist.init_process_group() the check does not pass. This causes the error:
AssertionError: Default process group is not initialized
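To make that chain of events easier to follow, here is a simplified, self-contained model of the control flow. It does not import torch or detectron2; every function below is a stand-in I wrote for illustration, so treat it as a sketch of the logic rather than the real code.

```python
_default_pg = None  # stand-in for torch.distributed's default process group


def init_process_group():
    """Stand-in for dist.init_process_group()."""
    global _default_pg
    _default_pg = object()


def check_default_pg():
    """Stand-in for the check that SyncBatchNorm triggers in torch 1.6."""
    assert _default_pg is not None, "Default process group is not initialized"


def sync_bn_forward():
    """Stand-in for a forward pass through a SyncBatchNorm layer."""
    check_default_pg()
    return "ok"


def launch(main_func, num_gpus_per_machine, num_machines=1):
    """Stand-in for the launcher: only the multi-process path sets up a group."""
    world_size = num_machines * num_gpus_per_machine
    if world_size > 1:
        # The real launcher spawns one worker per GPU; each worker
        # initializes the process group before running main_func.
        init_process_group()
        return main_func()
    # Single-process path: main_func runs directly, no process group exists.
    return main_func()


def try_training(num_gpus):
    """Reset the stand-in state, run one 'training step', report the outcome."""
    global _default_pg
    _default_pg = None
    try:
        return launch(sync_bn_forward, num_gpus)
    except AssertionError as exc:
        return f"AssertionError: {exc}"


print(try_training(1))
print(try_training(2))
```

In this model the single-GPU call fails with the same assertion message, while the two-GPU call succeeds because the group is initialized first.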
Simply changing (1, 5) to (1, 6) in detectron2/layers/batch_norm.py line 143 solves the problem temporarily, but it is not a good fix. I was not sure whether to report this problem to AdelaiDet or to Detectron2; since I ran into it while training BlendMask, I decided to report it here.
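A small sketch of why moving the version threshold is only a stopgap. The two functions below are hypothetical stand-ins that model the version comparison as a plain tuple comparison; they are not the actual detectron2 code.

```python
def parse_version(version_string):
    """Turn a string such as '1.6.0' into a (major, minor) tuple."""
    return tuple(int(part) for part in version_string.split(".")[:2])


def sync_bn_class_name(torch_version, threshold=(1, 5)):
    """Model of the selection: versions above the threshold get nn.SyncBatchNorm."""
    if torch_version > threshold:
        return "nn.SyncBatchNorm"
    return "NaiveSyncBatchNorm"


for version in ("1.5.0", "1.6.0", "1.7.1"):
    parsed = parse_version(version)
    print(version, sync_bn_class_name(parsed), sync_bn_class_name(parsed, threshold=(1, 6)))
```

Raising the threshold to (1, 6) moves torch 1.6 back to the naive implementation, but under this model any newer version would still select nn.SyncBatchNorm and hit the same assertion on a single GPU.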