chensnathan / YOLOF

You Only Look One-level Feature (YOLOF), CVPR2021, Detectron2
MIT License

Default process group is not initialized #10

Closed Ilin3170 closed 3 years ago

Ilin3170 commented 3 years ago

[04/02 17:29:16 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  backbone.classifier.{bias, weight}
[04/02 17:29:16 d2.engine.train_loop]: Starting training from iteration 0
ERROR [04/02 17:29:18 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/wangyixin/detectron2/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/wangyixin/detectron2/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/wangyixin/detectron2/detectron2/engine/train_loop.py", line 232, in run_step
    loss_dict = self.model(data)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyixin/YOLOF/yolof/modeling/yolof.py", line 273, in forward
    features = self.backbone(images.tensor)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyixin/YOLOF/yolof/modeling/backbone/darknet.py", line 368, in forward
    x = self.bn1(x)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
[04/02 17:29:18 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[04/02 17:29:18 d2.utils.events]: iter: 0 lr: N/A max_mem: 1401M
Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/wangyixin/detectron2/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "./tools/train_net.py", line 221, in main
    return trainer.train()
  File "/home/wangyixin/detectron2/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/wangyixin/detectron2/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/wangyixin/detectron2/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/wangyixin/detectron2/detectron2/engine/train_loop.py", line 232, in run_step
    loss_dict = self.model(data)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyixin/YOLOF/yolof/modeling/yolof.py", line 273, in forward
    features = self.backbone(images.tensor)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyixin/YOLOF/yolof/modeling/backbone/darknet.py", line 368, in forward
    x = self.bn1(x)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
    return _get_group_size(group)
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

Ilin3170 commented 3 years ago

Running

python ./tools/train_net.py --config-file ./configs/yolof_CSP_D_53_DC5_3x.yaml

fails with this error, while running

python ./tools/train_net.py --config-file ./configs/yolof_R_50_C5_1x.yaml

works without any problem. What is going wrong?

chensnathan commented 3 years ago

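yolof_CSP_D_53_DC5_3x uses SyncBN, while yolof_R_50_C5_1x uses regular BN. SyncBN queries the distributed process group in its forward pass, and that group is never initialized in a single-GPU run, hence the assertion.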

You can run yolof_CSP_D_53_DC5_3x with multiple GPUs or replace all SyncBN with BN.
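
For the second option, a minimal sketch of swapping the layers after the model is built might look like the following. It assumes the SyncBN layers are plain torch.nn.SyncBatchNorm instances, which is what the traceback suggests, since the failure is in torch/nn/modules/batchnorm.py. The helper name syncbn_to_bn is hypothetical and not part of YOLOF or Detectron2; changing the norm type in the config, if the config exposes it, would likely be cleaner.

```python
import torch
from torch import nn


def syncbn_to_bn(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.SyncBatchNorm with an equivalent nn.BatchNorm2d.

    Hypothetical helper for illustration; not part of YOLOF or Detectron2.
    """
    if isinstance(module, nn.SyncBatchNorm):
        bn = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        # Both layer types use the same parameter and buffer names,
        # so learned weights and running statistics carry over unchanged.
        bn.load_state_dict(module.state_dict())
        return bn
    for name, child in module.named_children():
        # Re-register each child under its original name, converted if needed.
        setattr(module, name, syncbn_to_bn(child))
    return module


if __name__ == "__main__":
    # Toy stand-in for a backbone stem: conv followed by SyncBN.
    toy = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.SyncBatchNorm(8))
    toy = syncbn_to_bn(toy)
    # The forward pass now runs on a single device without a process group.
    toy(torch.randn(2, 3, 16, 16))
    print(toy)
```

In a training script this would be applied to the model right after it is constructed and before the optimizer is created, so that the optimizer sees the new parameters. Because the parameter names are unchanged, existing checkpoints should still load. For the first option, Detectron2's standard argument parser accepts --num-gpus, which tools/train_net.py appears to use.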