Training error - Githubissues

DG205071 commented 4 years ago

(A) ff@ff:~/DG/AerialDetection$ bash ./tools/dist_train.sh configs/DOTA/faster_rcnn_h-obb_r50_fpn_1x_dota.py 2
2020-08-07 11:15:06,651 - INFO - Distributed training: True
2020-08-07 11:15:06,978 - INFO - load model from: modelzoo://resnet50
/home/ff/anaconda3/envs/A/lib/python3.7/site-packages/mmcv/runner/checkpoint.py:145: UserWarning: The URL scheme of "modelzoo://" is deprecated, please use "torchvision://" instead
  warnings.warn('The URL scheme of "modelzoo://" is deprecated, please '
/home/ff/anaconda3/envs/A/lib/python3.7/site-packages/mmcv/runner/checkpoint.py:145: UserWarning: The URL scheme of "modelzoo://" is deprecated, please use "torchvision://" instead
  warnings.warn('The URL scheme of "modelzoo://" is deprecated, please '
2020-08-07 11:15:09,074 - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

loading annotations into memory...
loading annotations into memory...
Done (t=0.98s)
creating index...
Done (t=1.00s)
creating index...
index created!
index created!
Traceback (most recent call last):
  File "./tools/train.py", line 95, in <module>
    main()
  File "./tools/train.py", line 91, in main
    logger=logger)
  File "/home/ff/DG/AerialDetection/mmdet/apis/train.py", line 59, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/home/ff/DG/AerialDetection/mmdet/apis/train.py", line 144, in _dist_train
    model = MMDistributedDataParallel(model.cuda())
  File "/home/ff/anaconda3/envs/A/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 288, in __init__
    self._ddp_init_helper()
  File "/home/ff/anaconda3/envs/A/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 306, in _ddp_init_helper
    self._module_copies = replicate(self.module, self.device_ids, detach=True)
  File "/home/ff/anaconda3/envs/A/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 97, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/home/ff/anaconda3/envs/A/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 76, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/home/ff/anaconda3/envs/A/lib/python3.7/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]
2020-08-07 11:15:12,288 - INFO - Start running, host: ff@ff, work_dir: /home/ff/DG/AerialDetection/work_dirs/faster_rcnn_h-obb_r50_fpn_1x_dota
2020-08-07 11:15:12,288 - INFO - workflow: [('train', 1)], max: 12 epochs

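Training hangs at this point and makes no further progress. It runs fine on a machine with a single 1060 GPU, but this problem appears on a machine with two 2070 GPUs.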
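The traceback ends in `replicate`, which suggests the wrapper was handed more than one device while the model's parameters sat on a GPU other than `devices[0]`. A common PyTorch pattern, sketched here as an assumption and not confirmed as the fix for this repository, is to bind each process to one GPU and pass a single-element `device_ids`. The helper names `device_ids_for_rank` and `wrap_model` are hypothetical, and the sketch uses plain `torch.nn.parallel.DistributedDataParallel` rather than the `MMDistributedDataParallel` seen in the traceback.

```python
def device_ids_for_rank(local_rank):
    """Return the single-GPU device list for one training process.

    Hypothetical helper: one process per GPU, so the list has one entry.
    """
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return [local_rank]


def wrap_model(model, local_rank):
    """Wrap `model` so that this process only touches its own GPU.

    Assumes torch.distributed.init_process_group() has already been
    called by the launcher before this function runs.
    """
    # Imported lazily so device_ids_for_rank() works without PyTorch.
    import torch
    from torch.nn.parallel import DistributedDataParallel

    torch.cuda.set_device(local_rank)  # make this GPU the process default
    model = model.cuda(local_rank)     # parameters now live on this GPU
    ids = device_ids_for_rank(local_rank)
    # With a single device id, devices[0] is the GPU holding the parameters.
    return DistributedDataParallel(model, device_ids=ids, output_device=ids[0])
```

Whether this applies here depends on the installed mmcv and PyTorch versions, which the log does not state.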

BlackHandguy commented 3 years ago

Hello, I have run into the same problem. Have you found out how to solve it?

DingShengLin commented 3 years ago

Have you solved this problem?

BlackHandguy commented 3 years ago

Have you solved this problem?

no

Complicateddd commented 3 years ago

pip install mmcv==0.2.16 may solve this.
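Before and after pinning, it can help to confirm which mmcv version the active environment actually has. The following is a minimal sketch; `installed_version` and `matches_pin` are hypothetical helper names, and it assumes Python 3.8 or newer for `importlib.metadata` (the environment in the log is Python 3.7, where `pip show mmcv` gives the same information).

```python
from importlib import metadata

PINNED = "0.2.16"  # version suggested in this thread


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned=PINNED):
    """True only when a version is installed and equals the pinned one."""
    return installed is not None and installed == pinned


if __name__ == "__main__":
    found = installed_version("mmcv")
    if matches_pin(found):
        print("mmcv", found, "matches the suggested version")
    else:
        print("mmcv is", found, "but the thread suggests", PINNED)
```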

dingjiansw101 / AerialDetection

Training error #32