Error when training with solov2_light_512_dcn_r50_fpn_8gpu_3x.py!

WYQ-Github commented 3 years ago

Hello author, my environment is Ubuntu 20.04, torch=1.8, python=3.7, mmcv=0.2.16, mmdet=1.0.0. Training with the config file solov2_light_512_dcn_r50_fpn_8gpu_3x.py produces the error below; the other config files work fine.

Traceback (most recent call last):
  File "tools/train.py", line 125, in <module>
    main()
  File "tools/train.py", line 121, in main
    timestamp=timestamp)
  File "/home/yuqing/桌面/program/SOLO/SOLO-master/mmdet/apis/train.py", line 111, in train_detector
    timestamp=timestamp)
  File "/home/yuqing/桌面/program/SOLO/SOLO-master/mmdet/apis/train.py", line 297, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/yuqing/anaconda3/envs/solo/lib/python3.7/site-packages/mmcv/runner/runner.py", line 364, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/yuqing/anaconda3/envs/solo/lib/python3.7/site-packages/mmcv/runner/runner.py", line 275, in train
    self.call_hook('after_train_iter')
  File "/home/yuqing/anaconda3/envs/solo/lib/python3.7/site-packages/mmcv/runner/runner.py", line 231, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/yuqing/anaconda3/envs/solo/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 18, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/yuqing/anaconda3/envs/solo/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/yuqing/anaconda3/envs/solo/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
  File "/home/yuqing/anaconda3/envs/solo/lib/python3.7/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/home/yuqing/anaconda3/envs/solo/lib/python3.7/site-packages/torch/autograd/function.py", line 210, in wrapper
    outputs = fn(ctx, *args)
  File "/home/yuqing/桌面/program/SOLO/SOLO-master/mmdet/ops/dcn/deform_conv.py", line 92, in backward
    cur_im2col_step)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

wolfworld6 commented 2 years ago

Did you solve this error?

lin0ww commented 2 years ago

Have you solved this problem? If so, how did you solve it?

DHNicoles commented 2 years ago

Solved! In mmdet/ops/dcn/src/deform_conv_cuda.cpp#L433, change

gradOutputBuffer = gradOutputBuffer.view({batchSize / im2col_step, nOutputPlane, im2col_step * outputHeight, outputWidth});

to:

gradOutputBuffer = gradOutputBuffer.contiguous().view({batchSize / im2col_step, nOutputPlane, im2col_step * outputHeight, outputWidth});
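
For readers wondering why adding .contiguous() before .view() can help: the sketch below is a simplified pure-Python model of tensor strides, not PyTorch's actual implementation. The helper names (contiguous_strides, is_contiguous, permute) are made up for illustration. It shows that reordering dimensions without copying leaves a layout that is no longer densely packed, and that a copy into row-major order (which is what .contiguous() does) restores one. PyTorch's view() also accepts some non-contiguous layouts, so treat contiguity here as a sufficient condition only.

```python
def contiguous_strides(shape):
    """Row-major (C-order) strides, in elements, for a tensor of this shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True if (shape, strides) describes a densely packed row-major layout."""
    expected = contiguous_strides(shape)
    # Dimensions of size 1 never affect addressing, so their stride is ignored.
    return all(size == 1 or s == e
               for size, s, e in zip(shape, strides, expected))


def permute(shape, strides, order):
    """Reorder dimensions without moving data (hypothetical helper)."""
    return (tuple(shape[i] for i in order),
            tuple(strides[i] for i in order))


shape = (2, 3, 4)
strides = contiguous_strides(shape)

# Swapping the last two dimensions changes shape and strides, not the data.
p_shape, p_strides = permute(shape, strides, (0, 2, 1))

print(is_contiguous(shape, strides))      # original layout
print(is_contiguous(p_shape, p_strides))  # permuted layout
# A copy into row-major order gets fresh strides for the new shape.
print(is_contiguous(p_shape, contiguous_strides(p_shape)))
```

In the fix above, gradOutputBuffer presumably reaches the view() call in a layout like the permuted one, which would explain why the extra .contiguous() copy makes the call succeed.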

Wong-denis commented 1 year ago

Thank you for the fix. However, I still get the same RuntimeError as in the first post after changing gradOutputBuffer.view to gradOutputBuffer.contiguous().view. Do you have any idea what else might fix this issue?

WXinlong / SOLO

Error when training with solov2_light_512_dcn_r50_fpn_8gpu_3x.py! #174