chengdazhi / Deformable-Convolution-V2-PyTorch

Deformable ConvNets V2 (DCNv2) in PyTorch
MIT License
1.44k stars, 229 forks

RuntimeError: batch % im2col_step_ == 0 ASSERT FAILED #23

Open gzhcv opened 5 years ago

gzhcv commented 5 years ago

Environment:

The following error occurs when batch_size is 128, 192, or 256. With batch_size = 64 everything works.

File "/home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/DCN-1.0-py3.6-linux-x86_64.egg/modules/deform_conv.py", line 62, in forward
    self.im2col_step)
  File "/home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/DCN-1.0-py3.6-linux-x86_64.egg/functions/deform_conv_func.py", line 34, in forward
    ctx.im2col_step)

RuntimeError: batch % im2col_step_ == 0 ASSERT FAILED at /home/gzh/ocr/Deformable-Convolution-V2-PyTorch-pytorch_1.0.0/src/cuda/deform_conv_cuda.cu:58, please report a bug to PyTorch. batch(%d) must divide im2col_step(%d)11764 (deform_conv_cuda_forward at /home/gzh/ocr/Deformable-Convolution-V2-PyTorch-pytorch_1.0.0/src/cuda/deform_conv_cuda.cu:58)

frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f49387f4fe1 in /home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f49387f4dfa in /home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: deform_conv_cuda_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int, int, int) + 0x909 (0x7f493475a884 in /home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/DCN-1.0-py3.6-linux-x86_64.egg/DCN.cpython-36m-x86_64-linux-gnu.so)
frame #3: deform_conv_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int, int, int) + 0x79 (0x7f493473fb89 in /home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/DCN-1.0-py3.6-linux-x86_64.egg/DCN.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x2dc97 (0x7f493474cc97 in /home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/DCN-1.0-py3.6-linux-x86_64.egg/DCN.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x2dd3e (0x7f493474cd3e in /home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/DCN-1.0-py3.6-linux-x86_64.egg/DCN.cpython-36m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x2a209 (0x7f4934749209 in /home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/DCN-1.0-py3.6-linux-x86_64.egg/DCN.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #14: THPFunction_apply(_object*, _object*) + 0x581 (0x7f4972b264d1 in /home/gzh/SoftWare/tf1.4/anaconda2/envs/python36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

I found that im2col_step = 1 by default, so it is strange that batch % im2col_step_ != 0, as the error reports. Why does this happen?

Ien001 commented 5 years ago

Hi,

I also encountered this bug. It seems that some details in the "DCN" lib are not handled well enough to avoid this problem, so at the last step of the training process the batch cannot be divided evenly.

My workaround is to set batch_size = 1. It works, but training becomes very slow.
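The "last step" explanation above can be checked with plain arithmetic. The sketch below uses a hypothetical helper, `batch_sizes`, which is not part of the DCN library, and a made-up dataset size of 1000 samples. It lists the batch sizes one epoch would produce, so the size of the final batch can be compared against im2col_step.

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """Return the size of every batch produced in one epoch.

    Hypothetical helper for illustration; it is not part of the DCN library.
    """
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # final, smaller batch
    return sizes


# Made-up numbers: 1000 samples, batch_size = 128, im2col_step = 64.
sizes = batch_sizes(1000, 128)
last = sizes[-1]
print(last)       # size of the final batch
print(last % 64)  # non-zero means the final batch is not a multiple of 64

# Dropping the incomplete batch leaves only full batches.
print(set(batch_sizes(1000, 128, drop_last=True)))
```

If this is indeed the cause, one option that may avoid it without falling back to batch_size = 1 is to discard the incomplete final batch, for example through the `drop_last` argument of PyTorch's `DataLoader`.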

@chengdazhi I hope you can fix this bug; it shouldn't take much time.

Thanks! Ian

heartInsert commented 5 years ago

Yes, when I set batch_size to 64, the error does not occur.

But sometimes, once my model reaches about 22% accuracy, the loss starts increasing instead of decreasing, as if training were going in the opposite direction. It's very strange.

JeffWang987 commented 2 years ago

Hi, I came across the same problem, but it does not seem to be a bug, because "im2col_step" is a parameter, as shown below:

image

We can always choose a suitable "im2col_step" that ensures "batch % im2col_step == 0".
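One way to pick such a value is to take the largest divisor of the batch size that stays under some cap. The sketch below is only an illustration: `pick_im2col_step` is a hypothetical helper, not part of the DCN library, and the cap of 64 is an arbitrary choice.

```python
def pick_im2col_step(batch, max_step=64):
    """Return the largest divisor of `batch` that does not exceed `max_step`.

    Hypothetical helper; `max_step=64` is an arbitrary cap chosen for illustration.
    """
    if batch < 1 or max_step < 1:
        raise ValueError("batch and max_step must be positive")
    # Search downwards; step = 1 always divides, so the loop always returns.
    for step in range(min(batch, max_step), 0, -1):
        if batch % step == 0:
            return step


for batch in (64, 128, 192, 100, 117):
    step = pick_im2col_step(batch)
    assert batch % step == 0
    print(batch, step)
```

The result would then be passed as the `im2col_step` argument when the layer is constructed. Because that argument is fixed at construction while the actual batch size can change at run time (for example in a smaller final batch), a value chosen this way only helps when every batch has the size it was computed for.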

With that, I believe the problem is solved. However, I would like to know how "im2col_step" influences efficiency. Does a bigger value make it faster?
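The thread does not answer this question. As a rough way to reason about it, the sketch below assumes the extension follows the usual im2col pattern: it processes `im2col_step` samples per loop iteration and uses a column buffer of shape (C_in * kH * kW) x (im2col_step * H_out * W_out). That layout, the helper `im2col_cost`, and every number in the example are assumptions for illustration, not measurements of this library.

```python
def im2col_cost(batch, im2col_step, in_channels, kernel_h, kernel_w,
                out_h, out_w, bytes_per_elem=4):
    """Estimate loop iterations and column-buffer size for one forward pass.

    Assumes a (C_in * kH * kW) x (im2col_step * H_out * W_out) column buffer,
    the usual im2col layout; the actual extension may differ.
    """
    if batch % im2col_step != 0:
        raise ValueError("batch must be a multiple of im2col_step")
    iterations = batch // im2col_step
    buffer_elems = in_channels * kernel_h * kernel_w * im2col_step * out_h * out_w
    return iterations, buffer_elems * bytes_per_elem


# Made-up layer: 256 input channels, 3x3 kernel, 32x32 output, float32, batch = 64.
for step in (1, 16, 64):
    iters, nbytes = im2col_cost(64, step, in_channels=256, kernel_h=3,
                                kernel_w=3, out_h=32, out_w=32)
    print(step, iters, nbytes / 2**20)  # step, iterations, buffer size in MiB
```

Under these assumptions, a larger step means fewer iterations, each doing a larger matrix multiplication, at the price of a proportionally larger temporary buffer. Whether that is actually faster likely depends on the GPU and the layer shape, so it would need to be benchmarked.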

Qi-Zhangyang commented 2 years ago

I agree with you. I also wonder how im2col_step influences the results.

Note-Liu commented 1 year ago

I set batch_size = 64 and im2col_step = 64, but I still get the error.