When I use apex on 2080ti, I get the following error, how can I solve it?

yeyuanzheng177 commented 4 years ago

RuntimeError: expected scalar type Float but found Half (data at /usr/local/lib/python3.5/dist-packages/torch/include/ATen/core/TensorMethods.h:1386)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f9f2dbdd441 in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f9f2dbdcd7a in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #2: float* at::Tensor::data() const + 0xcf (0x7f9f1c69fa2f in /home/yyz/bigdisk/CenterNet-master0/src/lib/models/networks/DCNv2/_ext.cpython-35m-x86_64-linux-gnu.so)
frame #3: dcn_v2_cuda_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int) + 0xbc0 (0x7f9f1c6a4b50 in /home/yyz/bigdisk/CenterNet-master0/src/lib/models/networks/DCNv2/_ext.cpython-35m-x86_64-linux-gnu.so)
frame #4: dcn_v2_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int) + 0x8b (0x7f9f1c689c0b in /home/yyz/bigdisk/CenterNet-master0/src/lib/models/networks/DCNv2/_ext.cpython-35m-x86_64-linux-gnu.so)
frame #5: + 0x1f91c (0x7f9f1c69591c in /home/yyz/bigdisk/CenterNet-master0/src/lib/models/networks/DCNv2/_ext.cpython-35m-x86_64-linux-gnu.so)
frame #6: + 0x1f99e (0x7f9f1c69599e in /home/yyz/bigdisk/CenterNet-master0/src/lib/models/networks/DCNv2/_ext.cpython-35m-x86_64-linux-gnu.so)
frame #7: + 0x1cdc0 (0x7f9f1c692dc0 in /home/yyz/bigdisk/CenterNet-master0/src/lib/models/networks/DCNv2/_ext.cpython-35m-x86_64-linux-gnu.so)

frame #11: python3() [0x4e3423]
frame #14: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7f9f2e3bf491 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #18: python3() [0x4e3537]
frame #22: python3() [0x4e3423]
frame #24: python3() [0x4f08be]
frame #26: python3() [0x55fbf6]
frame #30: python3() [0x4e3537]
frame #34: python3() [0x4e3423]
frame #36: python3() [0x4f08be]
frame #38: python3() [0x55fbf6]
frame #42: python3() [0x4e3537]
frame #46: python3() [0x4e3423]
frame #48: python3() [0x4f08be]
frame #50: python3() [0x55fbf6]
frame #54: python3() [0x4e3537]
frame #58: python3() [0x4e3423]
frame #60: python3() [0x4f08be]
frame #62: python3() [0x55fbf6]

ming71 commented 4 years ago

Same error here. Have you worked it out?

JesseYang commented 4 years ago

This adds apex support with level O1, but I got the following error when running it:
RuntimeError: Function _DCNv2Backward returned an invalid gradient at index 1 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor

JesseYang commented 4 years ago

The problem is solved. Besides following the code of https://github.com/lbin/DCNv2, I added the following three lines before the return in the _backward function of _DCNv2 in dcn_v2.py:

grad_input = grad_input.half()
grad_offset = grad_offset.half()
grad_mask = grad_mask.half()
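
A possible variation on the three lines above, as a rough sketch rather than what the repository actually does: instead of hard-coding .half(), cast each gradient to the dtype of the input it belongs to, so the same backward also works when amp is turned off and the inputs are float32. The helper name cast_grads_like and the stand-in tensors are hypothetical; they only illustrate the dtype handling and run on CPU.

```python
import torch


def cast_grads_like(grads, inputs):
    """Cast each gradient to the dtype of its corresponding input.

    Hypothetical helper: autograd expects the gradient returned for an
    input to have that input's dtype, which is what the
    "expected type torch.cuda.HalfTensor" error complains about.
    """
    return tuple(
        g if g is None else g.to(x.dtype)
        for g, x in zip(grads, inputs)
    )


# Stand-ins for (input, offset, mask) as saved in forward. Under amp O1
# some of them may be float16 while others stay float32.
inputs = (
    torch.zeros(2, 3).half(),
    torch.zeros(2, 3).half(),
    torch.zeros(2, 3),
)

# Stand-ins for the float32 gradients produced by the extension's backward.
raw_grads = (torch.ones(2, 3), torch.ones(2, 3), torch.ones(2, 3))

grads = cast_grads_like(raw_grads, inputs)
print([g.dtype for g in grads])
```

In dcn_v2.py this would replace the three .half() calls, under the assumption that the tensors saved in forward are still available in backward.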

zhangjinsong3 commented 4 years ago

@JesseYang Did you modify any other code besides dcn_v2.py? I made those changes but still get the same error as @yeyuanzheng177.

GilbertTam commented 3 years ago

Me too. Does anyone have an update?

steven22tom commented 3 years ago

In dcn_v2.py, "_backend.dcn_v2_forward" and "_backend.dcn_v2_backward" only accept float32 input. So if you use mixed precision (Apex/amp), you should convert the float16 tensors to float32 before calling them, and convert the final output from float32 back to float16. You can consult https://github.com/jasonkena/yolact/tree/amp/external/DCNv2
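
A minimal sketch of that idea, not the code from the linked repository: a custom autograd Function that upcasts to float32 before calling a float32-only kernel and casts the results back afterwards. float32_only_forward and float32_only_backward are hypothetical stand-ins for the two _backend calls (they just multiply elementwise), and the example runs on CPU so the dtype handling can be checked without the CUDA extension.

```python
import torch
from torch.autograd import Function


def float32_only_forward(x, w):
    # Hypothetical stand-in for _backend.dcn_v2_forward: rejects non-float32.
    if x.dtype != torch.float32 or w.dtype != torch.float32:
        raise RuntimeError("expected scalar type Float but found Half")
    return x * w


def float32_only_backward(x, w, grad_out):
    # Hypothetical stand-in for _backend.dcn_v2_backward.
    if grad_out.dtype != torch.float32:
        raise RuntimeError("expected scalar type Float but found Half")
    return grad_out * w, grad_out * x


class Float32Wrapped(Function):
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        # Upcast everything, run the float32-only kernel, cast the result back.
        out = float32_only_forward(x.float(), w.float())
        return out.to(x.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        gx, gw = float32_only_backward(x.float(), w.float(), grad_out.float())
        # Each gradient goes back in the dtype of the input it belongs to.
        return gx.to(x.dtype), gw.to(w.dtype)


x = torch.randn(4).half().requires_grad_()  # float16 activation, as under amp
w = torch.randn(4).requires_grad_()         # float32 parameter
y = Float32Wrapped.apply(x, w)
y.float().sum().backward()
print(y.dtype, x.grad.dtype, w.grad.dtype)
```

The same pattern should carry over to _DCNv2 by casting input, offset, mask, weight and bias in forward and the five gradients in backward, assuming the extension itself is left unchanged.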

ttjjmm commented 3 years ago

@steven22tom Thank you for the hint, it works!

CharlesShang / DCNv2

When I use apex on 2080ti, I get the following error, how can I solve it? #42