shikishima-TasakiLab opened 3 years ago
@shikishima-TasakiLab My compilation fails with the following error:
src/pytorch_wrapper.cpp:12:65: required from here
/usr/include/c++/8/bits/move.h:87:21: error: static assertion failed: template argument substituting _Tp is an lvalue reference type
static_assert(!std::is_lvalue_reference<_Tp>::value, "template argument"
^~~~
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
@d-li14 That error message alone is not enough to determine the cause. To me it looks like you compiled the C++ source code with C compilation settings.
Are you compiling it by running one of the following commands in an environment where PyTorch is installed?
python3 setup.py build
or
python3 setup.py install
@shikishima-TasakiLab Yes, I am running python3 setup.py build
ENV: CUDA 11.0, gcc/g++ 8.3.0, pytorch 1.7.1+cu110
@d-li14 After trying it out in various environments, it seems that my implementation only works with the latest PyTorch 1.9.0.
In the following Docker environment, I was able to build.
@shikishima-TasakiLab I see. Since PyTorch 1.9.0 is very new, would it be possible to modify your implementation to be backward compatible with older versions? It would be helpful for people with more common environments.
@d-li14 I'll try.
@d-li14 By modifying some parts of the code, I was able to get my implementation to work with PyTorch 1.7.0 and later.
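One way such a minimum-version requirement can be enforced (a hypothetical sketch; the repository's actual check, if any, may differ) is a small guard that parses the installed torch version string, including local suffixes like "+cu110":

```python
# Hypothetical version guard; the repository's actual check may differ.
import re

MIN_TORCH = (1, 7, 0)

def parse_version(version: str):
    """Extract the numeric (major, minor, patch) prefix from a version
    string such as '1.7.1+cu110'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())

def check_torch_version(version: str) -> bool:
    """Return True if the given PyTorch version meets the minimum."""
    return parse_version(version) >= MIN_TORCH
```

At import time one would call check_torch_version(torch.__version__) and raise a clear error instead of a cryptic build failure.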
@shikishima-TasakiLab Good job. I will retry soon.
Hi, thank you very much for implementing this; it seems to work very well in full-precision mode. However, I get numerical stability issues when using automatic mixed precision (AMP) training: the loss goes to nan within a few steps. I am guessing that the CUDA implementation expects full-precision input, but AMP gives it half precision.
As a quick workaround, I patched _involution2d so I could at least use the rest of my network in mixed precision while using this layer.
from typing import Optional, Tuple, Union

import torch
from torch.nn.modules.utils import _pair

def _involution2d(
    input: torch.Tensor,
    weight: torch.Tensor,
    kernel_size: Union[int, Tuple[int, int]] = 7,
    stride: Union[int, Tuple[int, int]] = 1,
    padding: Union[int, Tuple[int, int]] = 0,
    dilation: Union[int, Tuple[int, int]] = 1,
    groups: int = 1,
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    kernel_size_ = _pair(kernel_size)
    stride_ = _pair(stride)
    padding_ = _pair(padding)
    dilation_ = _pair(dilation)
    # The compiled kernel expects full precision, so cast back from AMP's half.
    if input.dtype == torch.half:
        input = input.float()
    # `ops` is the compiled extension module from this repository.
    output: torch.Tensor = ops.involution.involution2d(
        input, weight, kernel_size_, stride_, padding_, dilation_, groups
    )
    if bias is not None:
        output += bias.view(1, -1, 1, 1)
    return output
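A less invasive alternative, assuming the CUDA op is exposed through a torch.autograd.Function (which may not match this repository's actual layout), is to let AMP perform the cast itself via torch.cuda.amp.custom_fwd(cast_inputs=torch.float32), available since PyTorch 1.6. The class name and the placeholder kernel below are hypothetical:

```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class Involution2dFunction(torch.autograd.Function):
    """Sketch of an autograd wrapper; a real implementation would call the
    compiled CUDA kernels instead of the elementwise placeholder below."""

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # inputs arrive as float32 under autocast
    def forward(ctx, input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(input, weight)
        return input * weight  # placeholder for the involution kernel

    @staticmethod
    @custom_bwd
    def backward(ctx, grad_output: torch.Tensor):
        input, weight = ctx.saved_tensors
        return grad_output * weight, grad_output * input
```

Inside an autocast region, cast_inputs=torch.float32 casts any half-precision arguments before forward runs, so the manual dtype check becomes unnecessary; outside autocast the decorator is a no-op.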
@shikishima-TasakiLab When I test inference speed with RedNet-101 on a single V100 GPU, your CUDA implementation seems to be slower: its throughput is 523 images/s, while our official implementation reaches 668 images/s (batch size 256). I wonder why this differs from the single-op benchmark on a 2080 Ti that you reported.
close #44