d-li14 / involution

[CVPR 2021] Involution: Inverting the Inherence of Convolution for Visual Recognition, a brand new neural operator
https://arxiv.org/abs/2103.06255
MIT License

Fast and generic implementation using OpenMP and CUDA #45

Open shikishima-TasakiLab opened 3 years ago

shikishima-TasakiLab commented 3 years ago

close #44

d-li14 commented 3 years ago

@shikishima-TasakiLab My compilation fails with the following error:

src/pytorch_wrapper.cpp:12:65:   required from here
/usr/include/c++/8/bits/move.h:87:21: error: static assertion failed: template argument substituting _Tp is an lvalue reference type
       static_assert(!std::is_lvalue_reference<_Tp>::value, "template argument"
                     ^~~~
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
shikishima-TasakiLab commented 3 years ago

@d-li14 That error message alone is not enough to determine the cause. It looks to me like you tried to compile the C++ source code with C compiler settings.

Are you compiling in the environment where PyTorch is installed, using one of the following commands?

python3 setup.py build

or

python3 setup.py install
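
For reference, a quick sanity check (a generic snippet, not code from this repo) that the build environment actually sees the intended PyTorch and CUDA toolkit:

import torch

print(torch.__version__)          # e.g. 1.7.1+cu110
print(torch.version.cuda)         # CUDA toolkit PyTorch was built against
print(torch.cuda.is_available())  # True if a GPU is visible
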
d-li14 commented 3 years ago

@shikishima-TasakiLab Yes, I am running python3 setup.py build. Environment: CUDA 11.0, gcc/g++ 8.3.0, PyTorch 1.7.1+cu110.

shikishima-TasakiLab commented 3 years ago

@d-li14 After trying it in various environments, it seems that my implementation only works with the latest PyTorch, 1.9.0.

In the following Docker environment, I was able to build.

d-li14 commented 3 years ago

@shikishima-TasakiLab I see. Since PyTorch 1.9.0 is quite new, would it be possible to modify your implementation for backward compatibility? It would be helpful for people with more common setups.

shikishima-TasakiLab commented 3 years ago

@d-li14 I'll try.

shikishima-TasakiLab commented 3 years ago

@d-li14 By modifying some parts of the code, I was able to get my implementation to work with PyTorch 1.7.0 and later.
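
For users pinning dependencies, a minimal illustrative guard (the 1.7.0 floor comes from this thread; the check itself is a sketch, not code from the repo):

import torch

# Parse e.g. "1.7.1+cu110" -> (1, 7) and fail early on unsupported versions.
major, minor = (int(v) for v in torch.__version__.split("+")[0].split(".")[:2])
if (major, minor) < (1, 7):
    raise RuntimeError(f"This involution build needs PyTorch >= 1.7.0, found {torch.__version__}")
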

d-li14 commented 3 years ago

@shikishima-TasakiLab Good job. I will retry soon.

csvance commented 3 years ago

Hi, thank you very much for implementing this; it seems to work very well in full-precision mode. However, I get numerical stability issues when using automatic mixed precision (AMP) training (the loss goes to NaN within a few steps). I am guessing that the CUDA implementation expects a full-precision input, but AMP gives it half precision.

As a quick workaround, I made a patch to _involution2d so I could at least use mixed precision for the rest of my network while still using this op:

from typing import Optional, Tuple, Union

import torch
from torch.nn.modules.utils import _pair

# `ops.involution.involution2d` is the compiled extension from this
# implementation; import `ops` according to the package layout.

def _involution2d(
    input: torch.Tensor,
    weight: torch.Tensor,
    kernel_size: Union[int, Tuple[int, int]] = 7,
    stride: Union[int, Tuple[int, int]] = 1,
    padding: Union[int, Tuple[int, int]] = 0,
    dilation: Union[int, Tuple[int, int]] = 1,
    groups: int = 1,
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    kernel_size_ = _pair(kernel_size)
    stride_ = _pair(stride)
    padding_ = _pair(padding)
    dilation_ = _pair(dilation)

    # Workaround: the CUDA kernel expects full precision, so promote the
    # half-precision input produced by AMP autocast before calling it.
    if input.dtype == torch.half:
        input = input.float()
    output: torch.Tensor = ops.involution.involution2d(
        input, weight, kernel_size_, stride_, padding_, dilation_, groups)

    if bias is not None:
        output += bias.view(1, -1, 1, 1)

    return output
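
An alternative, if the op is wrapped in a torch.autograd.Function: PyTorch's torch.cuda.amp.custom_fwd helper can perform the same cast automatically under autocast. A minimal forward-only sketch, assuming the same ops.involution.involution2d entry point as in the patch above (the class name and fixed hyperparameters are illustrative):

import torch
from torch.cuda.amp import custom_fwd

class Involution2dFP32(torch.autograd.Function):
    # Under autocast, cast_inputs=torch.float32 makes AMP hand the kernel
    # full-precision tensors, so no manual .float() call is needed.
    # Forward-only sketch; a real version would also define backward().
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        return ops.involution.involution2d(
            input, weight, (7, 7), (1, 1), (3, 3), (1, 1), 1)
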
d-li14 commented 3 years ago

@shikishima-TasakiLab When I test inference speed with RedNet-101 on a single V100 GPU, your CUDA implementation seems to be slower: its throughput is 523 images/s, while our official implementation reaches 668 images/s (batch size 256). I wonder why this differs from the single-op benchmark on a 2080 Ti that you reported.
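
For anyone reproducing the comparison, a rough throughput harness along the lines of the numbers quoted above (model construction is assumed; the measurement loop is a generic sketch):

import time
import torch

@torch.no_grad()
def throughput(model: torch.nn.Module, batch_size: int = 256, iters: int = 50) -> float:
    # Images/s on a single GPU: warm up, then time synchronized forward passes.
    model.eval().cuda()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    for _ in range(10):  # warm-up so CUDA kernels are compiled and cached
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
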