facebookresearch / DeltaCNN

DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos

About resnet50 #3

Closed CheungBH closed 1 year ago

CheungBH commented 1 year ago

Hello. Would you release the resnet50 inference demo?

CheungBH commented 1 year ago

I am trying to build a ResNet implementation based on your code. It is almost finished. However, I get a CUDA error:

_File "/media/sda1/baoheng/DeltaCNN/example/resnet_deltacnn.py", line 282, in _forward_impl x = self.layer1(x) File "/home/baoheng/anaconda3/envs/delta/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl return forward_call(*input, *kwargs) File "/home/baoheng/anaconda3/envs/delta/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward input = module(input) File "/home/baoheng/anaconda3/envs/delta/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl return forward_call(input, kwargs) File "/media/sda1/baoheng/DeltaCNN/example/resnet_deltacnn.py", line 145, in forward out = self.bn1(out) File "/home/baoheng/anaconda3/envs/delta/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl return forward_call(*input, *kwargs) File "/home/baoheng/.local/lib/python3.8/site-packages/torchdeltacnn-0.0.0-py3.8-linux-x86_64.egg/deltacnn/sparse_layers.py", line 1556, in forward self.convert_to_scale_offset(input) File "/home/baoheng/.local/lib/python3.8/site-packages/torchdeltacnn-0.0.0-py3.8-linux-x86_64.egg/deltacnn/sparse_layers.py", line 1534, in convert_to_scale_offset bn_scale = self.weight torch.rsqrt(self.runningvar + self.eps) RuntimeError: CUDA error: an illegal memory access was encountered**

Do you have any ideas about it? It occurs when calculating:

    self.running_var + self.eps

dabeschte commented 1 year ago

Hi! Sorry, but I only have a special version of ResNet available in which all convolutions are fused with batchnorms. To use it, you would have to fuse the weights (see the sketch below).
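For reference, folding a batchnorm into the preceding convolution is just a rescaling of the weights and bias. This is a generic, untested sketch of that standard trick, not the fused ResNet version I mentioned:

import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # w' = w * gamma / sqrt(var + eps); b' = (b - mean) * gamma / sqrt(var + eps) + beta
    scale = bn.weight * torch.rsqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused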

But ResNet should hopefully not be too difficult to implement, and I am happy to help you. This exception does not necessarily originate from the call where it was raised: CUDA errors are reported asynchronously, so the error very likely happened earlier, for example because of a missing conversion of the weights (see readme) or a missing conversion of the memory layout to channels last for either the input or the model. To find out where the error is coming from, you can add "torch.cuda.synchronize()" calls after every layer, as sketched below. I assume the error happens somewhere in the first convolutional layer. If you share your code, I will take a look at it when I find some time.
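A rough, untested sketch of both checks. The layer names conv1/bn1/relu are placeholders for your model, and process_filters() stands in for the weight-conversion step from the readme:

import torch

# 1) conversions that are easy to miss (see readme)
model = model.to(memory_format=torch.channels_last)  # model in channels-last layout
model.process_filters()                              # convert the conv weights (readme step)
x = x.contiguous(memory_format=torch.channels_last)  # input must be channels-last, too

# 2) localize the failing kernel: because CUDA reports errors asynchronously,
# forcing a sync after each layer makes the exception surface at the real culprit
out = model.conv1(x)
torch.cuda.synchronize()
out = model.bn1(out)
torch.cuda.synchronize()
out = model.relu(out)
torch.cuda.synchronize()
# ...repeat after every layer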

CheungBH commented 1 year ago

Hi, thank you for your reply. Here's my code in a zip file. I have checked the weight conversion and the memory layout, and both seem OK as far as I can tell.

resnet_delta.zip

dabeschte commented 1 year ago

I think I found the issues. First of all, and this is a tricky one: every single place where you call an activation function needs a unique instance of DCActivation. The reason for this is that DeltaCNN has to buffer previously accumulated inputs for nonlinear layers like activation and pooling layers in order to work correctly. This means that we need to store the previous inputs at every stage in the neural network where an activation layer is called.

To fix this, you need to use multiple DCActivation objects in your BasicBlock and Bottleneck classes. The second problem is that adding two sparse tensors together requires you to instantiate DCAdd() like any other neural network layer. Please see the example below. Unfortunately, I cannot test the code on the PC I am currently on, so please be aware that there could be some bugs/typos.

The solution for the Bottleneck class should look something like this. Please fix the BasicBlock the same way (see also the sketch after the code).

from typing import Callable, Optional

import torch.nn as nn
# the DC layers are defined in deltacnn/sparse_layers.py (see the traceback above);
# deltacnn_conv1x1/deltacnn_conv3x3 are your DCConv2d-based helpers,
# analogous to torchvision's conv1x1/conv3x3
from deltacnn.sparse_layers import DCActivation, DCAdd, DCBatchNorm2d, DCModule


class DeltaCNN_Bottleneck(DCModule):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.

    expansion: int = 4

    def __init__(
            self,
            inplanes: int,
            planes: int,
            stride: int = 1,
            downsample: Optional[nn.Module] = None,
            groups: int = 1,
            base_width: int = 64,
            dilation: int = 1,
            norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        if norm_layer is None:
            norm_layer = DCBatchNorm2d
        width = int(planes * (base_width / 64.0)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = deltacnn_conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = deltacnn_conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = deltacnn_conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu1 = DCActivation(inplace=True)
        self.relu2 = DCActivation(inplace=True)
        self.relu3 = DCActivation(inplace=True)
        self.dc_add = DCAdd()
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out = self.dc_add(out, identity)
        out = self.relu3(out)
        # meta["block_id"] += 1

        return out
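
For completeness, the BasicBlock should then follow the same pattern. A sketch under the same imports and helper functions as above, so again untested:

class DeltaCNN_BasicBlock(DCModule):
    expansion: int = 1

    def __init__(
            self,
            inplanes: int,
            planes: int,
            stride: int = 1,
            downsample: Optional[nn.Module] = None,
            groups: int = 1,
            base_width: int = 64,
            dilation: int = 1,
            norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        if norm_layer is None:
            norm_layer = DCBatchNorm2d
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = deltacnn_conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.conv2 = deltacnn_conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        # one DCActivation instance per call site, one DCAdd for the residual
        self.relu1 = DCActivation(inplace=True)
        self.relu2 = DCActivation(inplace=True)
        self.dc_add = DCAdd()
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out = self.dc_add(out, identity)
        out = self.relu2(out)

        return out
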
CheungBH commented 1 year ago

Thank you for your reply. It runs successfully now. However, I noticed something while profiling: as the mask gets denser (smaller delta), your MobileNetV2 demo shows little change in speedup (around 1.6-1.9x), whereas with my ResNet the speedup keeps dropping as the mask gets denser. Take ResNet50 as an example: when the active mask ratio is around 10%, there is almost a 2x speedup, but when the active ratio increases to approximately 70%, the model runs even slower than the original one. For BasicBlock architectures like ResNet18 or ResNet34, there is almost no actual speedup regardless of the active mask ratio. Have you seen this phenomenon before?

dabeschte commented 1 year ago

Hey! Difficult to say what exactly might be the problem here. There are many potential causes:

cuDNN should typically not be faster than DeltaCNN, but it can happen. The most likely cause here is that PyTorch automatically converts to TF32, which enables the use of tensor cores. Since we did not implement tensor core inference, this is not a fair comparison, but of course it is faster in practice. You can disable it using:

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
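And when timing, make sure the GPU work has actually finished before reading the clock, since kernel launches return immediately. A minimal sketch using CUDA events (model and x are placeholders for your network and input):

import torch

def time_model(model, x, iters=100):
    # warm-up so lazy initialization and autotuning do not skew the results
    for _ in range(10):
        model(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per forward pass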