DeadAt0m / ActiveSparseShifts-PyTorch

Implementation of Sparse Shift Layer and Active Shift Layer (3D, 4D, 5D tensors) for PyTorch (CPU, GPU)

Experiments on mobileNet, shuffleNet and so on #2

Closed duduheihei closed 4 years ago

duduheihei commented 4 years ago

Thanks for sharing the implementation of SSL (Sparse Shift Layer). However, I have not found any model constructed with SSL. I want to ask a question: can SSL be easily applied to classic models such as mobileNet and shuffleNet? Or can you provide some models constructed with SSL that show satisfactory results?

DeadAt0m commented 4 years ago

Thank you for your interest in this repo.

  1. I am not the author of the article; I just implemented it for my own purposes.
  2. Yes, it can easily be applied to any architecture containing depth-wise convolutions (like the ones you mentioned). MobileNetV1 example: a) take any implementation; b) find the declaration of the depth-wise convolution layer and replace it with Shift2D(inp, init_stride=3, <other args if needed>), and that's all (see the sketch below).

  3. Unfortunately, I cannot provide models with results of successfully applying SSL (because they belong to the company I work for). However, creating such examples is still on the TODO list, and I will add them when I have time.
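
To make step 2 concrete, here is a minimal sketch of such a swap in a MobileNetV1-style block. The block layout and the import path are assumptions for illustration; only the Shift2D(inp, init_stride=3) call itself follows the recipe above:

import torch.nn as nn
from torchshifts import Shift2D  # assumed import path; check the repo README for the exact one

class DWBlock(nn.Module):
    # MobileNetV1-style block: depth-wise 3x3 conv (or a shift layer) + point-wise 1x1 conv
    def __init__(self, inp, oup, use_shift=False):
        super().__init__()
        if use_shift:
            # the depth-wise convolution is replaced by the shift layer
            self.dw = Shift2D(inp, init_stride=3)
        else:
            self.dw = nn.Conv2d(inp, inp, 3, stride=1, padding=1, groups=inp, bias=False)
        self.bn1 = nn.BatchNorm2d(inp)
        self.pw = nn.Conv2d(inp, oup, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(oup)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.dw(x)
        if isinstance(out, tuple):  # with a non-zero sparsity_term the shift layer also returns an L1 term
            out, _ = out
        out = self.act(self.bn1(out))
        return self.act(self.bn2(self.pw(out)))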

Anyway, I would be glad for any contribution to this repo. Thank you!

duduheihei commented 4 years ago

Got it. Thanks a lot! And I will do some experiments on MobileNetV2 and ShuffleNet and discuss the results here.

duduheihei commented 4 years ago

I have replaced the 3x3 conv layers in mobileNetV2 with the ShiftLayer, and the precision is noticeably lower than that of the original model. I tried setting different Shift Layer parameters such as init_stride and active_flag and got slightly better results, but there is still a significant gap between the "shifted mobileNetV2" and the original model. Could you give some advice on this problem?

DeadAt0m commented 4 years ago
  1. Can you share what the model looks like after your changes?
  2. It would also help to share the precision values.
  3. A few notes on the layer parameters:
     a) init_stride is important and should not be less than the kernel_size of the replaced dw conv. It is responsible for initializing the shift sizes of each channel uniformly from [-init_stride, init_stride]. I also think that sometimes such an initialization is not a good solution at all, but you can implement any initialization for the shift weights, because they are accessible directly via the .weight attribute.
     b) active_flag: enables computing the forward pass via bilinear interpolation (as always happens in the backward pass); see this article.
     c) By default the layer uses zero padding, which may also not be a good solution due to information loss. You can consider the 'border', 'reflect', 'symmetric' padding modes.
     d) More important is sparsity_term! If it is not equal to 0., the layer gives two outputs, where the second is the L1 regularization on the shift weights. You can add it to the general loss. IMPORTANT: by default sparsity_term=5e-4, and hence this loss computation occurs (a minimal sketch follows below).
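
A minimal sketch of point (d), assuming the default non-zero sparsity_term is kept so the layer returns a pair (the import path is an assumption; check the repo README):

import torch
import torch.nn as nn
from torchshifts import Shift2D  # assumed import path; check the repo README for the exact one

shift = Shift2D(in_channels=8, init_stride=3)   # default sparsity_term=5e-4, so the layer returns two outputs
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 3))
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 8, 32, 32)                   # toy batch
target = torch.randint(0, 3, (4,))

out, shift_l1 = shift(x)                        # second output: L1 regularization on the shift weights
loss = criterion(head(out), target) + shift_l1  # fold the sparsity term into the general loss
loss.backward()
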
duduheihei commented 4 years ago

Thanks for your advice. Here are the details of my situation:

  1. The backbone is MobileNetV2-0.5. I replaced the 3x3 convolution layers in the LinearBottleneck blocks: when the stride of the 3x3 convolution is 2, I replaced it with 2x2 average pooling; when the stride is 1, I replaced it with a ShiftLayer with init_stride 3.
  2. I trained the model on a non-public dataset with 3 classes in total. The training precision drops from 94% to 88% on average.
  3. I do not use the sparsity term, because I think it would reduce the capability of the network.
Eunhui-Kim commented 4 years ago

@

> Thanks for your advice. Here are the details of my situation:
>
>   1. The backbone is MobileNetV2-0.5. I replaced the 3x3 convolution layers in the LinearBottleneck blocks: when the stride of the 3x3 convolution is 2, I replaced it with 2x2 average pooling; when the stride is 1, I replaced it with a ShiftLayer with init_stride 3.
>   2. I trained the model on a non-public dataset with 3 classes in total. The training precision drops from 94% to 88% on average.
>   3. I do not use the sparsity term, because I think it would reduce the capability of the network.

I think you need to tune the ShiftLayer init_stride value.

I got similar accuracy with this code as with the open-source TensorFlow implementation of active shift.

Eunhui-Kim commented 4 years ago
>   1. Can you share what the model looks like after your changes?
>   2. It would also help to share the precision values.
>   3. A few notes on the layer parameters:
>      a) init_stride is important and should not be less than the kernel_size of the replaced dw conv. It is responsible for initializing the shift sizes of each channel uniformly from [-init_stride, init_stride]. I also think that sometimes such an initialization is not a good solution at all, but you can implement any initialization for the shift weights, because they are accessible directly via the .weight attribute.
>      b) active_flag: enables computing the forward pass via bilinear interpolation (as always happens in the backward pass); see this article.
>      c) By default the layer uses zero padding, which may also not be a good solution due to information loss. You can consider the 'border', 'reflect', 'symmetric' padding modes.
>      d) More important is sparsity_term! If it is not equal to 0., the layer gives two outputs, where the second is the L1 regularization on the shift weights. You can add it to the general loss. IMPORTANT: by default sparsity_term=5e-4, and hence this loss computation occurs.

Thank you for your answer. I was wondering what the returned second value is.

duduheihei commented 4 years ago

@

> I think you need to tune the ShiftLayer init_stride value.
>
> I got similar accuracy with this code as with the open-source TensorFlow implementation of active shift.

Could you tell me which backbone and dataset you use?

duduheihei commented 4 years ago

Here is the definition of the basic LinearBottleneck with the ShiftLayer; I do not change the other modules of MobileNet. I replace the 3x3 convolution layers in the LinearBottleneck blocks: when the stride of the 3x3 convolution is 2, I replace it with 2x2 average pooling; when the stride is 1, I replace it with a ShiftLayer with init_stride 3.

import torch.nn as nn
# Shift2D is the shift layer provided by this repo (see the README for the exact import path).

class LinearBottleneck(nn.Module):
    def __init__(self, inplanes, outplanes, stride=1, t=6, activation=nn.ReLU6):
        super(LinearBottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, inplanes * t, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(inplanes * t)

        if stride != 1:
            self.ave_pool = nn.AvgPool2d(kernel_size=2,stride=stride,padding=1)
        else:
            self.shiftlayer = Shift2D(in_channels=inplanes * t,init_stride=3,active_flag=True)

        self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=stride, bias=False)
        self.bn3 = nn.BatchNorm2d(outplanes)
        self.activation = activation(inplace=True)
        self.stride = stride
        self.t = t
        self.inplanes = inplanes
        self.outplanes = outplanes

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.activation(out)
        if self.stride!=1:
            out = self.ave_pool(out)
        else:
            out,_ = self.shiftlayer(out)  # the second output (the sparsity/L1 term) is discarded here
        out = self.conv3(out)
        out = self.bn3(out)

        if self.stride == 1 and self.inplanes == self.outplanes:
            out += residual
        return out
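
For reference, a quick shape check of this block for the stride=1 path (assuming Shift2D resolves as noted above):

import torch

block = LinearBottleneck(inplanes=16, outplanes=16, stride=1, t=6)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)  # expected: torch.Size([1, 16, 32, 32]) -- the shift layer keeps the spatial size
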
DeadAt0m commented 4 years ago

@duduheihei So, I looked at your code:

  1. self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=stride, bias=False) has stride in its arguments, hence in the case of stride > 1 this convolution downsamples the tensor a second time (by skipping every second element of the input during the convolution).
  2. I do not understand why you replaced the shift with pooling in the case of stride > 1. My vision is the following:

self.bn1 = nn.BatchNorm2d(inplanes * t)
self.shiftlayer = Shift2D(in_channels=inplanes * t,init_stride=3,active_flag=True)
if stride != 1:
   # MaxPool is also good variant here
   self.pool = nn.AvgPool2d(kernel_size=2,stride=stride,padding=1)
self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=1, bias=False)

Or a simpler version, with the stride in the last conv:

self.bn1 = nn.BatchNorm2d(inplanes * t)
self.shiftlayer = Shift2D(in_channels=inplanes * t,init_stride=3,active_flag=True)
self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=stride, bias=False)
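
For completeness, a sketch of how the matching forward pass could look for the first variant (the rest of the LinearBottleneck above stays unchanged; the out, _ unpacking assumes the default non-zero sparsity_term):

def forward(self, x):
    residual = x
    out = self.activation(self.bn1(self.conv1(x)))
    out, _ = self.shiftlayer(out)   # shift is always applied; the L1 term is discarded here
    if self.stride != 1:
        out = self.pool(out)        # spatial downsampling happens after the shift
    out = self.bn3(self.conv3(out))
    if self.stride == 1 and self.inplanes == self.outplanes:
        out += residual
    return out
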
Eunhui-Kim commented 4 years ago

@

> Could you tell me which backbone and dataset you use?

I used the res-IB-SSL architecture proposed in the paper https://arxiv.org/abs/1903.05285, applied to the CIFAR-10 dataset. The ResNet code is from https://github.com/akamaster/pytorch_resnet_cifar10.

duduheihei commented 4 years ago

> @duduheihei So, I looked at your code:
>
>   1. self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=stride, bias=False) has stride in its arguments, hence in the case of stride > 1 this convolution downsamples the tensor a second time (by skipping every second element of the input during the convolution).
>   2. I do not understand why you replaced the shift with pooling in the case of stride > 1. My vision is the following:
>
> self.bn1 = nn.BatchNorm2d(inplanes * t)
> self.shiftlayer = Shift2D(in_channels=inplanes * t,init_stride=3,active_flag=True)
> if stride != 1:
>    # MaxPool is also good variant here
>    self.pool = nn.AvgPool2d(kernel_size=2,stride=stride,padding=1)
> self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=1, bias=False)
>
> Or a simpler version, with the stride in the last conv:
>
> self.bn1 = nn.BatchNorm2d(inplanes * t)
> self.shiftlayer = Shift2D(in_channels=inplanes * t,init_stride=3,active_flag=True)
> self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=stride, bias=False)

I am sorry for my mistake of downsampling twice in the block. The "more simple version" you provide is the same as what I experimented with the first time, but its precision is obviously lower. Therefore I implemented the downsampling by replacing the 1x1 conv with stride 2 with a 2x2 average pooling with stride 2; unfortunately, I forgot to adjust the stride parameter of the 1x1 conv, which made the block downsample twice. Following the version you provide, I have done the experiments again and got satisfactory precision. Here are the two versions that work for me; for simplicity, the code samples do not contain the batchnorm and activation functions:

Version 1:

        self.conv1 = nn.Conv2d(inplanes, inplanes * t, kernel_size=1, bias=False)

        if stride != 1:
            self.ave_pool = nn.AvgPool2d(kernel_size=2,stride=stride,padding=1)
        else:
            self.shiftlayer = Shift2D(in_channels=inplanes * t,init_stride=3,active_flag=True)

        self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=1, bias=False)

Version 2:

        self.conv1 = nn.Conv2d(inplanes, inplanes * t, kernel_size=1, bias=False)
        self.shiftlayer = Shift2D(in_channels=inplanes * t,init_stride=3,active_flag=True)
        if stride != 1:
            self.ave_pool = nn.AvgPool2d(kernel_size=2,stride=stride,padding=1)
        self.conv3 = nn.Conv2d(inplanes * t, outplanes, kernel_size=1, stride=1, bias=False)
duduheihei commented 4 years ago

> cifar10

Thanks for your reply. Now I get satisfactory results on MobileNetV2, and the sample code is shown in the discussion above.