NorbertZheng / read-papers

My paper reading notes.
MIT License

Sik-Ho Tang | Review: NIN -- Network In Network (Image Classification). #86

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: NIN — Network In Network (Image Classification).

NorbertZheng commented 1 year ago

Overview

Using Convolution Layers With 1×1 Convolution Kernels.

image A few example images from the CIFAR10 dataset.

In this story, Network In Network (NIN), by the Graduate School for Integrative Sciences and Engineering and the National University of Singapore, is briefly reviewed.

This is a 2014 ICLR paper with more than 2300 citations.

NorbertZheng commented 1 year ago

Linear Convolutional Layer VS MLP Convolutional Layer

Linear Convolutional Layer

image Linear Convolutional Layer.

$$ f_{i,j,k}=\max(w_{k}^{T}x_{i,j},0). $$

MLP Convolutional Layer

image MLP Convolutional Layer.

image
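
In code, an MLP convolutional layer boils down to a regular spatial convolution followed by 1×1 convolutions, each with a ReLU. Below is a minimal PyTorch sketch of one such mlpconv block; the channel widths (192, 160, 96) and the 5x5 kernel mirror a commonly cited CIFAR-10 configuration, but treat them as illustrative rather than the paper's verified values.

import torch
import torch.nn as nn

def mlpconv(in_ch, n1, n2, n3, kernel_size, padding):
    # One mlpconv block: a spatial convolution followed by two 1x1 convolutions,
    # i.e. a small MLP shared across all spatial locations.
    return nn.Sequential(
        nn.Conv2d(in_ch, n1, kernel_size, padding=padding), nn.ReLU(inplace=True),
        nn.Conv2d(n1, n2, kernel_size=1), nn.ReLU(inplace=True),  # cross-channel mixing
        nn.Conv2d(n2, n3, kernel_size=1), nn.ReLU(inplace=True),  # cross-channel mixing
    )

x = torch.randn(8, 3, 32, 32)                    # a batch of CIFAR-10-sized images
block = mlpconv(3, 192, 160, 96, kernel_size=5, padding=2)
print(block(x).shape)                            # torch.Size([8, 96, 32, 32])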

NorbertZheng commented 1 year ago

Fully Connected Layer VS Global Average Pooling Layer

image An Example of Fully Connected Layer VS Global Average Pooling Layer.

Fully Connected Layer

NorbertZheng commented 1 year ago

The Global Average Pooling layer is used to remove spatial translation along the spatial (convolution) axes, whereas a Conv*D layer with 1×1 kernels pools information along the channel axis!!!
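
To make the contrast concrete, here is a minimal sketch of the two classification heads, assuming 10 classes and a hypothetical (B, C, H, W) = (8, 96, 8, 8) feature map; the sizes are made up for illustration.

import torch
import torch.nn as nn

B, C, H, W = 8, 96, 8, 8
features = torch.randn(B, C, H, W)

# Fully connected head: flatten everything, then a dense layer (many parameters).
fc_head = nn.Sequential(nn.Flatten(), nn.Linear(C * H * W, 10))

# Global average pooling head: a 1x1 convolution produces one feature map per class,
# then each map is averaged down to a single confidence value (no extra parameters).
to_class_maps = nn.Conv2d(C, 10, kernel_size=1)

def gap_head(t):
    # average each class map over its spatial extent -> (B, 10) class confidences
    return to_class_maps(t).mean(dim=(2, 3))

print(fc_head(features).shape)   # torch.Size([8, 10])
print(gap_head(features).shape)  # torch.Size([8, 10])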

NorbertZheng commented 1 year ago

Enforcing correspondences between feature maps and categories means that the feature maps do not have to be combined in a correlated way to produce the final classification, leading to a more disentangled representation!!!

NorbertZheng commented 1 year ago

Overall Structure of Network In Network (NIN)

image Overall Structure of Network In Network (NIN).
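
Putting the pieces together, here is a hedged PyTorch sketch of the overall NIN structure for CIFAR-10-sized inputs: three stacked mlpconv blocks with max pooling and dropout in between, ending in global average pooling. The exact channel counts, kernel sizes, and dropout placement are plausible choices for illustration, not a verified reproduction of the paper's configuration.

import torch
import torch.nn as nn

def mlpconv(in_ch, n1, n2, n3, k, p):
    # Spatial convolution plus two 1x1 convolutions, each followed by a ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, n1, k, padding=p), nn.ReLU(inplace=True),
        nn.Conv2d(n1, n2, 1), nn.ReLU(inplace=True),
        nn.Conv2d(n2, n3, 1), nn.ReLU(inplace=True),
    )

nin = nn.Sequential(
    mlpconv(3, 192, 160, 96, 5, 2),
    nn.MaxPool2d(3, stride=2, padding=1), nn.Dropout(0.5),
    mlpconv(96, 192, 192, 192, 5, 2),
    nn.MaxPool2d(3, stride=2, padding=1), nn.Dropout(0.5),
    mlpconv(192, 192, 192, 10, 3, 1),   # the last 1x1 convolution emits one map per class
    nn.AdaptiveAvgPool2d(1),            # global average pooling: one value per class map
    nn.Flatten(),                       # (B, 10) logits, fed to softmax / cross-entropy
)

print(nin(torch.randn(8, 3, 32, 32)).shape)  # torch.Size([8, 10])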

NorbertZheng commented 1 year ago

Results

CIFAR-10

image Error Rates on CIFAR-10 Test Set.

image

As shown above, introducing dropout layers in between the MLP Convolutional Layers reduced the test error by more than 20% (a relative reduction).

NorbertZheng commented 1 year ago

CIFAR-100

image Error Rates on CIFAR-100 Test Set.

Similarly, NIN + Dropout achieved a 35.68% error rate, which is better than Maxout + Dropout.

NorbertZheng commented 1 year ago

Street View House Numbers (SVHN)

image Error Rates on SVHN Test Set.

However, NIN + Dropout achieved a 2.35% error rate, which is worse than that of DropConnect.

NorbertZheng commented 1 year ago

MNIST

image Error Rates on MNIST Test Set.

On MNIST, NIN + Dropout achieved a 0.47% error rate, which is slightly worse than Maxout + Dropout.

NorbertZheng commented 1 year ago

Global Average Pooling as a Regularizer

image Error Rates on CIFAR-10 Test Set.

With Global Average Pooling, NIN achieved a 10.41% error rate, which is better than the 10.88% of fully connected + dropout.

NorbertZheng commented 1 year ago

In NIN, the 1×1 convolutions introduce more non-linearity, which makes the error rate lower.

NorbertZheng commented 1 year ago

Reference

[2014 ICLR] [NIN] Network In Network.

NorbertZheng commented 1 year ago

1x1 Convolution: Demystified

Anwesh Marwade. 1x1 Convolution: Demystified.

NorbertZheng commented 1 year ago

Overview

Shedding light on the 1x1 convolution operation, which appears in the Network in Network paper by Lin et al. and in Google's Inception architecture.

Having read the Network in Network (NiN) paper by Lin et al. a while ago, I came across an operation the authors called the “cross channel parametric pooling layer” (if I remember correctly), which they compared to convolution with a 1x1 convolutional kernel.

Skimming over the details at the time (as I often do with such esoteric terminology), I never thought I would be writing about this operation, let alone providing my own thoughts on its workings. But as it goes, it's usually the terminology that seems formidable and not so much the concept itself, which is quite useful! Having completed the back-story and a pause for effect, let us demystify this peculiar but multi-purpose 1x1 convolutional layer.

NorbertZheng commented 1 year ago

1x1 Convolution:

As the name suggests, the 1x1 convolution operation involves convolving the input with filters of size 1x1, usually with zero padding (padding=0) and a stride of 1.

Taking an example, let us suppose we have a (general-purpose) convolutional layer which outputs a tensor of shape $(B, K, H, W)$, where $B$ is the batch size, $K$ is the number of filters (channels), and $H$ and $W$ are the spatial dimensions.

In addition, we specify a filter size that we want to work with, which is a single number for a square filter, i.e. size=3 implies a 3x3 filter.

Feeding this tensor into our 1x1 convolution layer with $F$ filters (padding 0 and stride 1), we will get an output of shape $(B, F, H, W)$, changing our filter dimension from $K$ to $F$. Sweet!

Now depending on whether $F$ is less or greater than $K$, we have either decreased or increased the dimensionality of our input in the filter space without applying any spatial transformation (you’ll see). All this using the 1x1 convolution operation!

But wait, how was this any different from a regular convolution operation? In a regular convolution operation, we usually have a larger filter size, say a 3x3 or 5x5 (or even 7x7) kernel, which generally entails some padding of the input and, depending on that padding, transforms the spatial dimensions from $H\times W$ to some $H'\times W'$; capisce? If not, here is the link to my go-to article for any (believe me) clarification on Convolutional Nets and their operations.
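
A quick shape check of the claim above; the batch size, 64 input channels, and 28x28 spatial size are arbitrary choices for illustration.

import torch
import torch.nn as nn

B, K, H, W = 8, 64, 28, 28
x = torch.randn(B, K, H, W)

# 1x1 convolution with F=32 filters: only the channel dimension changes.
conv1x1 = nn.Conv2d(in_channels=K, out_channels=32, kernel_size=1, stride=1, padding=0)
print(conv1x1(x).shape)  # torch.Size([8, 32, 28, 28])

# Regular 3x3 convolution without padding: the spatial dimensions shrink as well.
conv3x3 = nn.Conv2d(in_channels=K, out_channels=32, kernel_size=3, stride=1, padding=0)
print(conv3x3(x).shape)  # torch.Size([8, 32, 26, 26])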

NorbertZheng commented 1 year ago

Benefits?

In CNNs, we often use some kind of pooling operation to down-sample the activation maps and keep the amount of computation tractable.

The danger of intractability comes from the number of generated activation maps, which increases, or rather blows up, dramatically in proportion to the depth of the CNN. That is, the deeper a network, the larger the number of activation maps it generates. The problem is further exacerbated if the convolution operation uses large filters like 5x5 or 7x7, resulting in a significantly higher number of parameters.

Understand: a filter refers to the kernel that is applied over the input in a sliding-window fashion as part of a convolution operation. An activation map, on the other hand, is the output of applying a single filter over the input image. A convolution operation with multiple filters usually generates multiple (stacked) activation maps.

While pooling maintains important spatial features to some extent, there is a trade-off between down-sampling and information loss. Bottom line: we can only apply pooling to a certain extent.

This was heavily used in Google's Inception architecture (link in references):

They introduced the use of 1x1 convolutions to compute reductions before the expensive 3x3 and 5x5 convolutions. Instead of spatial dimensionality reduction using pooling, reduction may be applied in the filter dimension using 1x1 convolutions.
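
As a rough illustration of the savings (the 256 and 64 channel counts are made up, not the actual Inception configuration): reducing 256 channels to 64 with a 1x1 convolution before a 5x5 convolution cuts the parameter count by roughly a factor of four.

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Direct 5x5 convolution on 256 input channels.
direct = nn.Conv2d(256, 256, kernel_size=5, padding=2)

# 1x1 reduction to 64 channels first, then the 5x5 convolution.
reduced = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.Conv2d(64, 256, kernel_size=5, padding=2),
)

print(n_params(direct))   # 1638656
print(n_params(reduced))  # 426304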

An interesting line of thought was provided by Yann LeCun, who views fully connected layers in CNNs as simply convolution layers with 1x1 convolution kernels and a full connection table. See this post from 2015:

NorbertZheng commented 1 year ago

In Convolutional Nets, there is no such thing as "fully-connected layers". There are only convolution layers with 1x1 convolution kernels and a full connection table.

It's a too-rarely-understood fact that ConvNets don't need to have a fixed-size input. You can train them on inputs that happen to produce a single output vector (with no spatial extent), and then apply them to larger images. Instead of a single output vector, you then get a spatial map of output vectors. Each vector sees input windows at different locations on the input.

In that scenario, the "fully connected layers" really act as 1x1 convolutions.
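
This equivalence is easy to verify numerically. Below is a minimal sketch, assuming 64-dimensional feature vectors and 10 outputs: a Linear layer and a 1x1 Conv2d that share the same weights produce identical results.

import torch
import torch.nn as nn

fc = nn.Linear(64, 10)
conv = nn.Conv2d(64, 10, kernel_size=1)

# Copy the fully connected weights into the 1x1 convolution.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(10, 64, 1, 1))
    conv.bias.copy_(fc.bias)

x = torch.randn(8, 64)                               # a batch of 64-dim feature vectors
out_fc = fc(x)                                       # (8, 10)
out_conv = conv(x.view(8, 64, 1, 1)).flatten(1)      # treat each vector as a 1x1 "image"
print(torch.allclose(out_fc, out_conv, atol=1e-6))   # True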

NorbertZheng commented 1 year ago

Implementation

Having talked about the concept, it is time to see some implementation, which is quite easy to follow with some basic PyTorch experience.

import torch
import torch.nn as nn

class OurConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.projection = None
        # Input dims expected (channels first): C x H x W = 3 x 36 x 36

        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)
        # Output dims: C x H x W = 32 x 36 x 36

        # softconv is the 1x1 convolution: the filter dimension goes from 32 -> 1,
        # so the output dims are C x H x W = 1 x 36 x 36
        self.softconv = nn.Conv2d(in_channels=32, out_channels=1, kernel_size=1, stride=1, padding=0)

    def forward(self, input_data):
        # Apply convolution
        x = self.conv1(input_data)
        # Apply tanh activation
        x = torch.tanh(x)
        # Apply the 1x1 convolution
        x = self.softconv(x)
        # Apply sigmoid activation
        x = torch.sigmoid(x)
        # Save and return the result
        self.projection = x
        return x
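
Continuing from the snippet above, a quick sanity check of the shapes (using the 36x36x3 input assumed in the comments):

net = OurConvNet()
dummy = torch.randn(4, 3, 36, 36)   # a batch of 4 RGB 36x36 images
out = net(dummy)
print(out.shape)                    # torch.Size([4, 1, 36, 36])
print(net.projection is out)        # True
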
NorbertZheng commented 1 year ago

Here, I’ve used softconv to denote the 1x1 convolution. This is a code snippet from a recent project, where the 1x1 convolution was used for projecting the information across the filter dimension (32 in this case) and pooling it into a single dimension. This brought three benefits to my use-case:

NorbertZheng commented 1 year ago

To drive home the idea of using a 1x1 convolution, I am providing an example use-case from a model trained on the DEAP emotion dataset, without delving into many details. The model is trained to predict heart rate signals from facial videos (images). Here, the 1x1 convolutional layer pools the information from the 32 filters (obtained from previous convolutions) into a single channel.

This is a work in progress but hopefully one can see the point of using a 1x1 convolution here.

image A facial image being used as input. Image from DEAP dataset.

image State of 32 filters after applying the regular convolutional layers. Image by Author.

image The output as obtained from the softconv layer where 32 filters have been pooled into a single channel. Image by Author.

NorbertZheng commented 1 year ago

Key Takeaways:

NorbertZheng commented 1 year ago

References: