Using Convolution Layers With 1×1 Convolution Kernels. A few example images from the CIFAR10 dataset.
In this story, Network In Network (NIN), by the Graduate School for Integrative Sciences and Engineering and the National University of Singapore, is briefly reviewed.
This is a 2014 ICLR paper with more than 2300 citations.
Linear Convolutional Layer.
$$ f_{i,j,k}=\max(w_{k}^{T}x_{i,j}, 0). $$
MLP Convolutional Layer.
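The mlpconv layer replaces the single linear filter above with a small multilayer perceptron sliding over the input. Following the NIN paper's notation (here with a bias term $b$, and $n$ indexing the layers of the MLP), the computation can be sketched as:

$$ f^{1}_{i,j,k_{1}}=\max\left({w^{1}_{k_{1}}}^{T}x_{i,j}+b_{k_{1}},\,0\right), \qquad \ldots, \qquad f^{n}_{i,j,k_{n}}=\max\left({w^{n}_{k_{n}}}^{T}f^{n-1}_{i,j}+b_{k_{n}},\,0\right). $$

Each layer of the MLP is itself equivalent to a 1×1 convolution followed by a ReLU, which is where the cross channel parametric pooling interpretation comes from.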
An Example of Fully Connected Layer VS Global Average Pooling Layer.
The Global Average Pooling layer removes spatial translation along the spatial axes, while the 1×1 convolutional layer combines information along the channel axis.
Enforcing a correspondence between feature maps and categories means the feature maps do not need to be correlated with one another to contribute to the final classification, which encourages a disentangled representation.
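The difference between the two operations can be seen directly from the output shapes; a minimal PyTorch sketch (the tensor sizes here are illustrative assumptions, not from the paper):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 10, 8, 8)  # (batch, channels, height, width)

# Global Average Pooling collapses the spatial axes: one value per channel.
gap = nn.AdaptiveAvgPool2d(output_size=1)
print(gap(x).shape)      # torch.Size([1, 10, 1, 1])

# A 1x1 convolution mixes information across the channel axis,
# leaving the spatial dimensions untouched.
conv1x1 = nn.Conv2d(in_channels=10, out_channels=4, kernel_size=1)
print(conv1x1(x).shape)  # torch.Size([1, 4, 8, 8])
```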
Overall Structure of Network In Network (NIN).
Error Rates on CIFAR-10 Test Set.
As shown above, introducing dropout layers in between the MLP Convolutional Layers reduced the test error by more than 20%.
Error Rates on CIFAR-100 Test Set.
Similarly, NIN + Dropout achieved only a 35.68% error rate, which is better than Maxout + Dropout.
Error Rates on SVHN Test Set.
However, NIN + Dropout obtained a 2.35% error rate, which is worse than DropConnect.
Error Rates on MNIST Test Set.
On MNIST, NIN + Dropout obtained a 0.47% error rate, slightly worse than Maxout + Dropout.
Error Rates on CIFAR-10 Test Set.
With Global Average Pooling, NIN achieved a 10.41% error rate, better than the 10.88% of a fully connected layer + dropout.
In NIN, the 1×1 convolutions introduce more non-linearity, which lowers the error rate.
[2014 ICLR] [NIN] Network In Network.
Anwesh Marwade. 1x1 Convolution: Demystified.
Shedding light on the concept of 1x1 convolution operation which appears in paper, Network in Network by Lin et al. and Google Inception.
Having read the Network in Network (NiN) paper by Lin et al. a while ago, I came across an operation the authors called the "cross channel parametric pooling layer" (if I remember correctly), which they compared to convolution with a 1x1 convolutional kernel.
Skimming over the details at the time (as I often do with such esoteric terminology), I never thought I would be writing about this operation, let alone providing my own thoughts on its workings. But as it goes, it's usually the terminology that seems formidable, not so much the concept itself, which is quite useful! Having completed the back-story and a pause for effect, let us demystify this peculiar but multi-purpose 1x1 convolutional layer.
As the name suggests, the 1x1 convolution operation involves convolving the input with filters of size 1x1, usually with zero-padding and stride of 1.
Taking an example, let us suppose we have a (general-purpose) convolutional layer which outputs a tensor of shape $(B, K, H, W)$, where $B$ is the batch size, $K$ is the number of filters (channels), and $H$ and $W$ are the spatial height and width.
In addition, we specify a filter size that we want to work with, which is a single number for a square filter i.e. size=3 implies a 3x3 filter.
Feeding this tensor into our 1x1 convolution layer with $F$ filters (zero-padding and stride 1), we will get an output of shape $(B, F, H, W)$ changing our filter dimension from $K$ to $F$. Sweet!
Now depending on whether $F$ is less or greater than $K$, we have either decreased or increased the dimensionality of our input in the filter space without applying any spatial transformation (you’ll see). All this using the 1x1 convolution operation!
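This shape change is easy to verify in PyTorch; a minimal sketch with made-up tensor sizes, showing both a reduction ($F < K$) and an expansion ($F > K$) of the filter dimension:

```python
import torch
import torch.nn as nn

B, K, H, W = 4, 64, 28, 28
x = torch.randn(B, K, H, W)

# Reduce the filter dimension: K=64 -> F=16, spatial dims unchanged.
reduce = nn.Conv2d(in_channels=64, out_channels=16,
                   kernel_size=1, stride=1, padding=0)
print(reduce(x).shape)  # torch.Size([4, 16, 28, 28])

# Expand the filter dimension: K=64 -> F=128, spatial dims unchanged.
expand = nn.Conv2d(in_channels=64, out_channels=128,
                   kernel_size=1, stride=1, padding=0)
print(expand(x).shape)  # torch.Size([4, 128, 28, 28])
```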
But wait, how is this any different from a regular convolution operation? In a regular convolution we usually have a larger filter, say a 3x3, 5x5, or even 7x7 kernel, which generally entails some padding of the input and transforms its spatial dimensions from $H\times W$ to some $H'\times W'$; capisce? If not, here is the link to my go-to article for any (believe me) clarification on Convolutional Nets and their operations.
In CNNs, we often use some kind of pooling operation to down-sample the spatial dimensions of the activation maps and keep the computation tractable.
The danger of intractability comes from the number of generated activation maps, which blows up in proportion to the depth of a CNN: the deeper the network, the more activation maps it generates. The problem is further exacerbated when the convolution uses large filters such as 5x5 or 7x7, resulting in a significantly higher number of parameters.
Understand: a filter refers to the kernel that is applied over the input in a sliding-window fashion as part of a convolution operation. An activation map, on the other hand, is the output of applying a single filter over the input image. A convolution operation with multiple filters generates multiple (stacked) activation maps.
While it maintains important spatial features to some extent, there does exist a trade-off between down-sampling and information loss. Bottom line: we can only apply pooling to a certain extent.
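To see why pooling can only be applied to a certain extent, a small sketch (the input size is an arbitrary choice):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
pool = nn.MaxPool2d(kernel_size=2)

# Each 2x2 max-pool halves the spatial resolution; after five rounds a
# 32x32 input is down to a single pixel, so no further pooling is possible.
for _ in range(5):
    x = pool(x)
    print(tuple(x.shape[-2:]))
# (16, 16) (8, 8) (4, 4) (2, 2) (1, 1)
```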
This was heavily used in Google’s inception architecture (link in references) where they state the following:
One big problem with the above modules, at least in this naive form, is that even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.
This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose.
They introduced the use of 1x1 convolutions to compute reductions before the expensive 3x3 and 5x5 convolutions. Instead of spatial dimensionality reduction using pooling, reduction may be applied in the filter dimension using 1x1 convolutions.
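The savings are easy to quantify by counting parameters. A rough sketch with made-up channel counts (192 input channels, a 1x1 reduction to 16, then a 5x5 convolution to 32; these numbers are illustrative, not taken from the Inception paper):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Naive: a 5x5 convolution straight from 192 to 32 channels.
naive = nn.Conv2d(192, 32, kernel_size=5, padding=2)

# Inception-style: a 1x1 reduction first, then the 5x5 convolution.
reduced = nn.Sequential(
    nn.Conv2d(192, 16, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
)

print(n_params(naive))    # 153632  (192*32*5*5 + 32)
print(n_params(reduced))  # 15920   (192*16 + 16 + 16*32*5*5 + 32)
```

Roughly a tenfold reduction in parameters for the same output shape.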
An interesting line of thought was provided by Yann LeCun, who describes fully connected layers in CNNs as simply convolution layers with 1x1 convolution kernels and a full connection table. See this post from 2015:
In Convolutional Nets, there is no such thing as "fully-connected layers". There are only convolution layers with 1x1 convolution kernels and a full connection table.
It's a too-rarely-understood fact that ConvNets don't need to have a fixed-size input. You can train them on inputs that happen to produce a single output vector (with no spatial extent), and then apply them to larger images. Instead of a single output vector, you then get a spatial map of output vectors. Each vector sees input windows at different locations on the input.
In that scenario, the "fully connected layers" really act as 1x1 convolutions.
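LeCun's equivalence can be checked numerically; a small sketch with arbitrary layer sizes, copying the weights of a fully connected layer into a 1x1 convolution and comparing the outputs:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 1, 1)  # a feature vector with no spatial extent

fc = nn.Linear(64, 10)
conv = nn.Conv2d(64, 10, kernel_size=1)

# Copy the fully connected weights into the 1x1 convolution.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(10, 64, 1, 1))
    conv.bias.copy_(fc.bias)

out_fc = fc(x.flatten(1))      # shape (2, 10)
out_conv = conv(x).flatten(1)  # shape (2, 10)
print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True
```

On a larger input, the same convolution would produce a spatial map of output vectors, exactly as described in the quote.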
Having talked about the concept, it is time to see some implementation, which is quite easy to follow with some basic PyTorch experience.
```python
import torch
import torch.nn as nn

class OurConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.projection = None
        # Input dims expected: HxWxC = 36x36x3
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32,
                               kernel_size=3, stride=1, padding=1)
        # Output dims: HxWxC = 36x36x32
        # softconv is the 1x1 convolution: filter dimensions go from
        # 32 -> 1, so output dims become HxWxC = 36x36x1
        self.softconv = nn.Conv2d(in_channels=32, out_channels=1,
                                  kernel_size=1, stride=1, padding=0)

    def forward(self, input_data):
        # Apply convolution
        x = self.conv1(input_data)
        # Apply tanh activation
        x = torch.tanh(x)
        # Apply the 1x1 convolution
        x = self.softconv(x)
        # Apply sigmoid activation
        x = torch.sigmoid(x)
        # Save the result in projection
        self.projection = x
        return x
```
Here, I've used softconv to denote the 1x1 convolution. This is a code snippet from a recent project, where the 1x1 convolution was used to project the information across the filter dimension (32 in this case) and pool it into a single channel. This brought three benefits to my use-case.
To drive home the idea of using a 1x1 convolution, here is an example use-case from a model trained on the DEAP emotion dataset, without delving into many details. The model is trained to predict heart rate signals from facial videos (images). Here, the information from the 32 filters (obtained from previous convolutions) is pooled into a single channel by the 1x1 convolutional layer.
This is a work in progress but hopefully one can see the point of using a 1x1 convolution here.
A facial image being used as input. Image from DEAP dataset.
State of 32 filters after applying the regular convolutional layers. Image by Author.
The output as obtained from the softconv layer where 32 filters have been pooled into a single channel. Image by Author.
Sik-Ho Tang. Review: NIN — Network In Network (Image Classification).