NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review: ResNet -- Winner of ILSVRC 2015 (Image Classification, Localization, Detection). #101

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: ResNet — Winner of ILSVRC 2015 (Image Classification, Localization, Detection).

NorbertZheng commented 1 year ago

Overview

In this story, ResNet [1] is reviewed. ResNet can have a very deep network of up to 152 layers by learning the residual representation functions instead of learning the signal representation directly.

ResNet introduces the skip connection (or shortcut connection), which feeds the input from a previous layer to a later layer without any modification. Skip connections enable much deeper networks, and ResNet finally became the winner of ILSVRC 2015 in image classification, detection, and localization, as well as the winner of MS COCO 2015 detection and segmentation. This is a 2016 CVPR paper with more than 19,000 citations.

image ILSVRC 2015 Image Classification Ranking.

NorbertZheng commented 1 year ago

Dataset

ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 100,000 testing images.

NorbertZheng commented 1 year ago

Problems of Plain Network

Conventional deep learning networks such as AlexNet, ZFNet, and VGGNet usually have conv layers followed by fully connected (FC) layers for the classification task, without any skip / shortcut connections. We call them plain networks here.

Vanishing / Exploding Gradients

During backpropagation, the partial derivative of the error function with respect to a weight in an early layer is, by the chain rule, a product of the local derivatives of all the later layers in each iteration of training. This has the effect of multiplying many such factors together:

When the network is deep and these factors are small, the product of $n$ of these small numbers tends to zero (vanishing gradients).

When the network is deep and these factors are large, the product of $n$ of these large numbers becomes too large (exploding gradients).
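
As a quick numeric sketch of this effect (my own illustration, not from the paper), suppose each of $n$ layers contributes a local gradient factor of roughly 0.9 or 1.1:

```python
# Toy illustration of vanishing / exploding gradients:
# the gradient reaching an early layer is roughly a product of n per-layer factors.
n = 50  # assumed depth, for illustration only

print(0.9 ** n)  # ~0.005 -> the gradient effectively vanishes
print(1.1 ** n)  # ~117   -> the gradient explodes
```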

We expect a deeper network to achieve at least the accuracy of a shallower one.

However, the example below shows that a 20-layer plain network obtains lower training error and test error than a 56-layer plain network: a degradation problem occurs due to vanishing gradients.

image Plain Networks for CIFAR-10 Dataset.

NorbertZheng commented 1 year ago

Skip / Shortcut Connection in Residual Network (ResNet)

To solve the problem of vanishing/exploding gradients, a skip / shortcut connection is added, which adds the input $x$ to the output after a few weight layers, as below: image A Building Block of Residual Network.

Hence, the output is $H(x) = F(x) + x$. The weight layers actually learn a kind of residual mapping: $F(x) = H(x) - x$. Even if the gradient through the weight layers becomes small, the identity shortcut still passes the gradient of $x$ through unchanged.
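
A minimal sketch of such a building block (my own PyTorch code, assuming 3×3 conv + batch norm weight layers; not the authors' original implementation):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal sketch of a ResNet basic block: H(x) = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        # Two 3x3 conv weight layers learn the residual mapping F(x) = H(x) - x.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # Identity skip connection: add the unmodified input x to F(x).
        return self.relu(residual + x)
```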

NorbertZheng commented 1 year ago

ResNet Architecture

image 34-layer ResNet with Skip / Shortcut Connection (Top), 34-layer Plain Network (Middle), 19-layer VGG-19 (Bottom).

The above figure shows the ResNet architecture.

For ResNet, there are 3 types of skip / shortcut connections when the input dimensions are smaller than the output dimensions:

(A) The shortcut performs identity mapping, with extra zeros padded onto the increased dimensions; no extra parameters are introduced.

(B) A projection shortcut is used only when dimensions increase; the other shortcuts are identity.

(C) All shortcuts are projections.
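
A minimal sketch of the projection shortcut used in options (B)/(C) (my own PyTorch code; a strided 1×1 conv is one common way to match dimensions, not necessarily the paper's exact setup):

```python
import torch.nn as nn

def projection_shortcut(in_channels: int, out_channels: int, stride: int = 2) -> nn.Module:
    """Projection shortcut: a strided 1x1 conv (plus batch norm) matches the
    spatial size and channel count when the block changes dimensions."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_channels),
    )
```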

NorbertZheng commented 1 year ago

Bottleneck Design

Since the network is very deep now, the time complexity is high. A bottleneck design is used to reduce the complexity as follows: image The Basic Block (Left) and The Proposed Bottleneck Design (Right).

It turns out that 1×1 conv can reduce the number of connections (parameters) while not degrading the performance of the network so much. (Please visit my review if interested.)

With the bottleneck design, the 34-layer ResNet becomes a 50-layer ResNet. And there are deeper networks with the bottleneck design: ResNet-101 and ResNet-152. The overall architecture for all networks is as below: image The overall architecture for all networks.

It is noted that VGG-16/19 has 15.3/19.6 billion FLOPs, so ResNet-152 (11.3 billion FLOPs) still has lower complexity than VGG-16/19!
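
A minimal sketch of the bottleneck block (my own PyTorch code; batch norm omitted for brevity, and the channel sizes are the 256/64 example from the figure):

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Sketch of the bottleneck design: 1x1 reduce -> 3x3 -> 1x1 expand, plus the shortcut.

    Weight count with 256 -> 64 -> 256 channels: 256*64 + 9*64*64 + 64*256 ≈ 70k,
    versus two plain 3x3 convs at 256 channels: 2 * 9 * 256 * 256 ≈ 1.18M.
    """
    def __init__(self, channels: int = 256, bottleneck: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)
        self.conv3x3 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv3x3(out))
        out = self.expand(out)
        return self.relu(out + x)  # identity shortcut
```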

NorbertZheng commented 1 year ago

Ablation Study

Plain Network vs. ResNet

image Validation Error: 18-Layer and 34-Layer Plain Network (Left), 18-Layer and 34-Layer ResNet (right).

image Top-1 Error Using 10-Crop Testing.

When plain networks are used, the 18-layer network is better than the 34-layer one, due to the vanishing gradient problem.

When ResNet is used, the 34-layer network is better than the 18-layer one; the vanishing gradient problem has been solved by skip connections.

If we compare the 18-layer plain network and the 18-layer ResNet, there is not much difference. This is because the vanishing gradient problem does not appear for such a shallow network.

NorbertZheng commented 1 year ago

Other Settings

These are some techniques used in previous deep learning works. If interested, please also feel free to read my reviews.

NorbertZheng commented 1 year ago

Comparison with State-of-the-art Approaches (Image Classification)

ILSVRC

image 10-Crop Testing Results.

Comparing ResNet-34 A, B, and C: B is slightly better than A, and C is marginally better than B, because extra parameters are introduced by the projection shortcuts; all obtain around a 7% error rate.

By increasing the network depth to 152 layers, a 5.71% top-5 error rate is obtained, which is much better than VGG-16 #90, GoogLeNet (Inception-v1) #95, and PReLU-Net #92.

image 10-Crop Testing + Fully Conv with Multiple Scale Results.

With 10-crop testing + fully convolutional testing at multiple scales, ResNet-152 can obtain a 4.49% error rate.

image 10-Crop Testing + Fully Conv with Multiple Scale + 6-Model Ensemble Results.

Adding the 6-model ensemble technique, the error rate drops to 3.57%.
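
A rough sketch of how such multi-crop and ensemble predictions can be averaged (my own code; `models` and `crops` are hypothetical placeholders, not the authors' evaluation pipeline):

```python
import torch

def ensemble_multi_crop_predict(models, crops):
    """Average softmax outputs over all crops and all models in the ensemble."""
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            for crop in crops:  # e.g. 10 crops: 4 corners + center, each with a horizontal flip
                probs.append(torch.softmax(model(crop.unsqueeze(0)), dim=1))
    # Averaging over crops and models gives the final class probabilities.
    return torch.stack(probs).mean(dim=0)
```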

NorbertZheng commented 1 year ago

CIFAR-10

image CIFAR-10 Results.

With skip connections, we can go deeper. However, when the number of layers goes from 110 to 1202, the error rate increases from 6.43% to 7.93%; this is left as an open question in the paper (possibly due to overfitting on this small dataset).

NorbertZheng commented 1 year ago

Comparison with State-of-the-art Approaches (Object Detection)

image PASCAL VOC 2007/2012 mAP (%).

image MS COCO mAP (%).

And ResNet finally won 1st place in ImageNet Detection, ImageNet Localization, COCO Detection, and COCO Segmentation!

NorbertZheng commented 1 year ago

References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.