NorbertZheng / read-papers

My paper reading notes.
MIT License

Sik-Ho Tang | Review: Batch Normalization (Inception-v2 / BN-Inception) -- The 2nd to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification). #96

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: Batch Normalization (Inception-v2 / BN-Inception) — The 2nd to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, Inception-v2 [1] by Google is reviewed. This approach introduces an essential deep learning technique called Batch Normalization (BN).

With BN, higher accuracy and faster training speed can be achieved.

The ILSVRC (ImageNet Large Scale Visual Recognition Challenge) in 2015 became intense!

On 6 Feb 2015, Microsoft proposed PReLU-Net [2] with a 4.94% error rate, surpassing the human error rate of 5.1%.

Five days later, on 11 Feb 2015, Google proposed BN-Inception / Inception-v2 [1] on arXiv (NOT a submission to ILSVRC) with a 4.8% error rate.

Though BN did not take part in the ILSVRC competition, it introduced a very good concept that has been used in many networks since. It is a 2015 ICML paper with over 6000 citations at the time of writing. This is a must-read in deep learning.

NorbertZheng commented 1 year ago

Dataset

ImageNet is a dataset of over 15 million labeled high-resolution images with around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 100,000 testing images.

NorbertZheng commented 1 year ago

About The Inception Versions

There are 4 versions. The first GoogLeNet is Inception-v1 [3], but there are numerous typos in the Inception-v3 paper [4] that lead to wrong descriptions of the Inception versions. Consequently, many reviews on the internet mix up v2 and v3, and some even treat v2 and v3 as the same network with only minor differences in settings.

Nevertheless, in Inception-v4 [5], Google gives a much clearer description of the version issue:

Thus, when we talk about Batch Normalization (BN), we are talking about Inception-v2 or BN-Inception.

NorbertZheng commented 1 year ago

Why Do We Need Batch Normalization (BN)?

As we should know, the input $X$ is multiplied by the weight $W$, the bias $b$ is added, and the result passes through an activation function $F$ to become the output $Y$ at the next layer:

$$ Y=F(W\cdot X+b). $$

ReLU is then used as $F$, where $\mathrm{ReLU}(x)=\max(x,0)$, to address the saturation problem and the resulting vanishing gradients.
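As a quick illustration, a minimal NumPy sketch of this forward pass (the shapes here are arbitrary example choices, not from the paper):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(x, 0), applied element-wise
    return np.maximum(x, 0)

def forward(X, W, b):
    # Y = F(W·X + b) with F = ReLU
    return relu(W @ X + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 input features, batch of 8
W = rng.standard_normal((3, 4))   # 3 output units
b = np.zeros((3, 1))

Y = forward(X, W, b)
print(Y.shape)  # (3, 8); every entry is >= 0 after ReLU
```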

[Figure: Without BN (Left), With BN (Right).]

BN can reduce the dependence of gradients on the scale of the parameters or their initial values. As a result, higher learning rates can be used without the risk of divergence, and training becomes less sensitive to initialization.
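This insensitivity to parameter scale can be checked numerically. A minimal NumPy sketch (shapes and the scale factor of 10 are arbitrary assumptions) showing that normalizing the pre-activations makes the output invariant to rescaling $W$:

```python
import numpy as np

def bn(z, eps=1e-5):
    # the plain normalization step of BN (gamma=1, beta=0)
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))
W = rng.standard_normal((8, 4))

out = bn(X @ W)
out_scaled = bn(X @ (10.0 * W))  # scale all weights by 10

# scaling W scales mean and std by the same factor, so it cancels out
print(np.allclose(out, out_scaled, atol=1e-4))  # True
```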

NorbertZheng commented 1 year ago

Batch Normalization (BN)

[Figure: Batch Normalization.]

During training, we estimate the mean $\mu$ and variance $\sigma^{2}$ of the mini-batch as shown above. The input is then normalized by subtracting the mean $\mu$ and dividing by the standard deviation $\sigma$ (the epsilon $\epsilon$ prevents the denominator from being zero). Additional learnable parameters $\gamma$ and $\beta$ are used to scale and shift the normalized value, so the network can recover a better shape and position after normalization. The output $Y$ then becomes:

$$ Y=F(BN(W\cdot X+b)). $$

To obtain more precise estimates of the mean and variance, moving averages of the mini-batch statistics are maintained during training.

During testing, the population statistics (estimated by these moving averages) are used instead of the mini-batch statistics.
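The training/inference behaviour described above can be sketched as a minimal NumPy batch-norm layer (the momentum value, shapes, and use of moving averages for the population estimates are illustrative assumptions, not exact details from the paper):

```python
import numpy as np

class BatchNorm:
    """Minimal batch normalization over a (batch, features) array:
    mini-batch statistics during training, (moving-average) population
    statistics at inference time."""

    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        self.gamma = np.ones(num_features)   # learnable scale
        self.beta = np.zeros(num_features)   # learnable shift
        self.eps = eps
        self.momentum = momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training=True):
        if training:
            mu = x.mean(axis=0)              # mini-batch mean
            var = x.var(axis=0)              # mini-batch variance
            # moving averages serve as population estimates for inference
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)  # normalize
        return self.gamma * x_hat + self.beta       # scale and shift

rng = np.random.default_rng(0)
bn = BatchNorm(4)
x = 5.0 + 2.0 * rng.standard_normal((64, 4))  # mean ~5, std ~2
y = bn(x, training=True)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```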

NorbertZheng commented 1 year ago

Ablation Study

MNIST dataset

A 28×28 binary image is used as input, followed by 3 fully-connected hidden layers with 100 activations each; the last hidden layer is followed by an output layer with 10 activations, one per digit. The loss is the cross-entropy loss.
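As a quick sanity check of this setup, the parameter count of such an MLP (weights plus biases, ignoring any BN parameters) can be computed:

```python
# Layer sizes of the MNIST experiment: 28*28 input, three hidden
# layers of 100 units each, and a 10-way output (one unit per digit).
sizes = [28 * 28, 100, 100, 100, 10]

# (weights, biases) per fully-connected layer
params = [(n_in * n_out, n_out) for n_in, n_out in zip(sizes, sizes[1:])]
total = sum(w + b for w, b in params)
print(total)  # 99710 parameters in total
```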

[Figure: (a) Accuracy: With BN (Blue), Without BN (Black dotted); (b) and (c) one typical activation from the last layer.]

The BN network is much more stable.

Applying BN to GoogLeNet (Inception-v1)

Besides applying BN to Inception-v1 [3], the main architectural difference is that the 5×5 convolutional layers are replaced by two consecutive 3×3 convolutional layers with up to 128 filters. This is a kind of factorization later elaborated in Inception-v3 [4].

[Figure: Single Crop Accuracy.]
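The parameter savings of this factorization are easy to verify. A short sketch (the channel counts are illustrative, reusing the 128 filters mentioned above; biases are ignored):

```python
def conv_params(k, c_in, c_out):
    # weight count of a k×k convolution (ignoring biases)
    return k * k * c_in * c_out

c = 128                              # e.g. 128 input and output channels
one_5x5 = conv_params(5, c, c)       # a single 5×5 layer
two_3x3 = 2 * conv_params(3, c, c)   # two stacked 3×3 layers
print(one_5x5, two_3x3, two_3x3 / one_5x5)  # 409600 294912 0.72
```

Two stacked 3×3 convolutions cover the same 5×5 receptive field with only 18/25 = 72% of the weights, while adding an extra nonlinearity.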

From the above figure, there are many settings tested:

- Inception: Inception-v1 without BN
- BN-Baseline: Inception with BN
- BN-×5: initial learning rate increased by a factor of 5 to 0.0075
- BN-×30: initial learning rate increased by a factor of 30 to 0.045
- BN-×5-Sigmoid: BN-×5 but with Sigmoid
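For reference, the multipliers above are consistent with a baseline initial learning rate of 0.0015 (inferred from the quoted numbers, not stated in this note):

```python
base_lr = 0.0015        # implied baseline initial learning rate
bn_x5 = base_lr * 5     # 0.0075 as quoted for BN-×5
bn_x30 = base_lr * 30   # 0.045 as quoted for BN-×30
print(bn_x5, bn_x30)
```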

By comparing Inception and BN-Baseline, we can see that using BN can improve the training speed significantly.

By comparing BN-×5 and BN-×30, we can see that the initial learning rate can be increased substantially to speed up training even further.

And by observing BN-×5-Sigmoid, we can see that BN largely mitigates the saturation problem caused by Sigmoid.

NorbertZheng commented 1 year ago

Comparison with the State-of-the-art Approaches

[Figure: Comparison with the State-of-the-art Approaches.]

GoogLeNet (Inception-v1) is the winner of ILSVRC 2014 with a 6.67% error rate.

Deep Image from Baidu was submitted on 13 Jan 2015 with a 5.98% error rate, later improved to a best error rate of 4.58% with subsequent submissions. The Deep Image network is essentially a VGGNet-like architecture without major surprises, but it proposed hardware/software co-adaptation with up to 64 GPUs to increase the batch size up to 1024. (Due to frequent submissions that violated the competition rules, Baidu was banned for one year, and they also withdrew their paper.)

PReLU-Net from Microsoft was submitted on 6 Feb 2015 with a 4.94% error rate, the first to surpass human-level performance.

Inception-v2 / BN-Inception was reported on 11 Feb 2015 with a 4.82% error rate, the best result in the paper.

NorbertZheng commented 1 year ago

References