NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review: Pre-Activation ResNet with Identity Mapping -- Over 1000 Layers Reached (Image Classification). #102

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: Pre-Activation ResNet with Identity Mapping — Over 1000 Layers Reached (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, an improved ResNet [1] by Microsoft is reviewed. With identity mapping, the architecture can reach over 1000 layers without the error increasing.

In the previous version of ResNet [2], when the network goes from 110 layers to 1202 layers, ResNet-1202 still converges, but the error rate degrades from 6.43% to 7.93% (this result can be seen in [2]). This was stated as an open question in [2], without any explanation.

The following figure shows the results of ResNet with identity mapping. With 1001 layers, the previous ResNet [2] only gets 7.61% error, while the new ResNet with identity mapping [1] reaches 4.92% on the CIFAR-10 dataset.

Figure: (a) Previous ResNet [2] (7.61% error) vs. (b) new ResNet with identity mapping [1] (4.92% error) on CIFAR-10.

But in this paper, it is well explained, and a series of ablation studies is done to support the importance of this identity mapping.

The result is even better than Inception-v3 [3], #100. (If interested, please also read my Inception-v3 review.) With such a good result, it was published as a 2016 ECCV paper, with more than 1000 citations at the time I was writing this story.

NorbertZheng commented 1 year ago

Explanations of the Importance of Identity Mapping

The forward feeding, backpropagation, and gradient updates are what seem to make deep learning look like a secret. I think the explanation here is excellent.

Feed Forward

In ResNet with identity mapping, it is essential to keep the shortcut connection path a pure identity.

$x_{l}$ is the input at the $l$-th layer, and $F(\cdot)$ is the residual function representing the conv layers, BN, and ReLU, with weights $W_{l}$. One particular layer can then be formulated as:

$$x_{l+1} = x_{l} + F(x_{l}, W_{l})$$

Recursively applying this from the $l$-th layer up to any deeper layer $L$:

$$x_{L} = x_{l} + \sum_{i=l}^{L-1} F(x_{i}, W_{i})$$

We can see that the input signal $x_{l}$ is still kept here!
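
As a minimal sketch of this forward path (a PyTorch-style illustration with hypothetical names, not the authors' reference code), a full pre-activation residual unit keeps the shortcut clean and only adds the residual branch output:

```python
import torch
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Minimal full pre-activation residual unit: x_{l+1} = x_l + F(x_l).

    F(.) is BN -> ReLU -> Conv -> BN -> ReLU -> Conv; the shortcut is a pure
    identity, so the input signal x_l is carried through unchanged.
    (Hypothetical sketch for illustration only.)
    """
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Identity shortcut + residual branch.
        return x + self.f(x)

x = torch.randn(2, 16, 32, 32)
unit = PreActResidualUnit(16)
print(unit(x).shape)  # torch.Size([2, 16, 32, 32])
```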

NorbertZheng commented 1 year ago

Backpropagation

During backpropagation, the gradient of the loss $\mathcal{E}$ with respect to $x_{l}$ decomposes into two additive terms:

$$\frac{\partial \mathcal{E}}{\partial x_{l}} = \frac{\partial \mathcal{E}}{\partial x_{L}} \frac{\partial x_{L}}{\partial x_{l}} = \frac{\partial \mathcal{E}}{\partial x_{L}} \left( 1 + \frac{\partial}{\partial x_{l}} \sum_{i=l}^{L-1} F(x_{i}, W_{i}) \right)$$

Inside the bracket, the left term is always 1, no matter how deep the network is. And the right term cannot always be $-1$, which would make the whole gradient zero. Thus, the gradient does not vanish!
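
A tiny sanity check of this decomposition (my own illustration, not from the paper): for a scalar toy residual $y = x + f(x)$ with $f(x) = \tfrac{1}{2}x^{2}$, autograd returns $1 + f'(x)$, i.e. the constant 1 contributed by the identity shortcut plus the gradient of the residual term:

```python
import torch

# Toy residual: y = x + f(x), with f(x) = 0.5 * x**2, so f'(x) = x.
x = torch.tensor(3.0, requires_grad=True)
y = x + 0.5 * x ** 2
y.backward()

print(x.grad)  # tensor(4.) == 1 + f'(x) = 1 + 3
```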

NorbertZheng commented 1 year ago

Backpropagation When Identity Mapping Is Violated

On the other hand, what if the shortcut is not a clean identity but is scaled by a factor $\lambda_{l}$? One particular layer becomes:

$$x_{l+1} = \lambda_{l} x_{l} + F(x_{l}, W_{l})$$

Recursively, from the $l$-th layer to layer $L$:

$$x_{L} = \left( \prod_{i=l}^{L-1} \lambda_{i} \right) x_{l} + \sum_{i=l}^{L-1} \hat{F}(x_{i}, W_{i})$$

where $\hat{F}$ absorbs the scaling factors into the residual functions.

The gradient again decomposes into two additive terms:

$$\frac{\partial \mathcal{E}}{\partial x_{l}} = \frac{\partial \mathcal{E}}{\partial x_{L}} \left( \prod_{i=l}^{L-1} \lambda_{i} + \frac{\partial}{\partial x_{l}} \sum_{i=l}^{L-1} \hat{F}(x_{i}, W_{i}) \right)$$

Now the left term of the gradient is the product $\prod_{i=l}^{L-1} \lambda_{i}$ instead of 1.

If $\lambda>1$, the left term becomes exponentially large and the gradient exploding problem occurs. As we should remember, when the gradient explodes, the loss cannot converge.

If $\lambda<1$, the left term becomes exponentially small and the gradient vanishing problem occurs. The weights can no longer receive meaningful updates through the shortcut path, so the loss stays on a plateau and ends up converging at a large value.

That is why we need to keep the shortcut connection path from input to output clean, without any conv layers, BN, or ReLU on it.
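
To get a feel for the magnitude (a quick numeric illustration of my own, not from the paper): if a constant scaling $\lambda$ were applied on the shortcut at every one of 1000 residual units, the factor multiplying the shortcut signal (and its gradient) would be $\lambda^{1000}$:

```python
# Factor on the shortcut path after 1000 residual units, for a constant lambda.
for lam in (1.02, 1.00, 0.98):
    print(f"lambda = {lam}: lambda**1000 = {lam ** 1000:.3e}")

# lambda = 1.02 -> roughly 4e+08  (gradient explodes)
# lambda = 1.00 -> exactly 1      (signal and gradient preserved)
# lambda = 0.98 -> roughly 2e-09  (gradient vanishes)
```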

NorbertZheng commented 1 year ago

Ablation Study

Various types of shortcut connections

A 110-layer ResNet (54 two-layer residual units) with various types of shortcut connections is tested on the CIFAR-10 dataset, as shown below:

Figure: Performance of various types of shortcut connections.
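
As one concrete example of a non-identity variant (my own sketch; names are hypothetical), a constant-scaling shortcut multiplies the identity path by a factor, which is exactly the $\lambda$ case analyzed above and is why such variants underperform the clean identity shortcut:

```python
import torch.nn as nn

class ConstantScalingUnit(nn.Module):
    """Residual unit with a constant-scaling shortcut: x_{l+1} = lam * x_l + F(x_l).

    Any lam != 1 reintroduces the product-of-lambdas term in the gradient.
    (Hypothetical sketch for illustration only.)
    """
    def __init__(self, residual_branch, lam=0.5):
        super().__init__()
        self.f = residual_branch  # e.g. the BN-ReLU-Conv stack sketched earlier
        self.lam = lam

    def forward(self, x):
        return self.lam * x + self.f(x)
```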

NorbertZheng commented 1 year ago

Various Usages of Activation

The following results are obtained by varying the positions of BN and ReLU:

Figure: Performance of various usages of activation.
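
To make the difference in ordering concrete, here is a hedged PyTorch sketch (helper names are my own) of the residual branch under the original post-activation design versus the full pre-activation design:

```python
import torch.nn as nn

def original_branch(c):
    # Original (post-activation) ordering: Conv -> BN -> ReLU -> Conv -> BN,
    # followed by the addition and then a final ReLU applied to the sum.
    return nn.Sequential(
        nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c),
    )

def preact_branch(c):
    # Full pre-activation ordering: BN -> ReLU -> Conv -> BN -> ReLU -> Conv;
    # the addition output is passed on as-is, keeping the shortcut clean.
    return nn.Sequential(
        nn.BatchNorm2d(c), nn.ReLU(inplace=True), nn.Conv2d(c, c, 3, padding=1, bias=False),
        nn.BatchNorm2d(c), nn.ReLU(inplace=True), nn.Conv2d(c, c, 3, padding=1, bias=False),
    )
```

In the original design, the ReLU applied after the addition modifies the shortcut output before it enters the next unit; full pre-activation leaves the sum untouched, which matches the clean-shortcut analysis above.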

NorbertZheng commented 1 year ago

Twofold Advantages of Pre-activation

Ease of Optimization

Figure: Previous ResNet structure (baseline) vs. pre-activation unit.

Using the previous ResNet structure (baseline) gives worse results when going very deep (1001 layers) due to the wrong position of the ReLU layer. Using the pre-activation unit consistently gives better results as the network grows deeper, from 110 to 1001 layers.

Reducing Overfitting

Figure: Training error vs. iterations.

With pre-activation, BN normalizes the input to every residual function, which has a regularization effect: the training error is slightly higher, but the test error is lower.

NorbertZheng commented 1 year ago

Comparison with State-of-the-art Approaches

CIFAR-10 & CIFAR-100

Figure: CIFAR-10 & CIFAR-100 results.

For CIFAR-10, ResNet-1001 with the proposed pre-activation unit (4.62%) is even better than ResNet-1202 with the previous version of ResNet (7.93%), with about 200 layers fewer.

For CIFAR-100, ResNet-1001 with the proposed pre-activation unit (22.71%) is even better than ResNet-1001 with the previous version of ResNet (27.82%).

On both CIFAR-10 and CIFAR-100, ResNet-1001 with the proposed pre-activation unit does not have a larger error than ResNet-164, whereas the previous ResNet [2] does.

On CIFAR-10, ResNet-1001 takes about 27 hours to train with 2 GPUs.

NorbertZheng commented 1 year ago

ILSVRC

Figure: ILSVRC image classification results.

With only scale augmentation, the previous version of ResNet-152 (5.5%), the winner of ILSVRC 2015, actually outperforms the previous version of ResNet-200 (6.0%): going deeper makes the old design worse, due to the wrong position of the ReLU.

And the proposed ResNet-200 with pre-activation (5.3%) has better results than the previous ResNet-200 (6.0%).

With both scale and aspect ratio augmentation, the proposed ResNet-200 with Pre-Activation (4.8%) is better than Inception-v3 [3] by Google (5.6%).

Concurrently, Google also has Inception-ResNet-v2, which has 4.9% error; with the pre-activation unit, the error is expected to be reduced further.

On ILSVRC, ResNet-200 takes about 3 weeks to train on 8 GPUs.

NorbertZheng commented 1 year ago

References