In this story, an improved ResNet [1] by Microsoft is reviewed. With Identity Mapping, the deep learning architecture can reach over 1000 layers without an increase in error.
In the previous version of ResNet [2], when the depth goes from 110 layers to 1202 layers, ResNet-1202 can still converge, but the error rate degrades from 6.43% to 7.93% (this result can be seen in [2]). It was stated as an open question in [2], without any explanation.
The following figure shows the results of ResNet with Identity Mapping. With the depth going up to 1001 layers, the previous ResNet [2] only obtains 7.61% error, while the new ResNet with Identity Mapping [1] reaches 4.92% on the CIFAR-10 dataset.
(a) Previous ResNet [2] (7.61%) vs. (b) new ResNet with Identity Mapping [1] (4.92%) on the CIFAR-10 dataset.
In this paper, however, it is well explained, and a series of ablation studies is done to support the importance of this identity mapping.
The result is even better than that of Inception-v3 [3]. (If interested, please also read my Inception-v3 review.) With such good results, it was published as a 2016 ECCV paper, with more than 1000 citations at the time I was writing this story.
The forward pass, backpropagation, and gradient updates can make deep learning seem like a black box. I think the explanation here is excellent.
In ResNet with Identity Mapping, it is essential to keep the shortcut connection path clean. To see why, consider the forward and backward signal propagation.

$x_{l}$ is the input at the $l$-th layer, and $F(\cdot)$ is the function that represents the conv layers, BN, and ReLU. Then one particular layer can be formulated as:

$$x_{l+1} = x_{l} + F(x_{l}, W_{l})$$

and $L$ layers from the $l$-th layer as:

$$x_{L} = x_{l} + \sum_{i=l}^{L-1} F(x_{i}, W_{i})$$
We can see that the input signal $x_{l}$ is still kept here!
During backpropagation, we get a gradient that is decomposed into two additive terms:

$$\frac{\partial \mathcal{E}}{\partial x_{l}} = \frac{\partial \mathcal{E}}{\partial x_{L}} \frac{\partial x_{L}}{\partial x_{l}} = \frac{\partial \mathcal{E}}{\partial x_{L}} \left(1 + \frac{\partial}{\partial x_{l}} \sum_{i=l}^{L-1} F(x_{i}, W_{i})\right)$$
Inside the bracket, the left term is always "1", no matter how deep the network is, and the right term cannot always be −1, which would be needed to cancel it and make the gradient zero. Thus, the gradient does not vanish!
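To make this concrete, here is a minimal PyTorch sketch (my own toy example, not code from the paper) that stacks 100 residual units whose branches $F$ are deliberately tiny linear+tanh layers, and compares the gradient reaching the input with and without the identity shortcut:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim, depth = 8, 100   # toy width, very deep stack

# Residual branches F_i: small-weight linear + tanh layers, so on their own
# they would shrink the signal (and the gradient) layer after layer.
branches = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
for b in branches:
    nn.init.normal_(b[0].weight, std=0.01)
    nn.init.zeros_(b[0].bias)

x0 = torch.randn(1, dim, requires_grad=True)

# With identity shortcuts: x_{l+1} = x_l + F(x_l)
x = x0
for b in branches:
    x = x + b(x)
x.sum().backward()
print("with identity shortcuts, mean |grad| =", x0.grad.abs().mean().item())  # stays close to 1

# Plain stack without shortcuts: x_{l+1} = F(x_l)
x0.grad = None
x = x0
for b in branches:
    x = b(x)
x.sum().backward()
print("without shortcuts,       mean |grad| =", x0.grad.abs().mean().item())  # collapses toward 0
```

With the "1" term from the identity path, the gradient at the input stays on the order of one; without it, the product of 100 small Jacobians drives it to essentially zero.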
On the other hand, what if the left term is not equal to one, i.e., the shortcut is scaled by a factor $\lambda_{l}$? One particular layer becomes:

$$x_{l+1} = \lambda_{l} x_{l} + F(x_{l}, W_{l})$$

$L$ layers from the $l$-th layer:

$$x_{L} = \left(\prod_{i=l}^{L-1} \lambda_{i}\right) x_{l} + \sum_{i=l}^{L-1} \hat{F}(x_{i}, W_{i})$$

and the gradient, again decomposed into two additive terms:

$$\frac{\partial \mathcal{E}}{\partial x_{l}} = \frac{\partial \mathcal{E}}{\partial x_{L}} \left(\left(\prod_{i=l}^{L-1} \lambda_{i}\right) + \frac{\partial}{\partial x_{l}} \sum_{i=l}^{L-1} \hat{F}(x_{i}, W_{i})\right)$$

where $\hat{F}$ absorbs the scaling factors into the residual function.
Similarly, the left term of the gradient is now the product of the $\lambda_{i}$'s.

If $\lambda>1$, the left term becomes exponentially large, and the gradient exploding problem occurs. As we should remember, when the gradient explodes, the loss cannot converge.

If $\lambda<1$, the left term becomes exponentially small, and the gradient vanishing problem occurs. We cannot update the weights with a meaningful gradient; the loss stays on a plateau and ends up converging at a large value (see the quick numeric check below).
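To put rough numbers on the product term, here is a tiny Python check (a hypothetical illustration, assuming a constant scaling factor λ on each of the 54 shortcuts of a ResNet-110):

```python
# Product of 54 shortcut scaling factors, one per residual unit of a ResNet-110.
units = 54

for lam in (1.2, 1.0, 0.8):
    print(f"lambda = {lam}:  product over {units} units = {lam ** units:.2e}")

# lambda = 1.2: product ~ 1.9e+04  -> the gradient explodes
# lambda = 1.0: product = 1.0e+00  -> the identity term is preserved
# lambda = 0.8: product ~ 5.8e-06  -> the gradient vanishes
```

Only λ = 1, i.e. a pure identity shortcut, keeps the product harmless at any depth.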
Thus, that is why we need to keep the shortcut connection path clean from input to output, without any conv layers, BN, or ReLU.
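As a minimal PyTorch sketch of what a "clean" shortcut looks like (my own illustration, assuming equal input/output channels so a pure identity skip is possible; the commented-out variants are hypothetical stand-ins for the kinds of "dirty" shortcuts tested below):

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual unit whose shortcut path is a pure identity:
    no conv, no BN, no ReLU, and no scaling on the skip connection."""

    def __init__(self, channels):
        super().__init__()
        # All conv/BN/ReLU live on the residual branch F only.
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Clean shortcut: x_{l+1} = x_l + F(x_l)
        return x + self.branch(x)
        # "Dirty" shortcut variants (hypothetical, for contrast) that the ablation
        # below shows to hurt: constant scaling, a 1x1 conv, or dropout on the skip:
        #   return 0.5 * x + self.branch(x)
        #   return self.shortcut_conv(x) + self.branch(x)
        #   return self.shortcut_dropout(x) + self.branch(x)
```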
A 110-layer ResNet (54 two-layer residual units) with various types of shortcut connections is tested on the CIFAR-10 dataset, as below:
Performance of Various Types of Shortcut Connections.
The following results are obtained by playing around with the positions of BN and ReLU: Performance of Various Usages of Activation.
Previous ResNet structure (Baseline) vs Pre-activation Unit.
Using the previous ResNet structure (baseline) gives worse results when going too deep (1001 layers) due to the wrong position of the ReLU layer. Using the pre-activation unit always gives a better result as the network goes deeper and deeper, from 110 to 1001 layers (see the sketch of the two orderings below).
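The change is only in where BN and ReLU sit relative to the convolutions and the addition; here is a rough PyTorch sketch of the two orderings (my own paraphrase, with channel and stride handling omitted):

```python
import torch.nn as nn

def original_branch(c):
    # Original (post-activation) unit: conv -> BN -> ReLU -> conv -> BN on the branch,
    # then the addition, then a final ReLU *after* the addition,
    # i.e. forward is relu(x + branch(x)), so a ReLU sits on the propagation path.
    return nn.Sequential(
        nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c),
    )

def preact_branch(c):
    # Full pre-activation unit: BN -> ReLU -> conv -> BN -> ReLU -> conv on the branch,
    # and the addition is the last operation, i.e. forward is x + branch(x),
    # so the identity path stays completely untouched.
    return nn.Sequential(
        nn.BatchNorm2d(c), nn.ReLU(inplace=True), nn.Conv2d(c, c, 3, padding=1, bias=False),
        nn.BatchNorm2d(c), nn.ReLU(inplace=True), nn.Conv2d(c, c, 3, padding=1, bias=False),
    )
```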
Training Error vs Iterations.
CIFAR-10 & CIFAR-100 Results.
For CIFAR-10, ResNet-1001 with the proposed pre-activation unit (4.62%) is even better than ResNet-1202 (7.93%) using the previous version of ResNet, with about 200 fewer layers.
For CIFAR-100, ResNet-1001 with the proposed pre-activation unit (22.71%) is even better than ResNet-1001 (27.82%) using the previous version of ResNet.
For both CIFAR-10 and CIFAR-100, ResNet-1001 with the proposed pre-activation unit does not have a larger error than ResNet-164, but the previous ResNet [2] does.
On CIFAR-10, ResNet-1001 takes about 27 hours to train with 2 GPUs.
ILSVRC Image Classification Results.
With only scale augmentation, the previous version of ResNet-200 (6.0%) performs worse than the previous version of ResNet-152 (5.5%), the winner of ILSVRC 2015, even though it is deeper, due to the wrong position of the ReLU.
And the proposed ResNet-200 with pre-activation (5.3%) has better results than the previous ResNet-200 (6.0%).
With both scale and aspect ratio augmentation, the proposed ResNet-200 with Pre-Activation (4.8%) is better than Inception-v3 [3] by Google (5.6%).
Concurrently, Google also has Inception-ResNet-v2, which has a 4.9% error; with the pre-activation unit, the error is expected to be further reduced.
On ILSVRC, ResNet-200 takes about 3 weeks to train on 8 GPUs.
Sik-Ho Tang. Review: Pre-Activation ResNet with Identity Mapping — Over 1000 Layers Reached (Image Classification).