In this story, PReLU-Net [1] is reviewed. The Parametric Rectified Linear Unit (PReLU) is proposed to generalize the traditional rectified unit (ReLU). This is the first deep learning approach to surpass human-level performance in ILSVRC (ImageNet Large Scale Visual Recognition Challenge) image classification. In addition, a better weight initialization for rectifiers is proposed, which helps the convergence of deep models (30 layers) trained directly from scratch.
Finally, PReLU-Net obtains a 4.94% top-5 error rate on the test set, which is better than the human-level performance of 5.1% and GoogLeNet's 6.66%!!!
This is a 2015 ICCV paper with about 3000 citations at the moment I am writing this story.
Classification: Over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, roughly 1.3M/50k/100k images are used for the training/validation/testing sets.
In AlexNet [2], ReLU is suggested as below, where only positive values pass through the activation function while all negative values are set to zero. ReLU outperforms Tanh with much faster training speed because it does not saturate (Tanh saturates at ±1).
PReLU (LeakyReLU?)
PReLU is defined as $f(y_i) = y_i$ if $y_i > 0$, and $f(y_i) = a_i y_i$ if $y_i \le 0$, where $a_i$ is a learnable coefficient. It is noted that when $a_i = 0$, it is ReLU; when $a_i = 0.01$, it is Leaky ReLU. Here the value of $a_i$ can be learnt, therefore becoming a generalized ReLU.
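To make the definition concrete, here is a minimal PReLU sketch in PyTorch (the module name, `num_channels`, and `init_a` are my own choices; PyTorch also ships an equivalent built-in, `nn.PReLU`):

```python
import torch
import torch.nn as nn

class PReLU(nn.Module):
    """f(y) = max(0, y) + a * min(0, y), with a learnable slope a.

    num_channels > 1 gives the channel-wise variant (one a per channel);
    num_channels = 1 gives the channel-shared variant.
    """
    def __init__(self, num_channels=1, init_a=0.25):
        super().__init__()
        # Shape (C, 1, 1) broadcasts over (N, C, H, W) inputs.
        # init_a = 0.25 follows the paper's initialization of a.
        self.a = nn.Parameter(torch.full((num_channels, 1, 1), init_a))

    def forward(self, y):
        # a = 0 recovers ReLU; a fixed at 0.01 would be Leaky ReLU.
        return torch.clamp(y, min=0) + self.a * torch.clamp(y, max=0)
```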
During backpropagation, the gradient of $a_i$ is obtained by the chain rule:

$$\frac{\partial E}{\partial a_i} = \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a_i}, \qquad \frac{\partial f(y_i)}{\partial a_i} = \begin{cases} 0, & \text{if } y_i > 0 \\ y_i, & \text{if } y_i \le 0 \end{cases}$$

Here $\partial E / \partial f(y_i)$ is the gradient propagated from the deeper layer (left), and $\partial f(y_i) / \partial a_i$ is the gradient of the activation (right). The sum runs over all positions of the feature map (channel-wise variant). For the channel-shared variant, it is additionally summed over all the channels of the layer. No weight decay is applied to $a$.
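As an illustration, here is a small sketch of this slope gradient for one layer (the function name and explicit sums are mine; in practice autograd computes this automatically):

```python
import torch

def prelu_slope_grad(y, upstream_grad):
    """y: pre-activations (N, C, H, W); upstream_grad: dE/df(y), same shape."""
    # df/da is 0 where y > 0, and y where y <= 0.
    local = torch.where(y > 0, torch.zeros_like(y), y)
    # Channel-wise variant: sum over batch and all spatial positions,
    # yielding one gradient per channel.
    grad_channelwise = (upstream_grad * local).sum(dim=(0, 2, 3))
    # Channel-shared variant: additionally sum over the channels.
    grad_shared = grad_channelwise.sum()
    return grad_channelwise, grad_shared
```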
The average value of a over all channels for each layer.
Two interesting phenomena are observed:
1. The first conv layer has coefficients much larger than 0, i.e. its learned activation is close to linear, keeping both positive and negative responses in the early stage.
2. For the channel-wise version, the coefficients generally decrease with increasing depth, i.e. the activations gradually become more nonlinear at deeper layers.
Weight initialization typically samples from a Gaussian distribution with zero mean and a variance that depends on the scheme; Xavier initialization, for example, sets the variance based on the numbers of input and output connections, but its derivation assumes linear activations. By taking the rectifier into account, a better weight initialization is suggested.
With $L$ layers put together, the variance of the response is:

$$\mathrm{Var}[y_L] = \mathrm{Var}[y_1] \left( \prod_{l=2}^{L} \frac{1}{2} n_l \, \mathrm{Var}[w_l] \right)$$

A sufficient condition to keep this product from exploding or vanishing is:

$$\frac{1}{2} n_l \, \mathrm{Var}[w_l] = 1, \quad \forall l$$

If this sufficient condition is met, the network stays stable. Thus, finally, the variance should be $\frac{2}{n_l}$, where $n_l$ is the number of connections in the $l$-th layer.
(There is also a proof for the backward-propagation case. It is interesting that it also arrives at a sufficient condition of the same form. I just do not show it here.)
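A minimal sketch of this initialization in PyTorch (the helper name `he_init` is mine; PyTorch's built-in `nn.init.kaiming_normal_` implements the same rule):

```python
import math
import torch.nn as nn

def he_init(module):
    # n_l = fan-in, the number of input connections of layer l.
    if isinstance(module, nn.Conv2d):
        fan_in = module.in_channels * module.kernel_size[0] * module.kernel_size[1]
    elif isinstance(module, nn.Linear):
        fan_in = module.in_features
    else:
        return
    # Zero-mean Gaussian with Var[w] = 2 / n_l, i.e. std = sqrt(2 / n_l).
    nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / fan_in))
    if module.bias is not None:
        nn.init.zeros_(module.bias)

# Usage: model.apply(he_init)
```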
Red (Ours) and Blue (Xavier), 22-layer (Left) and 30-layer (Right).
As shown above, the suggested weight initialization converges faster, and Xavier cannot even converge for the deeper 30-layer model on the right when training from scratch.
PReLU-Net: Model A, B, C.
Model A using PReLU is better than the one using ReLU.
Model A: PReLU converges faster.
Single model, 10-view.
By using just a single model and 10-view testing, Model C has a 7.38% top-5 error rate.
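For reference, a sketch of 10-view testing with torchvision (I am assuming the common 224x224 crops on a 256 resize; the paper's exact crop sizes may differ):

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

# 10 views = 4 corner crops + 1 center crop, plus their horizontal flips;
# class scores are averaged over the views.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [TF.to_tensor(c) for c in crops])),
])

def predict_10_view(model, pil_image):
    views = ten_crop(pil_image)          # (10, 3, 224, 224)
    with torch.no_grad():                # model assumed to be in eval mode
        logits = model(views)            # (10, num_classes)
    return logits.softmax(dim=1).mean(dim=0)
```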
Single model, Multi-view, Multi-scale.
With multi-view and multi-scale testing, Model C has a 5.71% error rate. This single-model result is already better than even the multi-model results of SPPNet [3–4], VGGNet [5] and GoogLeNet [6].
Multi-model, Multi-view, Multi-scale.
With multiple models, i.e. an ensemble of 6 PReLU-Nets, the error rate drops to 4.94%. This is a 26% relative improvement over GoogLeNet!!!
For object detection, PReLU-Net is used with the Fast R-CNN [7] implementation on the PASCAL VOC 2007 dataset, and Model C + PReLU has the best mAP result.
With an ImageNet-pretrained model fine-tuned on the VOC 2007 dataset, Model C obtains better results than VGG-16.
Sik-Ho Tang. Review: PReLU-Net — The First to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification).