NorbertZheng / read-papers

My paper reading notes.
MIT License

Sik-Ho Tang | Review: PReLU-Net -- The First to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification). #92

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: PReLU-Net — The First to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, PReLU-Net [1] is reviewed. Parametric Rectified Linear Unit (PReLU) is proposed to generalize the traditional rectified linear unit (ReLU). This is the first deep learning approach to surpass human-level performance in ILSVRC (ImageNet Large Scale Visual Recognition Challenge) image classification. In addition, a better weight initialization for rectifiers is proposed, which helps the convergence of deep models (30 layers) trained directly from scratch.

Finally, PReLU-Net obtains a 4.94% top-5 error rate on the test set, which is better than the human-level performance of 5.1% and GoogLeNet's 6.66%!!!

This is a 2015 ICCV paper, and it has about 3000 citations at the moment I am writing this story.

NorbertZheng commented 1 year ago

Dataset

Classification: Over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, roughly 1.3M/50k/100k images are used for the training/validation/testing sets.

NorbertZheng commented 1 year ago

Parametric Rectified Linear Unit (PReLU)

In AlexNet [2], ReLU is suggested as below, where only positive values pass through the activation function while all negative values are set to zero. ReLU outperforms Tanh with much faster training speed because it does not saturate.

image PReLU (LeakyReLU?)

It is noted that when a = 0, it is ReLU; when a = 0.01, it is Leaky ReLU. Now the value of a can be learned, therefore becoming a generalized ReLU.
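The activation above can be sketched in a few lines of NumPy (a minimal illustration; the function name is mine, not the paper's):

```python
import numpy as np

def prelu(x, a):
    """PReLU: f(x) = x if x > 0, else a * x.
    a = 0 recovers ReLU; a = 0.01 recovers Leaky ReLU;
    in PReLU, a is a learnable parameter."""
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
relu_out = prelu(x, 0.0)     # ReLU special case
leaky_out = prelu(x, 0.01)   # Leaky ReLU special case
prelu_out = prelu(x, 0.25)   # PReLU with a hypothetical learned slope
```

The only change from ReLU is that the negative half-axis is scaled by a instead of zeroed out.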

During backpropagation, we can estimate the gradient: image Backpropagation, gradient from deep layer (Left), gradient of the activation (Right).

We can estimate the gradient from the deep layer (left) and the gradient of the activation (right). The gradient of a is the sum over all positions of the feature map (channel-wise variant); for the channel-shared variant, it is additionally summed over all channels of the layer. No weight decay is applied to a.
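Since df/da = x for x ≤ 0 and 0 for x > 0, the gradient of a is the sum of grad_out · x over the negative-input positions. A sketch of this, assuming (C, H, W) feature maps (the function name and shapes are my own):

```python
import numpy as np

def prelu_grad_a(x, grad_out, channel_shared=False):
    """Gradient of the loss w.r.t. the PReLU slope a.
    df/da = x where x <= 0, and 0 where x > 0, so each position
    contributes grad_out * x on the negative side.
    x, grad_out: arrays of shape (C, H, W)."""
    contrib = np.where(x > 0, 0.0, grad_out * x)
    if channel_shared:
        return contrib.sum()            # one a shared by the whole layer
    return contrib.sum(axis=(1, 2))     # channel-wise: one a per channel
```

For the channel-wise variant the sum runs over the spatial positions of each feature map; the channel-shared variant also sums over channels, as described above.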

image The average value of a over all channels for each layer.

Two interesting phenomena are observed:

NorbertZheng commented 1 year ago

A better weight initialization for rectifiers

Weight initialization typically draws from a Gaussian distribution with mean 0 and a variance that depends on the algorithm (e.g., Xavier). By considering the input and output sizes of each layer, a better weight initialization for rectifiers is suggested.

With $L$ layers put together, the variance is (left): image Variance of L layers (Left), and sufficient condition (right).

If the sufficient condition at the right is met, the network becomes stable. Thus, finally, the variance should be $\frac{2}{n_l}$, where $n_l$ is the number of connections in the $l$-th layer.

(There is also a proof for the backward propagation case. Interestingly, it comes up with the same sufficient condition, but I will not show it here.)
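The variance rule above (now commonly called He initialization) can be sketched as follows, assuming a fully connected layer where $n_l$ is the fan-in (the function name and seed handling are my own):

```python
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    """He initialization for rectifier networks: zero-mean Gaussian
    with variance 2 / n_l, where n_l (fan_in) is the number of input
    connections of layer l."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)            # Var = 2 / n_l
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = he_init(fan_in=512, fan_out=256)
```

The factor 2 (versus 1 in Xavier initialization) compensates for ReLU/PReLU zeroing out roughly half of the pre-activations, which is what keeps the variance stable across many layers.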

image Red (Ours) and Blue (Xavier), 22-layer (Left) and 30-layer (Right).

As shown above, the suggested weight initialization converges faster, and Xavier initialization cannot even converge for the deeper 30-layer model at the right when training from scratch.

NorbertZheng commented 1 year ago

22-layer deep learning models

image PReLU-Net: Model A, B, C.

image Model A using PReLU is better than the one using ReLU.

image Model A: PReLU converges faster.

NorbertZheng commented 1 year ago

Comparison with state-of-the-art approaches

image Single model, 10-view.

By using just a single model and 10-view (data augmentation), Model C has a 7.38% error rate.

image Single model, Multi-view, Multi-scale.

With multi-view and multi-scale, Model C has a 5.71% error rate. This result is already better than even the multi-model results of SPPNet [3–4], VGGNet [5], and GoogLeNet [6].

image Multi-model, Multi-view, Multi-scale.

With multi-model, i.e. an ensemble of 6 PReLU-Net models, a 4.94% error rate is obtained. This is a 26% relative improvement over GoogLeNet!!!

NorbertZheng commented 1 year ago

Object detection by using Fast R-CNN

PReLU-Net uses the Fast R-CNN [7] implementation for object detection on the PASCAL VOC 2007 dataset. image Model C + PReLU-Net has the best mAP result.

With an ImageNet-pretrained model and fine-tuning on the VOC 2007 dataset, Model C obtains better results than VGG-16.

NorbertZheng commented 1 year ago

References