In this story, PReLU-Net [1] is reviewed. The Parametric Rectified Linear Unit (PReLU) is proposed to generalize the traditional rectified unit (ReLU). This is the first deep learning approach to surpass human-level performance in ILSVRC (ImageNet Large Scale Visual Recognition Challenge) image classification. In addition, a better weight initialization for rectifiers is proposed, which helps the convergence of deep models (30 layers) trained directly from scratch.
Finally, PReLU-Net obtains a 4.94% top-5 error rate on the test set, which is better than the human-level performance of 5.1% and GoogLeNet's 6.66%!!!
This is a 2015 ICCV paper with about 3000 citations at the moment I am writing this story.
Classification: Over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, roughly 1.3M/50k/100k images are used for the training/validation/testing sets.
In AlexNet [2], ReLU is suggested as below, where only positive values pass through the activation function while all negative values are set to zero. ReLU outperforms Tanh with much faster training speed because it does not saturate (Tanh saturates at ±1).
PReLU (LeakyReLU?)
PReLU is defined as $f(y_i) = y_i$ if $y_i > 0$, and $f(y_i) = a_i y_i$ if $y_i \le 0$, where $a_i$ is a learnable coefficient. It is noted that when $a_i = 0$, it is ReLU; when $a_i = 0.01$, it is Leaky ReLU. Here the value of $a_i$ can be learnt, therefore becoming a generalized ReLU.
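To make the definition concrete, here is a minimal PReLU sketch in PyTorch (the module name, `num_channels`, and `init_a` are my own choices; PyTorch also ships an equivalent built-in, `nn.PReLU`):

```python
import torch
import torch.nn as nn

class PReLU(nn.Module):
    """f(y) = max(0, y) + a * min(0, y), with a learnable slope a.

    num_channels > 1 gives the channel-wise variant (one a per channel);
    num_channels = 1 gives the channel-shared variant.
    """
    def __init__(self, num_channels=1, init_a=0.25):
        super().__init__()
        # Shape (C, 1, 1) broadcasts over (N, C, H, W) inputs.
        # init_a = 0.25 follows the paper's initialization of a.
        self.a = nn.Parameter(torch.full((num_channels, 1, 1), init_a))

    def forward(self, y):
        # a = 0 recovers ReLU; a fixed at 0.01 would be Leaky ReLU.
        return torch.clamp(y, min=0) + self.a * torch.clamp(y, max=0)
```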
During backpropagation, the gradient of $a_i$ is obtained by the chain rule:

$$\frac{\partial E}{\partial a_i} = \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a_i}, \qquad \frac{\partial f(y_i)}{\partial a_i} = \begin{cases} 0, & \text{if } y_i > 0 \\ y_i, & \text{if } y_i \le 0 \end{cases}$$

Here $\partial E / \partial f(y_i)$ is the gradient propagated from the deeper layer (left), and $\partial f(y_i) / \partial a_i$ is the gradient of the activation (right). The sum runs over all positions of the feature map (channel-wise variant). For the channel-shared variant, it is additionally summed over all the channels of the layer. No weight decay is applied to $a$.
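As an illustration, here is a small sketch of this slope gradient for one layer (the function name and explicit sums are mine; in practice autograd computes this automatically):

```python
import torch

def prelu_slope_grad(y, upstream_grad):
    """y: pre-activations (N, C, H, W); upstream_grad: dE/df(y), same shape."""
    # df/da is 0 where y > 0, and y where y <= 0.
    local = torch.where(y > 0, torch.zeros_like(y), y)
    # Channel-wise variant: sum over batch and all spatial positions,
    # yielding one gradient per channel.
    grad_channelwise = (upstream_grad * local).sum(dim=(0, 2, 3))
    # Channel-shared variant: additionally sum over the channels.
    grad_shared = grad_channelwise.sum()
    return grad_channelwise, grad_shared
```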
The average value of a over all channels for each layer.
Two interesting phenomena are observed:
1. The first conv layer has coefficients much larger than 0, i.e. its learned activation is close to linear, keeping both positive and negative responses in the early stage.
2. For the channel-wise version, the coefficients generally decrease with increasing depth, i.e. the activations gradually become more nonlinear at deeper layers.
Weight initialization typically samples from a Gaussian distribution with zero mean and a variance that depends on the scheme; Xavier initialization, for example, sets the variance based on the numbers of input and output connections, but its derivation assumes linear activations. By taking the rectifier into account, a better weight initialization is suggested.
With $L$ layers put together, the variance of the response is:

$$\mathrm{Var}[y_L] = \mathrm{Var}[y_1] \left( \prod_{l=2}^{L} \frac{1}{2} n_l \, \mathrm{Var}[w_l] \right)$$

A sufficient condition to keep this product from exploding or vanishing is:

$$\frac{1}{2} n_l \, \mathrm{Var}[w_l] = 1, \quad \forall l$$

If this sufficient condition is met, the network stays stable. Thus, finally, the variance should be $\frac{2}{n_l}$, where $n_l$ is the number of connections in the $l$-th layer.
(There is also a proof for the backward-propagation case. It is interesting that it also arrives at a sufficient condition of the same form. I just do not show it here.)
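A minimal sketch of this initialization in PyTorch (the helper name `he_init` is mine; PyTorch's built-in `nn.init.kaiming_normal_` implements the same rule):

```python
import math
import torch.nn as nn

def he_init(module):
    # n_l = fan-in, the number of input connections of layer l.
    if isinstance(module, nn.Conv2d):
        fan_in = module.in_channels * module.kernel_size[0] * module.kernel_size[1]
    elif isinstance(module, nn.Linear):
        fan_in = module.in_features
    else:
        return
    # Zero-mean Gaussian with Var[w] = 2 / n_l, i.e. std = sqrt(2 / n_l).
    nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / fan_in))
    if module.bias is not None:
        nn.init.zeros_(module.bias)

# Usage: model.apply(he_init)
```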
Red (Ours) and Blue (Xavier), 22-layer (Left) and 30-layer (Right).
As shown above, the suggested weight initialization converges faster, and Xavier cannot even converge for the deeper 30-layer model on the right when training from scratch.
PReLU-Net: Model A, B, C.
Model A using PReLU is better than the one using ReLU.
Model A: PReLU converges faster.
Single model, 10-view.
By using just a single model and 10-view testing, Model C has a 7.38% top-5 error rate.
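For reference, a sketch of 10-view testing with torchvision (I am assuming the common 224x224 crops on a 256 resize; the paper's exact crop sizes may differ):

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

# 10 views = 4 corner crops + 1 center crop, plus their horizontal flips;
# class scores are averaged over the views.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [TF.to_tensor(c) for c in crops])),
])

def predict_10_view(model, pil_image):
    views = ten_crop(pil_image)          # (10, 3, 224, 224)
    with torch.no_grad():                # model assumed to be in eval mode
        logits = model(views)            # (10, num_classes)
    return logits.softmax(dim=1).mean(dim=0)
```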
Single model, Multi-view, Multi-scale.
With multi-view and multi-scale testing, Model C has a 5.71% error rate. This single-model result is already better than even the multi-model results of SPPNet [3–4], VGGNet [5] and GoogLeNet [6].
Multi-model, Multi-view, Multi-scale.
With multiple models, i.e. an ensemble of 6 PReLU-Nets, the error rate drops to 4.94%. This is a 26% relative improvement over GoogLeNet!!!
For object detection, PReLU-Net is used with the Fast R-CNN [7] implementation on the PASCAL VOC 2007 dataset, and Model C + PReLU has the best mAP result.
With an ImageNet-pretrained model fine-tuned on the VOC 2007 dataset, Model C obtains better results than VGG-16.
Sik-Ho Tang. Review: PReLU-Net — The First to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification).