In this story, AlexNet and CaffeNet are reviewed. AlexNet is the winner of ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2012, an image classification competition.
This is a 2012 NIPS paper from Prof. Hinton’s group with about 28,000 citations at the time of writing. It marked an essential breakthrough in deep learning, substantially reducing the error rate in ILSVRC 2012, as the figure below shows. Thus, this is a must-read paper!!
ILSVRC uses a subset of ImageNet of around 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images and 150,000 testing images.
AlexNet, the winner of ILSVRC 2012 image classification, with a remarkably lower error rate.
A. For AlexNet, we will cover its architecture and each of its key components: ReLU, training on multiple GPUs, local response normalization, overlapping pooling, data augmentation, dropout, and other learning details, followed by the results.
B. CaffeNet is just a single-GPU version of AlexNet. Since people normally only have one GPU, CaffeNet is a single-GPU network that simulates AlexNet. We will cover this as well at the end of this story.
By going through each component, we can see the importance of each one. Some of them are not so useful nowadays, but they did inspire the invention of other networks.
AlexNet.
AlexNet contains eight layers:
Input: 227×227×3 input images. (224×224×3 is mentioned in the paper and also in the figure; however, it was later pointed out that the input should be 227×227×3, or that the 224×224×3 input is padded during the 1st convolution. See the size check after this list.)
1st: Convolutional Layer: 2 groups of 48 kernels, size 11×11×3 (stride: 4, pad: 0). Outputs 55×55×48 feature maps ×2 groups. Then 3×3 Overlapping Max Pooling (stride: 2): outputs 27×27×48 feature maps ×2 groups. Then Local Response Normalization: outputs 27×27×48 feature maps ×2 groups.
2nd: Convolutional Layer: 2 groups of 128 kernels of size 5×5×48 (stride: 1, pad: 2). Outputs 27×27×128 feature maps ×2 groups. Then 3×3 Overlapping Max Pooling (stride: 2): outputs 13×13×128 feature maps ×2 groups. Then Local Response Normalization: outputs 13×13×128 feature maps ×2 groups.
3rd: Convolutional Layer: 2 groups of 192 kernels of size 3×3×256 (stride: 1, pad: 1). Outputs 13×13×192 feature maps ×2 groups.
4th: Convolutional Layer: 2 groups of 192 kernels of size 3×3×192 (stride: 1, pad: 1). Outputs 13×13×192 feature maps ×2 groups.
5th: Convolutional Layer: 2 groups of 128 kernels of size 3×3×192 (stride: 1, pad: 1). Outputs 13×13×128 feature maps ×2 groups. Then 3×3 Overlapping Max Pooling (stride: 2): outputs 6×6×128 feature maps ×2 groups.
6th: Fully Connected (Dense) Layer of 4096 neurons
7th: Fully Connected (Dense) Layer of 4096 neurons
8th: Fully Connected (Dense) Layer of 1000 output neurons (since there are 1000 classes). Softmax is used for calculating the loss.
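As a quick size check on the input question above (standard convolution arithmetic, not a formula from the paper), with input width $W$, kernel size $K$, padding $P$ and stride $S$:

$$\text{output size} = \frac{W - K + 2P}{S} + 1, \qquad \frac{227 - 11 + 0}{4} + 1 = 55, \qquad \frac{224 - 11 + 0}{4} + 1 = 54.25,$$

so a 227×227 input (or a padded 224×224 input) is what actually produces the 55×55 feature maps of the 1st layer.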
In total, there are about 60 million parameters to be trained!!!
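For reference, here is a minimal PyTorch sketch of the layer stack above, written as a single-path (CaffeNet-style) network with the two groups merged. The class name and the LRN constants are my own choices; everything other than the layer sizes listed above is an assumption rather than the paper's code.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Single-path sketch of AlexNet/CaffeNet; pooling/LRN order follows the list above."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),               # 227 -> 55
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 55 -> 27
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),   # 27 -> 27
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 27 -> 13
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),  # 13 -> 13
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),  # 13 -> 13
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # 13 -> 13
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)      # x: (N, 3, 227, 227)
        x = torch.flatten(x, 1)   # (N, 256*6*6)
        return self.classifier(x)

model = AlexNetSketch()
print(sum(p.numel() for p in model.parameters()))  # ~62M here; the grouped two-path original has ~60M
```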
Before AlexNet, Tanh was commonly used as the activation function. ReLU is introduced in AlexNet; in the paper, a ReLU network reaches a 25% training error rate on CIFAR-10 about six times faster than an equivalent Tanh network.
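For reference, the two activation functions are:

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

ReLU does not saturate for positive inputs (its gradient stays at 1 there), whereas tanh saturates for large $|x|$, which slows down gradient descent.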
At that time, the NVIDIA GTX 580 GPU was used, which only has 3GB of memory. Thus, we can see in the architecture that the network is split into two paths and uses 2 GPUs for the convolutions. Inter-GPU communication occurs only at one specific convolutional layer (the 3rd, whose kernels see both groups).
Thus, using 2 GPUs is due to the memory limitation, NOT for speeding up the training process.
Compared with a network having only half as many kernels in each convolutional layer (only one path), the whole network reduces the Top-1 and Top-5 error rates by 1.7% and 1.2% respectively.
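In modern frameworks, this two-path design corresponds to a grouped convolution. A minimal PyTorch illustration (my own, not the paper's code) of how groups=2 halves the number of weights in the 2nd convolutional layer:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Single-path version: each of the 256 kernels sees all 96 input maps.
conv_full = nn.Conv2d(96, 256, kernel_size=5, padding=2)
# Two-path (AlexNet-style) version: each group of 128 kernels sees only its own 48 input maps.
conv_grouped = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

print(n_params(conv_full))     # 614656  (256*96*5*5 + 256)
print(n_params(conv_grouped))  # 307456  (256*48*5*5 + 256)
```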
Normalization.
In AlexNet, local response normalization (LRN) is used. As we can see from the equation below, it is different from batch normalization. Normalization helps to speed up convergence.
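The local response normalization from the paper, where $a^{i}_{x,y}$ is the activity of the $i$-th kernel map at position $(x, y)$, $N$ is the number of kernel maps in the layer, and the sum runs over $n$ adjacent maps:

$$b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \left(a^{j}_{x,y}\right)^{2} \right)^{\beta}$$

with $k = 2$, $n = 5$, $\alpha = 10^{-4}$ and $\beta = 0.75$. Unlike batch normalization, the statistics here come from neighbouring channels of a single example, not from a mini-batch.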
With local response normalization, Top-1 and top-5 error rates are reduced by 1.4% and 1.2% respectively.
Overlapping Pooling is pooling with a stride smaller than the kernel size, while Non-Overlapping Pooling is pooling with a stride equal to or larger than the kernel size. AlexNet uses 3×3 pooling with stride 2, i.e., overlapping pooling; a small sketch follows below.
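A minimal PyTorch illustration (my own, not from the paper) of the two variants on a 27×27 feature map, matching the 2nd-layer pooling above:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 27, 27)                      # e.g. the 2nd-layer feature maps, groups merged

overlap = nn.MaxPool2d(kernel_size=3, stride=2)      # stride < kernel size: 3x3 windows overlap by one pixel
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)  # stride == kernel size: disjoint windows

print(overlap(x).shape)      # torch.Size([1, 256, 13, 13])  -> (27 - 3)//2 + 1 = 13
print(non_overlap(x).shape)  # torch.Size([1, 256, 13, 13])  -> (27 - 2)//2 + 1 = 13, but windows do not overlap
```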
With overlapping pooling, Top-1 and top-5 error rates are reduced by 0.4% and 0.3% respectively.
Two forms of data augmentation.
First: Image translation and horizontal reflection (mirroring). A random 224×224 patch is extracted from each 256×256 image, plus its horizontal reflection. The size of the training set is thereby increased by a factor of 2048, which can be calculated as follows:
By image translation: (256–224)²=32²=1024
By horizontal reflection: 1024 × 2 = 2048
At test time, the four corner patches plus the centre patch, as well as their corresponding horizontal reflections (10 patches in total), are used for prediction, and the results are averaged to obtain the final classification.
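A rough sketch of this pipeline using torchvision transforms (my own illustration; the paper's pipeline was of course not written this way):

```python
import torch
from torchvision import transforms

# Training time: random 224x224 crop from a 256x256 image plus random horizontal flip.
train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Test time: four corners + centre and their mirrors = 10 patches.
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # returns a tuple of 10 crops
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

# For one PIL image `img` and a trained classifier `model` (assumed to exist):
#   crops = test_tf(img)                              # shape (10, 3, 224, 224)
#   probs = model(crops).softmax(dim=1).mean(dim=0)   # average the 10 predictions
```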
Second: Altering the RGB intensities. PCA is performed on the set of RGB pixel values of the training set. To each training image, the following quantity is added to every pixel:

$[\mathbf{p}_{1}, \mathbf{p}_{2}, \mathbf{p}_{3}][\alpha_{1}\lambda_{1}, \alpha_{2}\lambda_{2}, \alpha_{3}\lambda_{3}]^{T}$

where $p_{i}$ and $\lambda_{i}$ are the $i$-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, respectively, and $\alpha_{i}$ is a random variable with mean 0 and standard deviation 0.1.
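A minimal NumPy sketch of this PCA colour augmentation (my own illustration; the function and variable names are assumptions):

```python
import numpy as np

def pca_color_augment(image, alpha_std=0.1, rng=np.random.default_rng()):
    """image: float array of shape (H, W, 3) holding RGB values."""
    # Note: in the paper the PCA is computed once over the whole ImageNet training set;
    # here it is computed per image only to keep the sketch self-contained.
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)           # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)       # lambda_i and p_i (columns of eigvecs)
    alphas = rng.normal(0.0, alpha_std, size=3)  # alpha_i ~ N(0, 0.1^2), drawn once per image
    shift = eigvecs @ (alphas * eigvals)         # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return image + shift                         # the same RGB shift is added to every pixel
```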
By increasing the size of the training set with data augmentation, the Top-1 error rate is reduced by over 1%.
Dropout.
In a layer that uses dropout, each neuron has, during training, a probability of not contributing to the forward pass and not participating in backpropagation. Thus, the network cannot rely too heavily on a few very “strong” neurons, and every neuron gets a better chance of being trained.
At test time, there is no dropout.
In AlexNet, a dropout probability of 0.5 is used at the first two fully-connected layers. Dropout is a kind of regularization technique that reduces overfitting.
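A minimal PyTorch sketch (my own illustration) of dropout on the first two fully-connected layers, and how its behaviour switches between training and test time:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),  # dropout at FC6
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True), nn.Dropout(p=0.5),  # dropout at FC7
    nn.Linear(4096, 1000),
)

x = torch.randn(4, 256 * 6 * 6)
classifier.train()   # dropout active: each unit is zeroed with probability 0.5
y_train = classifier(x)
classifier.eval()    # dropout becomes a no-op at test time (PyTorch rescales activations during training instead)
y_test = classifier(x)
```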
Batch size: 128. Momentum v: 0.9. Weight decay: 0.0005. Learning rate ϵ: 0.01, divided by 10 manually whenever the validation error rate stopped improving; this was done three times before training was terminated.
The update of momentum $v$ and weight $w$:

$$v_{i+1} = 0.9\, v_{i} - 0.0005\, \epsilon\, w_{i} - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_{i}} \right\rangle_{D_{i}}, \qquad w_{i+1} = w_{i} + v_{i+1},$$

where $i$ is the iteration index and $\langle \cdot \rangle_{D_{i}}$ denotes the average over the $i$-th batch $D_{i}$.
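These hyperparameters map directly onto a standard SGD configuration; a minimal PyTorch sketch (my own illustration, with a stand-in model):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the AlexNet sketch above

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)

# Divide the learning rate by 10 when the validation error stops improving,
# mimicking the manual schedule described above.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

# In the training loop:       loss.backward(); optimizer.step(); optimizer.zero_grad()
# After each validation pass: scheduler.step(val_error)
# Note: PyTorch's SGD update differs slightly in form from the equations above,
# but lr, momentum and weight_decay play the same roles.
```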
Training set of 1.2 million images. The network is trained for roughly 90 epochs, which took five to six days on two NVIDIA GTX 580 3GB GPUs.
Error Rate in ILSVRC 2010: AlexNet obtains 37.5% Top-1 and 17.0% Top-5 error, clearly better than the previous state of the art.
Error Rate in ILSVRC 2012: AlexNet obtains a 15.3% Top-5 test error rate (with an ensemble), far ahead of the second-place entry at 26.2%.
Some Top-5 results by AlexNet.
CaffeNet is a single-GPU version of AlexNet (see the CaffeNet architecture figure).
We can see that the two paths in AlexNet are combined into a single path.
It is noted that in early versions of CaffeNet, the order of the pooling and normalization layers was reversed by accident. The current version of CaffeNet provided by Caffe already has the correct order of pooling and normalization layers.
By investigating each component one by one, we can see how effective each of them is. :)
If interested, there is also a tutorial about CaffeNet quick setup using Nvidia-Docker and Caffe [3].
Also, read:
Sik-Ho Tang. Review: AlexNet, CaffeNet — Winner of ILSVRC 2012 (Image Classification).