In this story, AlexNet and CaffeNet are reviewed. AlexNet is the winner of ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2012, an image classification competition.
This is a 2012 NIPS paper from Prof. Hinton’s group with about 28,000 citations at the time of writing. It marked an essential breakthrough in deep learning, substantially reducing the error rate in ILSVRC 2012, as the figure below shows. Thus, this is a must-read paper!!
ILSVRC uses a subset of ImageNet of around 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images and 150,000 testing images.
AlexNet, the winner of ILSVRC 2012 image classification, with a remarkably lower error rate.
A. For AlexNet, we will cover its architecture and each of its key components: ReLU, training on multiple GPUs, local response normalization, overlapping pooling, data augmentation, dropout, and other learning details, followed by the results.
B. CaffeNet is just a single-GPU version of AlexNet. Since people normally only have one GPU, CaffeNet is a single-GPU network that simulates AlexNet. We will cover this as well at the end of this story.
By going through each component, we can see the importance of each one. Some of them are not so useful nowadays, but they did inspire the invention of other networks.
AlexNet.
AlexNet contains eight layers:
Input: 227×227×3 input images. (224×224×3 is mentioned in the paper and also in the figure; however, it was later pointed out that the input should be 227×227×3, or that the 224×224×3 input is padded during the 1st convolution. See the size check after this list.)
1st: Convolutional Layer: 2 groups of 48 kernels, size 11×11×3 (stride: 4, pad: 0). Outputs 55×55×48 feature maps ×2 groups. Then 3×3 Overlapping Max Pooling (stride: 2): outputs 27×27×48 feature maps ×2 groups. Then Local Response Normalization: outputs 27×27×48 feature maps ×2 groups.
2nd: Convolutional Layer: 2 groups of 128 kernels of size 5×5×48 (stride: 1, pad: 2). Outputs 27×27×128 feature maps ×2 groups. Then 3×3 Overlapping Max Pooling (stride: 2): outputs 13×13×128 feature maps ×2 groups. Then Local Response Normalization: outputs 13×13×128 feature maps ×2 groups.
3rd: Convolutional Layer: 2 groups of 192 kernels of size 3×3×256 (stride: 1, pad: 1). Outputs 13×13×192 feature maps ×2 groups.
4th: Convolutional Layer: 2 groups of 192 kernels of size 3×3×192 (stride: 1, pad: 1). Outputs 13×13×192 feature maps ×2 groups.
5th: Convolutional Layer: 2 groups of 128 kernels of size 3×3×192 (stride: 1, pad: 1). Outputs 13×13×128 feature maps ×2 groups. Then 3×3 Overlapping Max Pooling (stride: 2): outputs 6×6×128 feature maps ×2 groups.
6th: Fully Connected (Dense) Layer of 4096 neurons
7th: Fully Connected (Dense) Layer of 4096 neurons
8th: Fully Connected (Dense) Layer of 1000 output neurons (since there are 1000 classes). Softmax is used for calculating the loss.
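As a quick size check on the input question above (standard convolution arithmetic, not a formula from the paper), with input width $W$, kernel size $K$, padding $P$ and stride $S$:

$$\text{output size} = \frac{W - K + 2P}{S} + 1, \qquad \frac{227 - 11 + 0}{4} + 1 = 55, \qquad \frac{224 - 11 + 0}{4} + 1 = 54.25,$$

so a 227×227 input (or a padded 224×224 input) is what actually produces the 55×55 feature maps of the 1st layer.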
In total, there are about 60 million parameters to be trained!!!
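For reference, here is a minimal PyTorch sketch of the layer stack above, written as a single-path (CaffeNet-style) network with the two groups merged. The class name and the LRN constants are my own choices; everything other than the layer sizes listed above is an assumption rather than the paper's code.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Single-path sketch of AlexNet/CaffeNet; pooling/LRN order follows the list above."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),               # 227 -> 55
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 55 -> 27
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),   # 27 -> 27
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 27 -> 13
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),  # 13 -> 13
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),  # 13 -> 13
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # 13 -> 13
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)      # x: (N, 3, 227, 227)
        x = torch.flatten(x, 1)   # (N, 256*6*6)
        return self.classifier(x)

model = AlexNetSketch()
print(sum(p.numel() for p in model.parameters()))  # ~62M here; the grouped two-path original has ~60M
```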
Before AlexNet, Tanh was commonly used as the activation function. ReLU is introduced in AlexNet; in the paper, a ReLU network reaches a 25% training error rate on CIFAR-10 about six times faster than an equivalent Tanh network.
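For reference, the two activation functions are:

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

ReLU does not saturate for positive inputs (its gradient stays at 1 there), whereas tanh saturates for large $|x|$, which slows down gradient descent.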
At that time, the NVIDIA GTX 580 GPU was used, which only has 3GB of memory. Thus, we can see in the architecture that the network is split into two paths and uses 2 GPUs for the convolutions. Inter-GPU communication occurs only at one specific convolutional layer (the 3rd, whose kernels see both groups).
Thus, using 2 GPUs is due to the memory limitation, NOT for speeding up the training process.
Compared with a network having only half as many kernels in each convolutional layer (only one path), the whole network reduces the Top-1 and Top-5 error rates by 1.7% and 1.2% respectively.
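In modern frameworks, this two-path design corresponds to a grouped convolution. A minimal PyTorch illustration (my own, not the paper's code) of how groups=2 halves the number of weights in the 2nd convolutional layer:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Single-path version: each of the 256 kernels sees all 96 input maps.
conv_full = nn.Conv2d(96, 256, kernel_size=5, padding=2)
# Two-path (AlexNet-style) version: each group of 128 kernels sees only its own 48 input maps.
conv_grouped = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

print(n_params(conv_full))     # 614656  (256*96*5*5 + 256)
print(n_params(conv_grouped))  # 307456  (256*48*5*5 + 256)
```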
Normalization.
In AlexNet, local response normalization (LRN) is used. As we can see from the equation below, it is different from batch normalization. Normalization helps to speed up convergence.
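The local response normalization from the paper, where $a^{i}_{x,y}$ is the activity of the $i$-th kernel map at position $(x, y)$, $N$ is the number of kernel maps in the layer, and the sum runs over $n$ adjacent maps:

$$b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \left(a^{j}_{x,y}\right)^{2} \right)^{\beta}$$

with $k = 2$, $n = 5$, $\alpha = 10^{-4}$ and $\beta = 0.75$. Unlike batch normalization, the statistics here come from neighbouring channels of a single example, not from a mini-batch.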
With local response normalization, Top-1 and top-5 error rates are reduced by 1.4% and 1.2% respectively.
Overlapping Pooling is pooling with a stride smaller than the kernel size, while Non-Overlapping Pooling is pooling with a stride equal to or larger than the kernel size. AlexNet uses 3×3 pooling with stride 2, i.e., overlapping pooling; a small sketch follows below.
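A minimal PyTorch illustration (my own, not from the paper) of the two variants on a 27×27 feature map, matching the 2nd-layer pooling above:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 27, 27)                      # e.g. the 2nd-layer feature maps, groups merged

overlap = nn.MaxPool2d(kernel_size=3, stride=2)      # stride < kernel size: 3x3 windows overlap by one pixel
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)  # stride == kernel size: disjoint windows

print(overlap(x).shape)      # torch.Size([1, 256, 13, 13])  -> (27 - 3)//2 + 1 = 13
print(non_overlap(x).shape)  # torch.Size([1, 256, 13, 13])  -> (27 - 2)//2 + 1 = 13, but windows do not overlap
```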
With overlapping pooling, Top-1 and top-5 error rates are reduced by 0.4% and 0.3% respectively.
Two forms of data augmentation.
First: Image translation and horizontal reflection (mirroring). A random 224×224 patch is extracted from each 256×256 image, plus its horizontal reflection. The size of the training set is thereby increased by a factor of 2048, which can be calculated as follows:
By image translation: (256–224)²=32²=1024
By horizontal reflection: 1024 × 2 = 2048
At test time, the four corner patches plus the centre patch, as well as their corresponding horizontal reflections (10 patches in total), are used for prediction, and the results are averaged to obtain the final classification.
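A rough sketch of this pipeline using torchvision transforms (my own illustration; the paper's pipeline was of course not written this way):

```python
import torch
from torchvision import transforms

# Training time: random 224x224 crop from a 256x256 image plus random horizontal flip.
train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Test time: four corners + centre and their mirrors = 10 patches.
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # returns a tuple of 10 crops
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

# For one PIL image `img` and a trained classifier `model` (assumed to exist):
#   crops = test_tf(img)                              # shape (10, 3, 224, 224)
#   probs = model(crops).softmax(dim=1).mean(dim=0)   # average the 10 predictions
```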
Second: Altering the RGB intensities. PCA is performed on the set of RGB pixel values of the training set. To each training image, the following quantity is added to every pixel:

$[\mathbf{p}_{1}, \mathbf{p}_{2}, \mathbf{p}_{3}][\alpha_{1}\lambda_{1}, \alpha_{2}\lambda_{2}, \alpha_{3}\lambda_{3}]^{T}$

where $p_{i}$ and $\lambda_{i}$ are the $i$-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, respectively, and $\alpha_{i}$ is a random variable with mean 0 and standard deviation 0.1.
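A minimal NumPy sketch of this PCA colour augmentation (my own illustration; the function and variable names are assumptions):

```python
import numpy as np

def pca_color_augment(image, alpha_std=0.1, rng=np.random.default_rng()):
    """image: float array of shape (H, W, 3) holding RGB values."""
    # Note: in the paper the PCA is computed once over the whole ImageNet training set;
    # here it is computed per image only to keep the sketch self-contained.
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)           # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)       # lambda_i and p_i (columns of eigvecs)
    alphas = rng.normal(0.0, alpha_std, size=3)  # alpha_i ~ N(0, 0.1^2), drawn once per image
    shift = eigvecs @ (alphas * eigvals)         # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return image + shift                         # the same RGB shift is added to every pixel
```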
By increasing the size of the training set with data augmentation, the Top-1 error rate is reduced by over 1%.
Dropout.
In a layer that uses dropout, each neuron has, during training, a probability of not contributing to the forward pass and not participating in backpropagation. Thus, the network cannot rely too heavily on a few very “strong” neurons, and every neuron gets a better chance of being trained.
At test time, there is no dropout.
In AlexNet, a dropout probability of 0.5 is used at the first two fully-connected layers. Dropout is a kind of regularization technique that reduces overfitting.
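A minimal PyTorch sketch (my own illustration) of dropout on the first two fully-connected layers, and how its behaviour switches between training and test time:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),  # dropout at FC6
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True), nn.Dropout(p=0.5),  # dropout at FC7
    nn.Linear(4096, 1000),
)

x = torch.randn(4, 256 * 6 * 6)
classifier.train()   # dropout active: each unit is zeroed with probability 0.5
y_train = classifier(x)
classifier.eval()    # dropout becomes a no-op at test time (PyTorch rescales activations during training instead)
y_test = classifier(x)
```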
Batch size: 128. Momentum v: 0.9. Weight decay: 0.0005. Learning rate ϵ: 0.01, divided by 10 manually whenever the validation error rate stopped improving; this was done three times before training was terminated.
The update of momentum $v$ and weight $w$:

$$v_{i+1} = 0.9\, v_{i} - 0.0005\, \epsilon\, w_{i} - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_{i}} \right\rangle_{D_{i}}, \qquad w_{i+1} = w_{i} + v_{i+1},$$

where $i$ is the iteration index and $\langle \cdot \rangle_{D_{i}}$ denotes the average over the $i$-th batch $D_{i}$.
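These hyperparameters map directly onto a standard SGD configuration; a minimal PyTorch sketch (my own illustration, with a stand-in model):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the AlexNet sketch above

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)

# Divide the learning rate by 10 when the validation error stops improving,
# mimicking the manual schedule described above.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

# In the training loop:       loss.backward(); optimizer.step(); optimizer.zero_grad()
# After each validation pass: scheduler.step(val_error)
# Note: PyTorch's SGD update differs slightly in form from the equations above,
# but lr, momentum and weight_decay play the same roles.
```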
Training set of 1.2 million images. The network is trained for roughly 90 epochs, which took five to six days on two NVIDIA GTX 580 3GB GPUs.
Error Rate in ILSVRC 2010: AlexNet obtains 37.5% Top-1 and 17.0% Top-5 error, clearly better than the previous state of the art.
Error Rate in ILSVRC 2012: AlexNet obtains a 15.3% Top-5 test error rate (with an ensemble), far ahead of the second-place entry at 26.2%.
Some Top-5 results by AlexNet.
CaffeNet is a single-GPU version of AlexNet (see the CaffeNet architecture figure).
We can see that the two paths in AlexNet are combined into a single path.
It is noted that in early versions of CaffeNet, the order of the pooling and normalization layers was reversed by accident. The current version of CaffeNet provided by Caffe already has the correct order of pooling and normalization layers.
By investigating each component one by one, we can see how effective each of them is. :)
If interested, there is also a tutorial about CaffeNet quick setup using Nvidia-Docker and Caffe [3].
Also, read:
Sik-Ho Tang. Review: AlexNet, CaffeNet — Winner of ILSVRC 2012 (Image Classification).