In this story, VGGNet [1] is reviewed. VGGNet was invented by the Visual Geometry Group (VGG) at the University of Oxford. Although VGGNet was only the 1st runner-up, not the winner, of the ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2014 classification task, it is a significant improvement over ZFNet (the winner in 2013) [2] and AlexNet (the winner in 2012) [3]. (GoogLeNet was the winner of the ILSVRC 2014 classification task; I will talk about it later.) Nevertheless, VGGNet beat GoogLeNet and won the localization task in ILSVRC 2014.
It was also the first year that deep learning models obtained an error rate under 10%. Most importantly, the paper shows that the depth of the network is a critical component for good performance.
That's why we need to know about VGGNet! That is also why this 2015 ICLR paper had more than 14,000 citations when I was writing this story.
ILSVRC 2014 Ranking [4].
Usually, people only talk about VGG-16 and VGG-19. Here I will cover VGG-11, VGG-11 (LRN), VGG-13, VGG-16 (Conv1), VGG-16 and VGG-19, following the ablation study in the paper.
Dense testing, usually ignored, will also be covered.
ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In total, there are roughly 1.3 million training images, 50,000 validation images and 100,000 test images.
ILSVRC.
2 layers of 3×3 filters already covered the 5×5 area.
By using 2 layers of 3×3 filters, the receptive field already covers a 5×5 area, as in the above figure. By using 3 layers of 3×3 filters, the receptive field covers an effective 7×7 area.
(If interested, please go to my stories about ZFNet [5] and AlexNet [6].)
Another reason is that the number of parameters is smaller. Suppose there is only 1 filter per layer, a single input channel, and no bias; the comparison is as follows (a short script reproducing these counts is given after the list):
1 layer of 11×11 filter: number of parameters = 11×11 = 121
5 layers of 3×3 filters: number of parameters = 3×3×5 = 45
The number of parameters is reduced by 63%.

1 layer of 7×7 filter: number of parameters = 7×7 = 49
3 layers of 3×3 filters: number of parameters = 3×3×3 = 27
The number of parameters is reduced by 45%.

1 layer of 5×5 filter: number of parameters = 5×5 = 25
2 layers of 3×3 filters: number of parameters = 3×3+3×3 = 18
The number of parameters is reduced by 28%.
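These counts and the corresponding effective receptive fields are easy to verify; a minimal, self-written Python check (my own sketch, not code from the paper) could look like this:

```python
def stacked_3x3_stats(n_layers):
    """Parameters and effective receptive field of n_layers stacked 3x3 convs
    (1 filter per layer, 1 input channel, no bias)."""
    params = 3 * 3 * n_layers
    receptive_field = 1 + 2 * n_layers  # each extra 3x3 layer grows the field by 2
    return params, receptive_field

for single, stacked in [(11, 5), (7, 3), (5, 2)]:
    single_params = single * single
    stacked_params, rf = stacked_3x3_stats(stacked)
    reduction = 100 * (1 - stacked_params / single_params)
    print(f"{single}x{single}: {single_params} params | "
          f"{stacked} layers of 3x3: {stacked_params} params "
          f"(receptive field {rf}x{rf}), reduced by {reduction:.0f}%")
```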
The larger the network, the hungrier it is for training images, and it also suffers more from the vanishing gradient problem.
With fewer parameters to learn, convergence is faster and overfitting is reduced.
This is just like the idea of using dilated convolution to replace conventional convolution + pooling.
Different VGG Layer Structures Using Single Scale (256) Evaluation.
To find the optimal layer structure, an ablation study was carried out, as shown in the above figure.
By adding layers one by one, we can observe that VGG-16 and VGG-19 start to converge and the accuracy improvement slows down. That is why, when people talk about VGGNet, they usually mention VGG-16 and VGG-19.
As objects appear at different scales within an image, training the network at only one scale means we might miss detections or misclassify objects at other scales. To tackle this, the authors propose multi-scale training.
For single-scale training, an image is rescaled so that its smaller side equals 256 or 384, i.e. S = 256 or 384. Since the network only accepts 224×224 inputs, the rescaled image is then cropped to 224×224. The concept is as follows:
Single-Scale Training with S=256 and S=384.
For multi-scale training, an image is rescaled so that its smaller side is sampled from the range 256 to 512, i.e. S = [256; 512], and then cropped to 224×224. Therefore, with a range of S, we feed objects at different scales into the network during training.
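As an illustration, a minimal PyTorch/torchvision sketch of this scale jittering could look like the following (my own code; the class name RandomScaleThenCrop and its defaults are made up for illustration, not taken from the paper):

```python
import random

from torchvision import transforms
import torchvision.transforms.functional as F


class RandomScaleThenCrop:
    """Rescale so the shorter side is a random S in [s_min, s_max],
    then take a random 224x224 crop and an optional horizontal flip."""

    def __init__(self, s_min=256, s_max=512, crop_size=224):
        self.s_min, self.s_max = s_min, s_max
        self.crop = transforms.RandomCrop(crop_size)
        self.flip = transforms.RandomHorizontalFlip()

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)  # training scale S per image
        img = F.resize(img, s)                      # shorter side -> S, aspect ratio kept
        return self.flip(self.crop(img))


# Usage sketch:
# train_transform = transforms.Compose([RandomScaleThenCrop(), transforms.ToTensor()])
```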
Multi-Scale Training Results.
Compared with single-scale training:
VGG-13 reduced the error rate from 9.4%/9.3% to 8.8%.
VGG-16 reduced the error rate from 8.8%/8.7% to 8.1%.
VGG-19 reduced the error rate from 9.0%/8.7% to 8.0%.
Similar to multi-scale training, multi-scale testing can also reduce the error rate, since we do not know the size of the object in the test image.
Multi-Scale Testing Results.
By using multi-scale testing with single-scale training, the error rate is reduced. Compared with single-scale training and single-scale testing:
VGG-13 reduced the error rate from 9.4%/9.3% to 9.2%.
VGG-16 reduced the error rate from 8.8%/8.7% to 8.6%.
VGG-19 reduced the error rate from 9.0%/8.7% to 8.7%/8.6%.
By using both multi-scale training and multi-scale testing, the error rate is reduced further. Compared with multi-scale testing only:
VGG-13 reduced the error rate from 9.2%/9.2% to 8.2%.
VGG-16 reduced the error rate from 8.6%/8.6% to 7.5%.
VGG-19 reduced the error rate from 8.7%/8.6% to 7.5%.
It seems like we are manipulating the sampling frequency of brain recordings!
During testing in AlexNet, the 4 corners and the center of the image, as well as their horizontal flips, are cropped, i.e. 10 crops per test image. The output probability vectors are then added or averaged to get a better result.
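For reference, torchvision's TenCrop transform implements this 10-crop scheme; a minimal sketch (my own, with a hypothetical model variable) might be:

```python
import torch
from torchvision import transforms

# 4 corners + center plus their horizontal flips -> 10 crops per test image.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # returns a tuple of 10 PIL crops
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),
])

# crops = ten_crop(test_image)                      # (10, 3, 224, 224)
# probs = model(crops).softmax(dim=1).mean(dim=0)   # average the 10 predictions
```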
At test time, VGGNet differs from the training-time network, as shown below:
VGGNet During Testing.
The first FC layer is replaced by a 7×7 conv. The second and third FC layers are replaced by 1×1 convs. Thus, all FC layers are replaced by conv layers, so the network becomes fully convolutional and can accept inputs of arbitrary size.
During testing in VGGNet, the test image goes directly through the network to obtain a class score map. This class score map is then spatially averaged to produce a fixed-size vector.
Workflow of VGGNet Testing.
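As a concrete sketch of this fully-convolutional (dense) evaluation, the classifier of a VGG-16-style network can be converted from FC layers to convolutions as below. This is my own PyTorch illustration assuming the standard 512×7×7 feature map, not the authors' original code:

```python
import torch
import torch.nn as nn

def fc_to_conv(fc6, fc7, fc8):
    """Re-express the three VGG-16 FC layers (25088->4096, 4096->4096, 4096->1000)
    as 7x7, 1x1 and 1x1 convolutions, so any input size yields a class score map."""
    conv6 = nn.Conv2d(512, 4096, kernel_size=7)
    conv6.weight.data = fc6.weight.data.view(4096, 512, 7, 7)
    conv6.bias.data = fc6.bias.data

    conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
    conv7.weight.data = fc7.weight.data.view(4096, 4096, 1, 1)
    conv7.bias.data = fc7.bias.data

    conv8 = nn.Conv2d(4096, 1000, kernel_size=1)
    conv8.weight.data = fc8.weight.data.view(1000, 4096, 1, 1)
    conv8.bias.data = fc8.bias.data

    return nn.Sequential(conv6, nn.ReLU(inplace=True),
                         conv7, nn.ReLU(inplace=True),
                         conv8)

# At test time the class score map is averaged over spatial positions:
# scores = fc_to_conv(fc6, fc7, fc8)(features)   # (N, 1000, H', W')
# probs  = scores.mean(dim=(2, 3)).softmax(1)    # (N, 1000)
```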
With dense evaluation, only 2 forward passes per test image are needed, even if we include the horizontal flip. Three settings are compared: Dense (VGGNet's approach), Multi-crop (AlexNet's approach), and Dense + Multi-crop (both).
Fusion of All Techniques Mentioned Above.
By combining VGG-16 and VGG-19 with multi-scale training, multi-scale testing, multi-crop and dense evaluation, the error rate is reduced to 6.8%.
Comparison Between VGGNet and GoogLeNet.
Compared with GoogLeNet using an ensemble of 7 nets, which has an error rate of 6.7%, VGGNet using an ensemble of only 2 nets, plus multi-scale training, multi-scale testing, multi-crop and dense evaluation, achieves 6.8%, which is competitive.
With only 1 net, VGGNet has a 7.0% error rate, which is better than GoogLeNet's 7.9%.
However, at the time of the ILSVRC 2014 submission, VGGNet only achieved a 7.3% error rate, which made it the 1st runner-up at that moment.
For the localization task, a bounding box is represented by a 4-D vector storing its center coordinates, width, and height. Thus, the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth.
There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR) or is class-specific (per-class regression, PCR). In the former case, the last layer is 4-D, while in the latter it is 4000-D (since there are 1000 classes in the dataset).
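A minimal PyTorch sketch of the two regression heads and the Euclidean (MSE) loss, with made-up variable names for illustration only (not the authors' code):

```python
import torch
import torch.nn as nn

num_classes = 1000
feat_dim = 4096  # output of the last hidden FC layer in VGG

# Single-class regression (SCR): one box shared across all classes (4-D output).
scr_head = nn.Linear(feat_dim, 4)                # (cx, cy, w, h)
# Per-class regression (PCR): one box per class (4000-D output).
pcr_head = nn.Linear(feat_dim, 4 * num_classes)

euclidean_loss = nn.MSELoss()

features = torch.randn(8, feat_dim)       # dummy batch of FC features
target_boxes = torch.randn(8, 4)           # ground-truth (cx, cy, w, h)
labels = torch.randint(0, num_classes, (8,))

# SCR: regress the shared box directly.
loss_scr = euclidean_loss(scr_head(features), target_boxes)

# PCR: select the 4 outputs belonging to each sample's ground-truth class.
all_boxes = pcr_head(features).view(-1, num_classes, 4)
pred_boxes = all_boxes[torch.arange(8), labels]
loss_pcr = euclidean_loss(pred_boxes, target_boxes)
```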
Localization Results.
As shown above, PCR is better than SCR, and fine-tuning all layers is better than fine-tuning only the 1st and 2nd FC layers. The results above are obtained using only the center crop.
Multi-Scale Training and Testing.
With the multi-scale training and testing just described in the previous sections, the top-5 localization error is reduced to 25.3%.
Comparison with state-of-the-art results.
VGGNet even outperforms GoogLeNet, as shown above, and won the localization task in ILSVRC 2014.
VOC 2007, 2012 and Caltech 101 and 256 Dataset Results.
VGGNet has the best results on the VOC 2007, VOC 2012 and Caltech 256 datasets, and it also has competitive results on the Caltech 101 dataset.
Sik-Ho Tang. Review: VGGNet — 1st Runner-Up (Image Classification), Winner (Localization) in ILSVRC 2014.