In this story, SPPNet is reviewed. SPPNet introduced a new technique into CNNs called Spatial Pyramid Pooling (SPP).
This is a work from Microsoft.
In ILSVRC 2014, SPPNet got 1st Runner Up in Object Detection, 2nd Runner Up in Image Classification, and 5th Place in the Localization Task. It also yielded two papers, in 2014 ECCV [1] and 2015 TPAMI [2], with over 1000 and 600 citations respectively. Thus, SPPNet is one of the deep learning papers worth reading.
Classification: Over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, roughly 1.3M/50k/100k images are used for the training/validation/testing sets.
Detection: 200 categories. 450k/20k/40k images are used for the training/validation/testing sets.
Three-Level Spatial Pyramid Pooling (SPP) in SPPNet with Pyramid {4×4, 2×2, 1×1}.
Conventionally, at the transition from the conv layers to the FC layers, there is a single pooling layer, or even no pooling layer at all.
In the above figure, a 3-level SPP is used. Suppose the conv5 layer has 256 feature maps. Then at the SPP layer, each feature map is pooled into 1 bin (1×1), 4 bins (2×2), and 16 bins (4×4), and the pooled values are concatenated into a fixed-length vector of (16+4+1)×256 = 5,376 dimensions, regardless of the input image size.
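As a concrete illustration, here is a minimal PyTorch sketch of such an SPP layer. It uses adaptive max pooling, which approximates the paper's per-level window/stride computation; the 256 maps and the {4×4, 2×2, 1×1} pyramid follow the figure above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pool conv feature maps into a fixed-length vector, whatever their spatial size."""
    def __init__(self, levels=(4, 2, 1)):           # the {4x4, 2x2, 1x1} pyramid from the figure
        super().__init__()
        self.levels = levels

    def forward(self, x):                            # x: (batch, 256, H, W) with arbitrary H, W
        pooled = [F.adaptive_max_pool2d(x, level).flatten(1) for level in self.levels]
        return torch.cat(pooled, dim=1)               # (batch, (16+4+1)*256) = (batch, 5376)

spp = SpatialPyramidPooling()
print(spp(torch.randn(1, 256, 13, 13)).shape)         # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 10, 7)).shape)          # same 5376-d output for a different size
```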
With SPP, we don't need to crop the image to a fixed size, as in AlexNet, before feeding it into the CNN. Images of any size can be used as input.
SPPNet supports any input size due to the use of SPP.
Since SPP accepts inputs of variable size, images of different sizes should be fed into the network during training to increase its robustness.
However, for the efficiency of the training process, only 224×224 and 180×180 images are used as input. Two networks, the 180-network and the 224-network, are trained with shared parameters.
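To see why the two networks can share all parameters, here is a hedged sketch: the same conv weights process both input sizes, and SPP makes the FC input length identical. The conv stack and FC size below are stand-ins for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spp(x, levels=(4, 2, 1)):                        # same pooling as in the sketch above
    return torch.cat([F.adaptive_max_pool2d(x, k).flatten(1) for k in levels], dim=1)

conv = nn.Sequential(nn.Conv2d(3, 256, 3, stride=2, padding=1), nn.ReLU())  # stand-in conv layers
fc = nn.Linear((16 + 4 + 1) * 256, 1000)             # one FC head serves both "networks"

for size in (224, 180):                               # training alternates between the two sizes
    x = torch.randn(2, 3, size, size)
    logits = fc(spp(conv(x)))                         # identical parameters, identical output shape
    print(size, logits.shape)                         # torch.Size([2, 1000]) for both input sizes
```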
The authors replicated ZFNet [3], AlexNet [4], and OverFeat [5] with the modifications below (the number after each model name is the number of conv layers).
Replicated Models as Baselines.
Top-5 Error Rates for SPP and Multi-Size Training.
A 4-level SPP is used here with the pyramid {6×6, 3×3, 2×2, 1×1}, i.e. 36+9+4+1 = 50 bins per feature map.
As shown above, with SPP alone, the error rates improve for all models. With multi-size training, the error rates improve further. (10-view means 10-crop testing: the four corners plus the center crop, and their horizontal flips.)
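A minimal sketch of such 10-view testing with torchvision; the `model`, the resize size, and the crop size are assumptions for illustration.

```python
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),                          # 4 corners + center, plus horizontal flips
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict_10_view(model, pil_image):
    crops = ten_crop(pil_image)                       # (10, 3, 224, 224)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)    # (10, num_classes)
    return probs.mean(dim=0)                          # average the predictions of the 10 views
```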
Since a full image can also be fed into the CNN thanks to SPP, the authors compare full-image input with using only a single center crop as input:
Top-1 Error Rates for Full Image Representation.
Top-1 error rates all improve with the full image as input.
With full-image support through SPP, multi-view testing can be performed easily.
Error Rates in ILSVRC 2012 (All are Single Model Results).
SPPNet using OverFeat-7 obtains 9.14%/9.08% top-5 error rates on the validation/test sets, the only results under 10% in the table.
11 SPPNet models are used at test time, and their outputs are averaged to obtain a more accurate prediction. This model-averaging ensemble technique is also used in many CNN models such as LeNet, AlexNet, and ZFNet.
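A hedged sketch of this model averaging, assuming a list of independently trained networks (`models` here is a placeholder):

```python
import torch

def ensemble_predict(models, images):
    """Average the softmax outputs of several independently trained networks."""
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            probs.append(torch.softmax(model(images), dim=1))
    return torch.stack(probs).mean(dim=0)             # (batch, num_classes), averaged over models
```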
2nd Runner Up in Image Classification (ILSVRC 2014).
An 8.06% error rate is obtained. Unfortunately, VGGNet and GoogLeNet achieve better performance with deeper models, so SPPNet only took 2nd runner-up in the classification task.
SPPNet for Object Detection.
Compared with R-CNN, SPPNet runs the conv layers over the image only once, whereas R-CNN runs the conv layers about 2,000 times since there are roughly 2,000 region proposals. The image below illustrates the idea:
R-CNN (Left) and SPPNet (Right).
After the FC layers for each bounding box, an SVM classifier and a bounding-box regressor are still needed, so this is not an end-to-end learning architecture.
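A rough sketch of this detection pipeline under simplified assumptions: the conv/FC stacks, the stride of 16, and the box-projection rule below are stand-ins, not the paper's exact mapping.

```python
import torch
import torch.nn.functional as F

def spp(x, levels=(4, 2, 1)):                         # same fixed-length pooling as above
    return torch.cat([F.adaptive_max_pool2d(x, k).flatten(1) for k in levels], dim=1)

def detect_features(conv_layers, fc_layers, image, proposals, stride=16):
    feat = conv_layers(image)                         # conv layers run ONCE, not ~2k times as in R-CNN
    outputs = []
    for (x1, y1, x2, y2) in proposals:                # ~2k boxes, e.g. from selective search
        fx1, fy1 = x1 // stride, y1 // stride         # project the image-space box onto conv5
        fx2 = max(fx1 + 1, x2 // stride)
        fy2 = max(fy1 + 1, y2 // stride)
        window = feat[:, :, fy1:fy2, fx1:fx2]         # arbitrary-sized window on the feature map
        outputs.append(fc_layers(spp(window)))        # fixed-length vector -> FC features
    return torch.stack(outputs)                       # later scored by per-class SVMs + bbox regressors
```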
VOC 2007 Results.
Some Amazing Results in VOC 2007.
In VOC 2007, as shown above, SPPNet with 5 scales obtains a higher mAP of 59.2% compared with R-CNN.
SPPNet got 1st Runner-Up in ILSVRC 2014 Object Detection.
In ILSVRC 2014, SPPNet obtains 35.1% mAP and took 1st runner-up in the object detection task.
Sik-Ho Tang. Review: SPPNet — 1st Runner Up (Object Detection), 2nd Runner Up (Image Classification) in ILSVRC 2014.