In this story, SPPNet is reviewed. SPPNet introduced a new technique into CNNs called Spatial Pyramid Pooling (SPP).
This is a work from Microsoft.
In ILSVRC 2014, SPPNet got 1st Runner Up in Object Detection, 2nd Runner Up in Image Classification, and 5th Place in the Localization Task. It also yielded two papers, in 2014 ECCV [1] and 2015 TPAMI [2], with over 1000 and 600 citations respectively. Thus, SPPNet is one of the deep learning papers worth reading.
Classification: Over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, roughly 1.3M/50k/100k images are used for the training/validation/testing sets.
Detection: 200 categories. 450k/20k/40k images are used for the training/validation/testing sets.
Three-Level Spatial Pyramid Pooling (SPP) in SPPNet with Pyramid {4×4, 2×2, 1×1}.
Conventionally, at the transition from the conv layers to the FC layers, there is a single pooling layer, or even no pooling layer at all.
In the above figure, a 3-level SPP is used. Suppose the conv5 layer has 256 feature maps. Then at the SPP layer, each feature map is pooled into 1 bin (1×1), 4 bins (2×2), and 16 bins (4×4), and the pooled values are concatenated into a fixed-length vector of (16+4+1)×256 = 5,376 dimensions, regardless of the input image size.
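As a concrete illustration, here is a minimal PyTorch sketch of such an SPP layer. It uses adaptive max pooling, which approximates the paper's per-level window/stride computation; the 256 maps and the {4×4, 2×2, 1×1} pyramid follow the figure above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pool conv feature maps into a fixed-length vector, whatever their spatial size."""
    def __init__(self, levels=(4, 2, 1)):           # the {4x4, 2x2, 1x1} pyramid from the figure
        super().__init__()
        self.levels = levels

    def forward(self, x):                            # x: (batch, 256, H, W) with arbitrary H, W
        pooled = [F.adaptive_max_pool2d(x, level).flatten(1) for level in self.levels]
        return torch.cat(pooled, dim=1)               # (batch, (16+4+1)*256) = (batch, 5376)

spp = SpatialPyramidPooling()
print(spp(torch.randn(1, 256, 13, 13)).shape)         # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 10, 7)).shape)          # same 5376-d output for a different size
```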
With SPP, we don't need to crop the image to a fixed size, as in AlexNet, before feeding it into the CNN. Images of any size can be used as input.
SPPNet supports any input size due to the use of SPP.
Since SPP accepts inputs of variable size, images of different sizes should be fed into the network during training to increase its robustness.
However, for the efficiency of the training process, only 224×224 and 180×180 images are used as input. Two networks, the 180-network and the 224-network, are trained with shared parameters.
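To see why the two networks can share all parameters, here is a hedged sketch: the same conv weights process both input sizes, and SPP makes the FC input length identical. The conv stack and FC size below are stand-ins for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spp(x, levels=(4, 2, 1)):                        # same pooling as in the sketch above
    return torch.cat([F.adaptive_max_pool2d(x, k).flatten(1) for k in levels], dim=1)

conv = nn.Sequential(nn.Conv2d(3, 256, 3, stride=2, padding=1), nn.ReLU())  # stand-in conv layers
fc = nn.Linear((16 + 4 + 1) * 256, 1000)             # one FC head serves both "networks"

for size in (224, 180):                               # training alternates between the two sizes
    x = torch.randn(2, 3, size, size)
    logits = fc(spp(conv(x)))                         # identical parameters, identical output shape
    print(size, logits.shape)                         # torch.Size([2, 1000]) for both input sizes
```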
The authors replicated ZFNet [3], AlexNet [4], and OverFeat [5] with the modifications below (the number after each model name is the number of conv layers).
Replicated Models as Baselines.
Top-5 Error Rates for SPP and Multi-Size Training.
A 4-level SPP is used here with the pyramid {6×6, 3×3, 2×2, 1×1}, i.e. 36+9+4+1 = 50 bins per feature map.
As shown above, with SPP alone, the error rates improve for all models. With multi-size training, the error rates improve further. (10-view means 10-crop testing: the four corners plus the center crop, and their horizontal flips.)
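A minimal sketch of such 10-view testing with torchvision; the `model`, the resize size, and the crop size are assumptions for illustration.

```python
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),                          # 4 corners + center, plus horizontal flips
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict_10_view(model, pil_image):
    crops = ten_crop(pil_image)                       # (10, 3, 224, 224)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)    # (10, num_classes)
    return probs.mean(dim=0)                          # average the predictions of the 10 views
```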
Since a full image can also be fed into the CNN thanks to SPP, the authors compare full-image input with using only a single center crop as input:
Top-1 Error Rates for Full Image Representation.
Top-1 error rates all improve with the full image as input.
With full-image support through SPP, multi-view testing can be performed easily.
Error Rates in ILSVRC 2012 (All are Single Model Results).
SPPNet using OverFeat-7 obtains 9.14%/9.08% top-5 error rates on the validation/test sets, the only results under 10% in the table.
11 SPPNet models are used at test time, and their outputs are averaged to obtain a more accurate prediction. This model-averaging ensemble technique is also used in many CNN models such as LeNet, AlexNet, and ZFNet.
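A hedged sketch of this model averaging, assuming a list of independently trained networks (`models` here is a placeholder):

```python
import torch

def ensemble_predict(models, images):
    """Average the softmax outputs of several independently trained networks."""
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            probs.append(torch.softmax(model(images), dim=1))
    return torch.stack(probs).mean(dim=0)             # (batch, num_classes), averaged over models
```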
2nd Runner Up in Image Classification (ILSVRC 2014).
An 8.06% error rate is obtained. Unfortunately, VGGNet and GoogLeNet achieve better performance with deeper models, so SPPNet only took 2nd runner-up in the classification task.
SPPNet for Object Detection.
Compared with R-CNN, SPPNet runs the conv layers over the image only once, whereas R-CNN runs the conv layers about 2,000 times since there are roughly 2,000 region proposals. The image below illustrates the idea:
R-CNN (Left) and SPPNet (Right).
After the FC layers for each bounding box, an SVM classifier and a bounding-box regressor are still needed, so this is not an end-to-end learning architecture.
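A rough sketch of this detection pipeline under simplified assumptions: the conv/FC stacks, the stride of 16, and the box-projection rule below are stand-ins, not the paper's exact mapping.

```python
import torch
import torch.nn.functional as F

def spp(x, levels=(4, 2, 1)):                         # same fixed-length pooling as above
    return torch.cat([F.adaptive_max_pool2d(x, k).flatten(1) for k in levels], dim=1)

def detect_features(conv_layers, fc_layers, image, proposals, stride=16):
    feat = conv_layers(image)                         # conv layers run ONCE, not ~2k times as in R-CNN
    outputs = []
    for (x1, y1, x2, y2) in proposals:                # ~2k boxes, e.g. from selective search
        fx1, fy1 = x1 // stride, y1 // stride         # project the image-space box onto conv5
        fx2 = max(fx1 + 1, x2 // stride)
        fy2 = max(fy1 + 1, y2 // stride)
        window = feat[:, :, fy1:fy2, fx1:fx2]         # arbitrary-sized window on the feature map
        outputs.append(fc_layers(spp(window)))        # fixed-length vector -> FC features
    return torch.stack(outputs)                       # later scored by per-class SVMs + bbox regressors
```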
VOC 2007 Results.
Some Amazing Results in VOC 2007.
In VOC 2007, as shown above, SPPNet with 5 scales obtains a higher mAP of 59.2% compared with R-CNN.
SPPNet got 1st Runner-Up in ILSVRC 2014 Object Detection.
In ILSVRC 2014, SPPNet obtains 35.1% mAP and took 1st runner-up in the object detection task.
Sik-Ho Tang. Review: SPPNet — 1st Runner Up (Object Detection), 2nd Runner Up (Image Classification) in ILSVRC 2014.