Sik-Ho Tang | Review: Trimps-Soushen -- Winner in ILSVRC 2016 (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, the approach of Trimps-Soushen, the winner of the ILSVRC 2016 classification task, is reviewed. Trimps stands for The Third Research Institute of Ministry of Public Security, or in Chinese 公安部三所. In brief, Trimps is a research institute for advancing public security technologies in China, which was launched in 1978 in Shanghai. Soushen appears to be the team name under Trimps, in Chinese 搜神. It means god of search, where Sou (搜) means search and Shen (神) means god.

Trimps-Soushen achieved top-3 results in several ILSVRC 2016 tasks:

Object Localization: 1st place, 7.71% error.
Object Classification: 1st place, 2.99% error.
Object Detection: 3rd place, 61.82% mAP.
Scene Classification: 3rd place, 10.3% error.
Object Detection from video: 3rd place, 70.97% mAP.

ILSVRC 2016 Classification Ranking http://image-net.org/challenges/LSVRC/2016/results#loc

ILSVRC Classification Results from 2011 to 2016.

Though Trimps-Soushen has state-of-the-art results on multiple recognition tasks, the team did not introduce any new technique or architecture. Maybe for this reason, they haven't published any papers or technical reports about it.

Instead, they only shared their results at the ImageNet and COCO joint workshop at ECCV 2016. They also reported some interesting observations about the dataset.

NorbertZheng commented 1 year ago

Ensemble Using Different Models

ImageNet Classification Errors for Top-10 Difficult Categories.

Trimps-Soushen used pretrained Inception-v3, Inception-v4, Inception-ResNet-v2, Pre-Activation ResNet-200, and Wide ResNet (WRN-68-2) models for classification, and identified the top-10 most difficult categories shown above.

The results are diverse, which means no single model is dominant across all categories. Each model is strong at classifying some categories but weak at others.
The diversity of models can be used to improve the accuracy, e.g. by boosting.

During training, Trimps-Soushen used only multi-scale augmentation and a large mini-batch size. During testing, multi-scale and flipped inputs are used with dense fusion.

ImageNet Top-5 Error Rate Results.

The validation errors of the 5 best single models range from 3.52% to 4.65%.
By ensembling these 5 models (with a higher weight for Inception-ResNet-v2), a 2.92% validation error is obtained.
A 2.99% test error is obtained, which makes this the first entry with an error rate under 3%.
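The weighted ensembling described above can be illustrated with a minimal sketch. It assumes that each model outputs a probability vector over the classes and that the vectors are fused by a weighted average. The function names, the toy probabilities, and the weights are all hypothetical; the source only states that Inception-ResNet-v2 received a higher weight, not the actual values or the exact fusion rule.

```python
def weighted_ensemble(prob_lists, weights):
    """Fuse per-model class probabilities with a normalized weighted average."""
    if len(prob_lists) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    fused = [0.0] * len(prob_lists[0])
    for probs, w in zip(prob_lists, weights):
        for c, p in enumerate(probs):
            fused[c] += (w / total) * p
    return fused


def top_k(probs, k=5):
    """Return the indices of the k highest-probability classes, best first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]


# Toy example: 3 hypothetical models, 4 classes (made-up numbers).
model_probs = [
    [0.50, 0.30, 0.15, 0.05],  # model A
    [0.20, 0.60, 0.10, 0.10],  # model B, given a higher weight below
    [0.40, 0.35, 0.20, 0.05],  # model C
]
weights = [1.0, 2.0, 1.0]  # illustrative only; the real weights are not given

fused = weighted_ensemble(model_probs, weights)
print(top_k(fused, k=2))
```

Because the weights are normalized, the fused vector is still a probability distribution, and the model with the larger weight pulls the ranking towards its own prediction.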

NorbertZheng commented 1 year ago

Some Findings Based on Top-20 Accuracy

Top-k Accuracy.

Top-k Accuracy is obtained as shown above. When k=20, 99.27% accuracy is obtained. The error rate is smaller than 1%.
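As a reminder of how Top-k accuracy is computed, here is a small sketch: a prediction counts as correct when the true label is among the k highest-scoring classes. The function name and the toy scores are illustrative and not taken from the slides.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: 4 samples, 3 classes (made-up scores).
scores = [
    [0.1, 0.7, 0.2],  # true label 1 is ranked first
    [0.5, 0.2, 0.3],  # true label 2 is ranked second
    [0.6, 0.3, 0.1],  # true label 2 is ranked third
    [0.2, 0.3, 0.5],  # true label 2 is ranked first
]
labels = [1, 2, 2, 2]

for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, labels, k))
```

Accuracy can only grow with k, which is why the remaining errors at k=20 are the interesting ones.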

Why are there still errors when Top-20 accuracy is used?

Trimps-Soushen analysed the error images in great detail.

They manually analysed 1458 error images from the validation set, and grouped them into roughly 7 categories of errors as below:

7 Error Categories.

NorbertZheng commented 1 year ago

Label May Wrong

Label May Wrong (Maybe it is really a sleeping bag for Hello Kitty? lol).

The ground truth is sleeping bag, but the object is obviously a pencil box.

This is because the ground truths in the ImageNet dataset are labelled manually. As ImageNet contains over 15 million labeled high-resolution images in around 22,000 categories, and a 1000-category subset is used for the competition, some wrong labels are to be expected.

There are 221 out of 1458 error images which are “Label May Wrong”, which is about 15.16%.

NorbertZheng commented 1 year ago

Multiple Objects (>5)

Multiple Objects (>5) (Which is the main object?).

The above image contains multiple objects (>5). This kind of image is actually not suitable for the ILSVRC classification task, because only one class should be identified for each image.

There are 118 out of 1458 error images which are “Multiple Objects (>5)”, which is about 8.09%.

NorbertZheng commented 1 year ago

Non-Obvious Main Object

Non-Obvious Main Object (Please find the paper towel in the image, lol !!)

As only one class should be identified in the classification task, an image needs one obvious main object, and the above image does not have one. It could be boat or dock, but the ground truth is paper towel.

There are 355 out of 1458 error images which are “Non-Obvious Main Object”, which is about 24.35%.

NorbertZheng commented 1 year ago

Confusing Label

Confusing Label (Maybe there is no sunscreen inside, lol.)

The ground truth is sunscreen. This time the label seems to be correct, as the carton mentions SPF30. But the task then becomes understanding the meaning of the text on the carton, which goes far beyond the original objective of recognizing objects based on shapes and colors.

There are 206 out of 1458 error images which are “Confusing Label”, which is about 14.13%.

This category requires semantic understanding.

NorbertZheng commented 1 year ago

Fine-Grained Label

Fine-Grained Label.

The ground truth is correct. Both bolete and stinkhorn are types of fungi. Indeed, this type of label is difficult even for humans to identify.

There are 258 out of 1458 error images which are “Fine-Grained Label”, which is about 17.70%.

The network can still improve on this category.

NorbertZheng commented 1 year ago

Obvious Wrong

Obvious Wrong.

The ground truth is correct. And the network cannot predict it even using top-20 prediction.

There are 234 out of 1458 error images which are “Obvious Wrong”, which is about 16.05%.

The network can still improve on this category.

NorbertZheng commented 1 year ago

Partial Object

Partial Object.

The image may contain only a part of the object, which is hard to recognize. Maybe the image would be easier to classify if it were zoomed out to show multiple tables and chairs, looking like a restaurant.

There are 66 out of 1458 error images which are “Partial Object”, which is about 4.53%.

NorbertZheng commented 1 year ago

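Therefore, the accuracy is hard to improve by the remaining 1%, since most of these errors stem from the labels or the images themselves rather than from the network.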
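The per-category shares quoted in the sections above can be re-derived with a few lines of arithmetic. This sketch takes the count for “Label May Wrong” as 221, which is an assumption: it is the value consistent with the quoted 15.16% and the one that makes the seven categories sum to 1458. The variable names are illustrative.

```python
# Error counts per category, out of 1458 manually analysed validation images.
# "Label May Wrong" is taken as 221 here (an assumption: it matches the quoted
# 15.16% and makes the seven categories sum to 1458).
error_counts = {
    "Label May Wrong": 221,
    "Multiple Objects (>5)": 118,
    "Non-Obvious Main Object": 355,
    "Confusing Label": 206,
    "Fine-Grained Label": 258,
    "Obvious Wrong": 234,
    "Partial Object": 66,
}

total = sum(error_counts.values())
shares = {name: round(100 * n / total, 2) for name, n in error_counts.items()}

# The two categories the review marks as improvable by the network.
improvable = ("Fine-Grained Label", "Obvious Wrong")
improvable_share = round(
    100 * sum(error_counts[name] for name in improvable) / total, 2
)

for name, pct in shares.items():
    print(f"{name:26s} {pct:5.2f}%")
print(f"Improvable by the network: {improvable_share:.2f}%")
```

Only about a third of the analysed errors fall into the two categories marked as improvable by the network; the rest are tied to labels or image content.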

NorbertZheng commented 1 year ago

Region Fusion (Image Localization)

Region Fusion for Image Localization.

To localize the top-5 predicted labels within the image, a Faster R-CNN architecture with multiple models is used. The models generate region proposals via the region proposal network (RPN) in Faster R-CNN. The localization prediction is then performed based on the top-5 predicted classification labels.
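The idea of localizing only the top-5 classification labels can be sketched as follows. This is a simplified illustration and not the team's actual region fusion, whose details are not given: it assumes each detection model returns scored, labelled boxes, pools them, and keeps the highest-scoring box for each top-5 label. The function name, the dictionary layout, and the toy detections are all hypothetical.

```python
def localize_top5(top5_labels, detections_per_model):
    """For each top-5 label, keep the highest-scoring box pooled over all models.

    Assumption: every detection is a dict with "label", "score" and "box"
    (x1, y1, x2, y2). This layout is chosen only for the example.
    """
    pooled = [det for dets in detections_per_model for det in dets]
    boxes = {}
    for label in top5_labels:
        candidates = [det for det in pooled if det["label"] == label]
        if candidates:
            best = max(candidates, key=lambda det: det["score"])
            boxes[label] = best["box"]
    return boxes


# Toy example with two hypothetical detection models.
detections_per_model = [
    [
        {"label": "dog", "score": 0.80, "box": (10, 20, 110, 220)},
        {"label": "cat", "score": 0.40, "box": (5, 5, 50, 50)},
    ],
    [
        {"label": "dog", "score": 0.90, "box": (12, 18, 108, 215)},
    ],
]
top5_labels = ["dog", "cat", "sofa", "lamp", "rug"]

boxes = localize_top5(top5_labels, detections_per_model)
print(boxes)
```

Labels without any matching detection simply get no box in this sketch; a real system would need a fallback for them.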

Image Localization Top-5 Validation Error Results.

Previous state-of-the-art approaches: Errors of 8.51% to 9.27% are obtained.
Ensemble all: By fusing all approaches together, 7.58% error is achieved.
Ensemble all but one model: Only 7.75% to 7.93% error is obtained.

Thus, diversity between models is important and contributes a large improvement in prediction accuracy.

ILSVRC Localization Top-5 Test Error Results from 2012 to 2016.

NorbertZheng commented 1 year ago

Multi-Model Fusion for Other Tasks (Scene Classification / Object Detection / Object Detection from Video)

Scene Classification

Multi-Scale & Multi-Model Fusion for Scene Classification.

An improved multi-scale approach is used: the results are concatenated instead of just being added together at the end of the network.

On top of using multi-scale inputs for prediction with the same model, two trained models (I believe using the same network architecture) are combined by concatenating their results and passing them through an FC layer and softmax. 10.80% validation error is obtained.

By using 7 × 2 models, 10.39% validation error and 10.42% test error are obtained.
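The difference between adding and concatenating the per-scale results can be shown with a tiny sketch. It assumes each scale (or model) yields a feature vector and that an FC layer followed by softmax produces the final prediction. The helper names, the feature values, and the FC weights are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def fully_connected(features, weights, bias):
    """One FC layer: a weight row and a bias value per output class."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def fuse_by_sum(feature_sets):
    """Element-wise addition: the output keeps the per-scale dimension."""
    return [sum(vals) for vals in zip(*feature_sets)]


def fuse_by_concat(feature_sets):
    """Concatenation: the output keeps every scale's features separately."""
    return [v for feats in feature_sets for v in feats]


# Toy features from two scales (made-up numbers).
scale_a = [1.0, 0.0]
scale_b = [0.0, 2.0]

summed = fuse_by_sum([scale_a, scale_b])
concatenated = fuse_by_concat([scale_a, scale_b])

# Hypothetical FC weights over the concatenated features, 2 output classes.
fc_weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
fc_bias = [0.0, 0.0]
probs = softmax(fully_connected(concatenated, fc_weights, fc_bias))
print(summed, concatenated, probs)
```

With concatenation the FC layer can learn a separate weight for each scale's features, whereas addition merges them before any learned weighting is applied.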

Scene Classification Top-5 Test Error Results.

With a model that also uses the external dataset Places2 for pretraining, 10.3% top-5 test error is obtained, which took 3rd place in scene classification.

NorbertZheng commented 1 year ago

Object Detection

Similar to image localization, Faster R-CNN architecture is used with multi-model fusion.

Object Detection mAP Results.

61.82% mAP is obtained.

NorbertZheng commented 1 year ago

Object Detection from Video

Object Detection from Video mAP Results.

Optical flow guided motion prediction is also used to reduce false negative detections. 70.97% mAP is obtained.

NorbertZheng commented 1 year ago

Due to the model diversity, model fusion is effective. By using model fusion, Trimps-Soushen outperformed ResNeXt and PolyNet and took 1st place in image classification. Model fusion was also successfully applied to other tasks.

This means that besides innovative optimization or novel network architecture design, other techniques such as multi-model fusion can also improve the accuracy considerably.

On the other hand, if Trimps-Soushen had used ResNeXt and PolyNet for model fusion, perhaps the error could have been reduced further, since ResNeXt and PolyNet obtain higher accuracy than the models used for model fusion.

NorbertZheng commented 1 year ago

References

[2016 ECCV] [Trimps-Soushen] (Slides Only) Good Practices for Deep Feature Fusion.

NorbertZheng / read-papers