NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review: Xception -- With Depthwise Separable Convolution, Better Than Inception-v3 (Image Classification). #113


NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: Xception — With Depthwise Separable Convolution, Better Than Inception-v3 (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, Xception [1], by Google, which stands for "Extreme Inception", is reviewed. With a modified depthwise separable convolution, it is even better than Inception-v3 [2] (also by Google, 1st runner-up in ILSVRC 2015) on both the ImageNet ILSVRC and JFT datasets. Though it is a 2017 CVPR paper that had only just been published, it already had more than 300 citations when this story was written.

NorbertZheng commented 1 year ago

Original Depthwise Separable Convolution

Figure: Original Depthwise Separable Convolution.

The original depthwise separable convolution is a depthwise convolution followed by a pointwise convolution.

Compared with a conventional convolution, we do not need to perform the convolution across all channels at once. That means the number of connections is smaller and the model is lighter.
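
As an illustration (a minimal PyTorch sketch of my own, not the authors' code; module and variable names are mine), the original ordering looks like this:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Original order: depthwise (per-channel) conv, then 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch: each spatial filter sees only its own input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_ch, bias=False)
        # the 1x1 conv then mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = DepthwiseSeparableConv(64, 128)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 128, 56, 56])
```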

NorbertZheng commented 1 year ago

Conventional Convolution vs. Depthwise Convolution.

Figure: Conventional Convolution vs. Depthwise Convolution.

The original depthwise separable convolution differs from the conventional convolution in that its spatial (depthwise) filters are shared across the output feature maps, with the pointwise convolution mixing them afterwards. In the conventional convolution, one filter spans all input channels and generates exactly one final feature map!!!
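
To see why this is lighter, a quick back-of-the-envelope count (my own snippet, not from the paper): for M input channels, N output channels, and a k×k kernel (bias ignored), a conventional convolution holds k²·M·N weights, while the depthwise separable factorization holds only k²·M + M·N:

```python
import torch.nn as nn

def count(m):
    return sum(p.numel() for p in m.parameters())

M, N, k = 256, 256, 3
conv = nn.Conv2d(M, N, k, padding=1, bias=False)                 # k*k*M*N weights
depthwise = nn.Conv2d(M, M, k, padding=1, groups=M, bias=False)  # k*k*M weights
pointwise = nn.Conv2d(M, N, kernel_size=1, bias=False)           # M*N weights
print(count(conv), count(depthwise) + count(pointwise))          # 589824 vs. 67840
```

With M = N = 256 and k = 3, that is 589,824 vs. 67,840 weights, roughly 8.7× fewer.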

NorbertZheng commented 1 year ago

Modified Depthwise Separable Convolution in Xception

Figure: The modified depthwise separable convolution used as an Inception module in Xception, the so-called "extreme" version of the Inception module (n=3 here).

This modification is motivated by the Inception module in Inception-v3, where the 1×1 convolution is performed before any n×n spatial convolutions. The modified version is thus a bit different from the original one. (n=3 here, since 3×3 spatial convolutions are used in Inception-v3.)

Two minor differences from the original depthwise separable convolution:

1. The order of operations: the modified version performs the 1×1 (pointwise) convolution first, then the channel-wise n×n spatial convolutions, whereas the original performs the depthwise convolution first.
2. The presence or absence of a non-linearity after the first operation: in Inception, both operations are followed by ReLU, whereas the modified depthwise separable convolution works best without an intermediate non-linearity (see the activation experiment below).
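
A minimal sketch of the modified ordering under those two differences (pointwise first, no intermediate non-linearity); again my own illustration, not the paper's code:

```python
import torch.nn as nn

class ModifiedSeparableConv(nn.Module):
    """Xception-style order: 1x1 pointwise conv first, then depthwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # cross-channel mixing first ...
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        # ... then per-channel spatial filtering; no activation in between
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=out_ch, bias=False)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))
```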

NorbertZheng commented 1 year ago

Figure: The modified depthwise separable convolution with different activation units.

Modified depthwise separable convolutions with different activation units are tested. As shown in the figure above, Xception without any intermediate activation achieves the highest accuracy, compared with the variants using either ELU or ReLU.

NorbertZheng commented 1 year ago

Overall Architecture

Figure: Overall Architecture of Xception (Entry Flow > Middle Flow > Exit Flow).

As in the figure above, SeparableConv is the modified depthwise separable convolution. We can see that SeparableConvs are treated as Inception Modules and placed throughout the whole deep learning architecture.

And there are residual (shortcut/skip) connections, originally proposed in ResNet [3], placed around the blocks in all flows.
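
As an illustration of how one entry-flow block combines SeparableConvs with a shortcut, here is a rough sketch of my own (BatchNorm and ReLU placement follow the figure only loosely; a strided 1×1 convolution matches shapes on the skip path):

```python
import torch
import torch.nn as nn

def sep_conv(in_ch, out_ch):
    # modified depthwise separable conv: pointwise first, then depthwise
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class EntryFlowBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            sep_conv(in_ch, out_ch),
            nn.ReLU(inplace=True),
            sep_conv(out_ch, out_ch),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # strided 1x1 conv so the shortcut matches the body's output shape
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False)

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

y = EntryFlowBlock(64, 128)(torch.randn(1, 64, 112, 112))
print(y.shape)  # torch.Size([1, 128, 56, 56])
```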

Figure: ImageNet: Validation Accuracy Against Gradient Descent Steps (residual vs. non-residual Xception).

As seen in the architecture, there are residual connections. Here, a non-residual version of Xception is tested as well. From the figure above, we can see that the accuracy is much higher when residual connections are used.

NorbertZheng commented 1 year ago

Comparison with State-of-the-art Results

Two datasets are tested: one is ILSVRC, the other is JFT.

ImageNet — ILSVRC

ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories.

ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, there are roughly 1.3 million training images, 50,000 validation images, and 100,000 testing images.

Figure: ImageNet: Xception has the highest accuracy.

Xception outperforms VGGNet [4], ResNet [3], and Inception-v3 [2]. (If interested, please also visit my reviews about them, ads again, lol)

It is noted that, in terms of error rate rather than accuracy, the relative improvement is not small!!!

Figure: ImageNet: Validation Accuracy Against Gradient Descent Steps.

Of course, as the figure above shows, Xception has better accuracy than Inception-v3 throughout the gradient descent steps.

But if the non-residual version is compared with Inception-v3, Xception underperforms Inception-v3. Wouldn't it be better to have a residual version of Inception-v3 for a fair comparison? Anyway:

Figure: Model Size/Complexity.

Xception is claimed to have a model size similar to that of Inception-v3.

NorbertZheng commented 1 year ago

JFT — FastEval14k

JFT is an internal Google dataset for large-scale image classification, first introduced by Prof. Hinton et al., which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes.

An auxiliary dataset, FastEval14k, is used. FastEval14k is a dataset of 14,000 images with dense annotations from about 6,000 classes (36.5 labels per image on average).

As multiple objects appear densely in a single image, mean average precision (mAP) is used for measurement.
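
The paper reports mAP@100 on FastEval14k; its exact class-weighting scheme is not reproduced here, so below is a minimal sketch of a plain (unweighted) average-precision-at-100 for a single image, with hypothetical scores and labels:

```python
import numpy as np

def average_precision_at_k(scores, relevant, k=100):
    """scores: per-class confidences; relevant: 1 for ground-truth classes, 0 otherwise."""
    top = np.argsort(-scores)[:k]                    # indices of the top-k predictions
    hits = relevant[top]                             # 1 where a prediction is correct
    if hits.sum() == 0:
        return 0.0
    precision = np.cumsum(hits) / (np.arange(len(hits)) + 1)  # precision at each rank
    return float((precision * hits).sum() / hits.sum())

# hypothetical single-image example: ~6,000 classes, ~36.5 labels on average
rng = np.random.default_rng(0)
scores = rng.random(6000)
relevant = (rng.random(6000) < 36.5 / 6000).astype(int)
print(average_precision_at_k(scores, relevant, k=100))
```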

Figure: FastEval14k: Xception has the highest mAP@100.

Figure: FastEval14k: Validation Accuracy Against Gradient Descent Steps.

Again, Xception has a higher mAP than Inception-v3.

NorbertZheng commented 1 year ago

References

[1] F. Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. CVPR 2017.
[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the Inception Architecture for Computer Vision. CVPR 2016.
[3] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
[4] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.