NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review: Xception -- With Depthwise Separable Convolution, Better Than Inception-v3 (Image Classification). #113


NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: Xception — With Depthwise Separable Convolution, Better Than Inception-v3 (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, Xception [1], by Google, which stands for "Extreme Inception", is reviewed. With a modified depthwise separable convolution, it is even better than Inception-v3 [2] (also by Google, 1st runner-up in ILSVRC 2015) on both the ImageNet ILSVRC and JFT datasets. Though it is a 2017 CVPR paper that had only just been published, it already had more than 300 citations when this story was written.

NorbertZheng commented 1 year ago

Original Depthwise Separable Convolution

Figure: Original Depthwise Separable Convolution.

The original depthwise separable convolution is a depthwise convolution followed by a pointwise convolution.

Compared with a conventional convolution, we do not need to perform the convolution across all channels at once. That means the number of connections is smaller and the model is lighter.
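
As an illustration (a minimal PyTorch sketch of my own, not the authors' code; module and variable names are mine), the original ordering looks like this:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Original order: depthwise (per-channel) conv, then 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch: each spatial filter sees only its own input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_ch, bias=False)
        # the 1x1 conv then mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = DepthwiseSeparableConv(64, 128)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 128, 56, 56])
```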

NorbertZheng commented 1 year ago

Conventional Convolution vs. Depthwise Convolution.

Figure: Conventional Convolution vs. Depthwise Convolution.

The original depthwise separable convolution differs from the conventional convolution in that its spatial (depthwise) filters are shared across the output feature maps, with the pointwise convolution mixing them afterwards. In the conventional convolution, one filter spans all input channels and generates exactly one final feature map!!!
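
To see why this is lighter, a quick back-of-the-envelope count (my own snippet, not from the paper): for M input channels, N output channels, and a k×k kernel (bias ignored), a conventional convolution holds k²·M·N weights, while the depthwise separable factorization holds only k²·M + M·N:

```python
import torch.nn as nn

def count(m):
    return sum(p.numel() for p in m.parameters())

M, N, k = 256, 256, 3
conv = nn.Conv2d(M, N, k, padding=1, bias=False)                 # k*k*M*N weights
depthwise = nn.Conv2d(M, M, k, padding=1, groups=M, bias=False)  # k*k*M weights
pointwise = nn.Conv2d(M, N, kernel_size=1, bias=False)           # M*N weights
print(count(conv), count(depthwise) + count(pointwise))          # 589824 vs. 67840
```

With M = N = 256 and k = 3, that is 589,824 vs. 67,840 weights, roughly 8.7× fewer.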

NorbertZheng commented 1 year ago

Modified Depthwise Separable Convolution in Xception

Figure: The modified depthwise separable convolution used as an Inception module in Xception, the so-called "extreme" version of the Inception module (n=3 here).

This modification is motivated by the Inception module in Inception-v3, where the 1×1 convolution is performed before any n×n spatial convolutions. The modified version is thus a bit different from the original one. (n=3 here, since 3×3 spatial convolutions are used in Inception-v3.)

Two minor differences from the original depthwise separable convolution:

1. The order of operations: the modified version performs the 1×1 (pointwise) convolution first, then the channel-wise n×n spatial convolutions, whereas the original performs the depthwise convolution first.
2. The presence or absence of a non-linearity after the first operation: in Inception, both operations are followed by ReLU, whereas the modified depthwise separable convolution works best without an intermediate non-linearity (see the activation experiment below).
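
A minimal sketch of the modified ordering under those two differences (pointwise first, no intermediate non-linearity); again my own illustration, not the paper's code:

```python
import torch.nn as nn

class ModifiedSeparableConv(nn.Module):
    """Xception-style order: 1x1 pointwise conv first, then depthwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # cross-channel mixing first ...
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        # ... then per-channel spatial filtering; no activation in between
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=out_ch, bias=False)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))
```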

NorbertZheng commented 1 year ago

Figure: The modified depthwise separable convolution with different activation units.

Modified depthwise separable convolutions with different activation units are tested. As shown in the figure above, Xception without any intermediate activation achieves the highest accuracy, compared with the variants using either ELU or ReLU.

NorbertZheng commented 1 year ago

Overall Architecture

Figure: Overall Architecture of Xception (Entry Flow > Middle Flow > Exit Flow).

As in the figure above, SeparableConv is the modified depthwise separable convolution. We can see that SeparableConvs are treated as Inception Modules and placed throughout the whole deep learning architecture.

And there are residual (shortcut/skip) connections, originally proposed in ResNet [3], placed around the blocks in all flows.
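
As an illustration of how one entry-flow block combines SeparableConvs with a shortcut, here is a rough sketch of my own (BatchNorm and ReLU placement follow the figure only loosely; a strided 1×1 convolution matches shapes on the skip path):

```python
import torch
import torch.nn as nn

def sep_conv(in_ch, out_ch):
    # modified depthwise separable conv: pointwise first, then depthwise
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class EntryFlowBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            sep_conv(in_ch, out_ch),
            nn.ReLU(inplace=True),
            sep_conv(out_ch, out_ch),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # strided 1x1 conv so the shortcut matches the body's output shape
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False)

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

y = EntryFlowBlock(64, 128)(torch.randn(1, 64, 112, 112))
print(y.shape)  # torch.Size([1, 128, 56, 56])
```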

Figure: ImageNet: Validation Accuracy Against Gradient Descent Steps (residual vs. non-residual Xception).

As seen in the architecture, there are residual connections. Here, a non-residual version of Xception is tested as well. From the figure above, we can see that the accuracy is much higher when residual connections are used.

NorbertZheng commented 1 year ago

Comparison with State-of-the-art Results

Two datasets are tested: one is ILSVRC, the other is JFT.

ImageNet — ILSVRC

ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories.

ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, there are roughly 1.3 million training images, 50,000 validation images, and 100,000 testing images.

Figure: ImageNet: Xception has the highest accuracy.

Xception outperforms VGGNet [4], ResNet [3], and Inception-v3 [2]. (If interested, please also visit my reviews about them, ads again, lol)

It is noted that, in terms of error rate rather than accuracy, the relative improvement is not small!!!

Figure: ImageNet: Validation Accuracy Against Gradient Descent Steps.

Of course, as the figure above shows, Xception has better accuracy than Inception-v3 throughout the gradient descent steps.

But if the non-residual version is compared with Inception-v3, Xception underperforms Inception-v3. Wouldn't it be better to have a residual version of Inception-v3 for a fair comparison? Anyway:

Figure: Model Size/Complexity.

Xception is claimed to have a model size similar to that of Inception-v3.

NorbertZheng commented 1 year ago

JFT — FastEval14k

JFT is an internal Google dataset for large-scale image classification, first introduced by Prof. Hinton et al., which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes.

An auxiliary dataset, FastEval14k, is used. FastEval14k is a dataset of 14,000 images with dense annotations from about 6,000 classes (36.5 labels per image on average).

As multiple objects appear densely in a single image, mean average precision (mAP) is used for measurement.
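
The paper reports mAP@100 on FastEval14k; its exact class-weighting scheme is not reproduced here, so below is a minimal sketch of a plain (unweighted) average-precision-at-100 for a single image, with hypothetical scores and labels:

```python
import numpy as np

def average_precision_at_k(scores, relevant, k=100):
    """scores: per-class confidences; relevant: 1 for ground-truth classes, 0 otherwise."""
    top = np.argsort(-scores)[:k]                    # indices of the top-k predictions
    hits = relevant[top]                             # 1 where a prediction is correct
    if hits.sum() == 0:
        return 0.0
    precision = np.cumsum(hits) / (np.arange(len(hits)) + 1)  # precision at each rank
    return float((precision * hits).sum() / hits.sum())

# hypothetical single-image example: ~6,000 classes, ~36.5 labels on average
rng = np.random.default_rng(0)
scores = rng.random(6000)
relevant = (rng.random(6000) < 36.5 / 6000).astype(int)
print(average_precision_at_k(scores, relevant, k=100))
```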

Figure: FastEval14k: Xception has the highest mAP@100.

Figure: FastEval14k: Validation Accuracy Against Gradient Descent Steps.

Again, Xception has a higher mAP than Inception-v3.

NorbertZheng commented 1 year ago

References

[1] F. Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. CVPR 2017.
[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the Inception Architecture for Computer Vision. CVPR 2016.
[3] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
[4] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.