In this story, Xception [1] by Google, which stands for "Extreme Inception," is reviewed. With a modified depthwise separable convolution, it outperforms Inception-v3 [2] (also by Google, 1st runner-up in ILSVRC 2015) on both the ImageNet ILSVRC and JFT datasets. Though it is a 2017 CVPR paper published just last year, it had already gathered more than 300 citations when I was writing this story.
Original Depthwise Separable Convolution.
The original depthwise separable convolution consists of a depthwise convolution followed by a pointwise (1×1) convolution.
Compared with a conventional convolution, we do not need to perform convolution across all channels at once. That means the number of connections is smaller and the model is lighter.
The original depthwise separable convolution also differs from the conventional convolution in that its depthwise filters are shared across multiple output feature maps: each depthwise output is reused by every pointwise filter. In a conventional convolution, one convolution filter generates exactly one final feature map.
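To make the savings concrete, here is a minimal parameter-count comparison in plain Python (the layer sizes are chosen for illustration, not taken from the paper):

```python
def conv_params(k, c_in, c_out):
    # Conventional convolution: every one of the c_out filters
    # spans all c_in input channels.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel (no cross-channel mixing),
    # then pointwise: a 1 x 1 convolution that mixes channels.
    depthwise = k * k * c_in
    pointwise = 1 * 1 * c_in * c_out
    return depthwise + pointwise

# Example: a 3x3 convolution mapping 256 -> 256 channels.
conventional = conv_params(3, 256, 256)
separable = depthwise_separable_params(3, 256, 256)
print(conventional, separable)  # 589824 67840
```

With these illustrative sizes, the separable version uses roughly 9× fewer weights, which is where the "lighter model" claim comes from.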
The modified depthwise separable convolution used as an Inception module in Xception, the so-called "extreme" version of the Inception module (n=3 here).
This modification is motivated by the Inception module in Inception-v3, where the 1×1 convolution is done first, before any n×n spatial convolutions. Thus, it is a bit different from the original one. (n=3 here, since 3×3 spatial convolutions are used in Inception-v3.)
Two minor differences:
1. The order of operations: the modified version performs the 1×1 pointwise convolution first and the channel-wise n×n spatial convolution afterwards, while the original performs the depthwise convolution first.
2. The presence or absence of a non-linearity: in Inception, both operations are followed by a ReLU, whereas the modified depthwise separable convolution can be implemented without an intermediate non-linearity.
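The ordering difference can be sketched in plain Python on tiny nested-list "tensors" (a toy illustration with valid padding and hand-picked weights, not the paper's implementation):

```python
def depthwise_conv(x, kernels):
    # x: H x W x C nested list; kernels: one k x k kernel per channel.
    # Each channel is convolved independently (no cross-channel mixing).
    H, W, C = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    return [[[sum(x[i + di][j + dj][c] * kernels[c][di][dj]
                  for di in range(k) for dj in range(k))
              for c in range(C)]
             for j in range(W - k + 1)]
            for i in range(H - k + 1)]

def pointwise_conv(x, weights):
    # weights: C_out x C_in; a 1x1 convolution that mixes channels
    # at every spatial position.
    return [[[sum(w[c] * pix[c] for c in range(len(pix))) for w in weights]
             for pix in row] for row in x]

# 3x3 input with 2 channels; one all-ones 3x3 kernel per channel.
x = [[[1, 2]] * 3] * 3
kernels = [[[1] * 3] * 3] * 2

# Original order: depthwise first, then pointwise (here summing both channels).
y = pointwise_conv(depthwise_conv(x, kernels), [[1, 1]])
print(y)  # [[[27]]]

# Modified (Xception) order: pointwise first, then the spatial convolution,
# with no intermediate non-linearity between the two; the number of depthwise
# kernels must now match the pointwise output channels.
z = depthwise_conv(pointwise_conv(x, [[1, 1]]), [[[1] * 3] * 3])
print(z)  # [[[27]]]
```

With these linear toy weights the two orderings happen to agree; with a non-linearity inserted between the operations they generally would not, which is exactly the second difference above.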
The modified depthwise separable convolution with different activation units.
The modified depthwise separable convolution is tested with different activation units. As seen in the above figure, the Xception without any intermediate activation has the highest accuracy, compared with the versions using either ELU or ReLU.
Overall Architecture of Xception (Entry Flow > Middle Flow > Exit Flow).
As in the figure above, SeparableConv is the modified depthwise separable convolution. We can see that SeparableConvs are treated as Inception Modules and placed throughout the whole deep learning architecture.
And there are residual (or shortcut/skip) connections, originally proposed by ResNet [3], in all three flows.
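A residual connection simply adds a block's input, possibly projected by a shortcut (in Xception, a 1×1 convolution when the main path changes the feature-map size), to the block's output. A toy sketch with stand-in functions:

```python
def residual_block(x, f, shortcut=lambda v: v):
    # Apply transformation f and add the shortcut branch element-wise.
    # The shortcut defaults to the identity; a projection function can be
    # passed when f changes the feature dimensions.
    return [a + b for a, b in zip(f(x), shortcut(x))]

# Toy example: the main path doubles each value; the shortcut is the identity.
out = residual_block([1.0, 2.0, 3.0], lambda v: [2 * a for a in v])
print(out)  # [3.0, 6.0, 9.0]
```

The addition gives the gradient a direct path around the block, which is the usual explanation for the accuracy gap shown in the figure below.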
ImageNet: Validation Accuracy Against Gradient Descent Steps.
As seen in the architecture, there are residual connections. Here, a non-residual version of Xception is tested. From the above figure, we can see that the accuracy is much higher when residual connections are used.
Two datasets are tested: ILSVRC and JFT.
ImageNet is a dataset of over 15 million labeled high-resolution images with around 22,000 categories.
ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, there are roughly 1.3 million training images, 50,000 validation images, and 100,000 testing images.
ImageNet: Xception has the highest accuracy.
Xception outperforms VGGNet [4], ResNet [3], and Inception-v3 [2]. (If interested, please also visit my reviews about them, ads again, lol)
Note that in terms of error rate, rather than accuracy, the relative improvement is not small.
ImageNet: Validation Accuracy Against Gradient Descent Steps.
Of course, from the above figure, Xception achieves better accuracy than Inception-v3 throughout the gradient descent steps.
But if the non-residual version is compared with Inception-v3, Xception underperforms. Would it not be better to have a residual version of Inception-v3 for a fair comparison? Anyway,
Model Size/Complexity.
Xception is claimed to have a model size similar to that of Inception-v3.
JFT is an internal Google dataset for large-scale image classification, first introduced by Prof. Hinton et al., which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes.
An auxiliary dataset, FastEval14k, is used. FastEval14k is a dataset of 14,000 images with dense annotations from about 6,000 classes (36.5 labels per image on average).
As multiple objects appear densely in a single image, mean average precision (mAP) is used for measurement.
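For reference, mAP averages a per-class average precision (AP) over all classes. A minimal sketch of the standard definition (as I understand it, the paper's MAP@100 additionally considers only the top 100 predictions per image and weights classes by importance):

```python
def average_precision(ranked_relevance):
    # ranked_relevance: list of 0/1 flags for predictions sorted by
    # descending score. AP averages precision@k over the positions k
    # where a relevant item occurs.
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class_rankings):
    # mAP: the unweighted mean of per-class AP scores.
    aps = [average_precision(r) for r in per_class_rankings]
    return sum(aps) / len(aps)

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833
```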
FastEval14k: Xception has highest mAP@100.
FastEval14k: Validation Accuracy Against Gradient Descent Steps.
Again, Xception has higher mAP compared with Inception-v3.
Sik-Ho Tang. Review: Xception — With Depthwise Separable Convolution, Better Than Inception-v3 (Image Classification).