Sik-Ho Tang | Review: ZFNet -- Winner of ILSVRC 2013 (Image Classification).

NorbertZheng commented 1 year ago

Overview

In this story, ZFNet [1] is reviewed. ZFNet is regarded as the winner of ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2013 in image classification, with a significant improvement over AlexNet [2], the winner of ILSVRC 2012.

ILSVRC Ranking.

Some people and articles think that ZFNet is not the winner. This conclusion probably comes from the ILSVRC ranking shown above. However, Clarifai is the company founded by Zeiler, the first author of ZFNet. In addition, the paper "ImageNet Large Scale Visual Recognition Challenge" states:

“There were 24 teams participating in the ILSVRC2013 competition, compared to 21 in the previous three years combined. Following the success of the deep learning-based method in 2012, the vast majority of entries in 2013 used deep convolutional neural networks in their submission. The winner of the classification task was Clarifai, with several large deep convolutional networks averaged together. The network architectures were chosen using the visualization technique of (Zeiler and Fergus, 2013),…”

The reference (Zeiler and Fergus, 2013) cited in the above passage is the ZFNet paper. Thus, the official report ties the winning entry to ZFNet!

This is a 2014 ECCV paper with more than 4000 citations at the time of writing.

This is an important paper that teaches us how to visualize what the CNN kernels in deep layers have learned.

ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, there are roughly 1.3 million training images, 50,000 validation images and 100,000 testing images.

15 million images.

Some Facts about Ranking

ILSVRC2013 Ranking [3].

In 2013, ZFNet was invented at NYU by Dr. Rob Fergus and his PhD student at the time, Dr. Matthew D. Zeiler. (Prof. Yann LeCun, the inventor of LeNet, is also from NYU, and the authors thank him for discussions in the acknowledgements of the paper.) The network is called ZFNet after their surnames, Zeiler and Fergus, and the paper, published at ECCV 2014, is titled “Visualizing and Understanding Convolutional Networks” [1]. Strictly speaking, ZFNet is not the winner of ILSVRC 2013. Instead, Clarifai, which was a new start-up company at the time, is the winner of ILSVRC 2013 for image classification. Zeiler is also the founder and CEO of Clarifai.

As shown in the figure above,

ZFNet significantly improves the image classification error rate (by about 5%) compared with AlexNet [2], the winner of ILSVRC 2012.

Clarifai has only a small improvement over ZFNet. (For more details about the ranking, please go to [3].) Nevertheless, when we talk about the deep learning network behind the winner of ILSVRC 2013, we usually talk about ZFNet [1].

What We’ll Cover

How and why convolutional networks perform so well has long been a mystery. Most of the time, we can only reason by intuitive explanation or empirical experiment. In this story, I will cover how ZFNet visualizes the convolutional network. Guided by this visualization, ZFNet became the winner of ILSVRC 2013 in image classification by adjusting the AlexNet architecture introduced in 2012. Hence, the sections to be covered:

Deconvnet Techniques for Visualization.
Visualization for Each Layer.
Modifications of AlexNet Based on Visualization Results.
Experimental Results.
Conclusions.

Deconvnet Techniques for Visualization

As we know, a standard building block in a deep learning network is a series of

Conv > Rectification (Activation Function) > Pooling.

To visualize a deep layer feature, we need a set of deconvnet techniques that reverse the above operations so that we can visualize the feature in the pixel domain.

Unpooling

Unpooling.

The max pooling operation is non-invertible. However, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region, as in the figure above.
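The recorded-location idea can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (a single 2D feature map, non-overlapping 2x2 regions); the function names `max_pool_with_switches` and `unpool` are hypothetical and not taken from the paper.

```python
def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling on a 2D list; also records where each max was."""
    rows, cols = len(x) // size, len(x[0]) // size
    pooled = [[0.0] * cols for _ in range(rows)]
    switches = [[(0, 0)] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # All (row, col) positions inside this pooling region.
            region = [(r, c)
                      for r in range(i * size, (i + 1) * size)
                      for c in range(j * size, (j + 1) * size)]
            best = max(region, key=lambda rc: x[rc[0]][rc[1]])
            pooled[i][j] = x[best[0]][best[1]]
            switches[i][j] = best  # the recorded location of the maximum
    return pooled, switches


def unpool(pooled, switches, shape):
    """Approximate inverse: put each pooled value back at its recorded location."""
    out = [[0.0] * shape[1] for _ in range(shape[0])]
    for i, row in enumerate(pooled):
        for j, value in enumerate(row):
            r, c = switches[i][j]
            out[r][c] = value
    return out


x = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [6, 2, 3, 1]]
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches, (4, 4))
```

Only the maxima survive the round trip; every other position in `restored` is zero, which is why the inverse is approximate rather than exact.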

Rectification (Activation Function)

ReLU is used as the activation function: it keeps positive values and sets negative values to zero. In the reverse operation, we just need to apply ReLU again, so that the reconstructed feature maps also stay non-negative.
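As a tiny sketch of this step, assuming the reconstructed signal is a flat list of numbers (the values below are made up for illustration):

```python
def relu(values):
    """Keep positive values, replace negative values with zero."""
    return [max(0.0, v) for v in values]


# Hypothetical reconstructed signal arriving from the layer above.
reconstruction = [-1.5, 0.0, 2.0, -0.3, 4.1]
rectified = relu(reconstruction)
```

The same function serves both the forward pass and the reverse pass, which is the point of this step.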

Deconv

Conv (Blue is input, cyan is output).

Deconv (Blue is input, cyan is output).

The deconv operation is in fact a transposed version of conv: it applies the same learned filters, transposed, to map feature maps back towards the input.
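One way to see the "transposed" relationship is to write a small 1D convolution as a matrix and multiply by its transpose. The sketch below assumes a single channel, no padding, and a made-up kernel; the helper names are hypothetical.

```python
def conv_matrix(kernel, input_len, stride=1):
    """Matrix C such that C applied to x is the strided, unpadded 1D conv of x."""
    k = len(kernel)
    out_len = (input_len - k) // stride + 1
    matrix = [[0.0] * input_len for _ in range(out_len)]
    for i in range(out_len):
        for j, w in enumerate(kernel):
            matrix[i][i * stride + j] = w
    return matrix


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


kernel = [1.0, 2.0, 3.0]             # illustrative filter weights
x = [1.0, 0.0, 2.0, 1.0, 0.0]        # illustrative input signal
C = conv_matrix(kernel, len(x))
y = matvec(C, x)                     # conv: length 5 -> length 3
back = matvec(transpose(C), y)       # transposed conv: length 3 -> length 5
```

Note that `back` has the shape of the input but is not equal to `x`: the transposed operation projects activations back to input space, it does not undo the convolution.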

Visualization for Each Layer

Layer 1 and Layer 2.

By using the deconv techniques, the top 9 activated patterns in randomly selected feature maps are shown for each layer. Two problems are observed in layer 1 and layer 2.

Filters at layer 1 are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Without the mid frequencies, there is a chain effect: deeper features can only learn from extremely high and low frequency information.
Layer 2 shows aliasing artifacts caused by the large stride of 4 used in the 1st layer convolutions. Aliasing occurs when the sampling frequency is too low.
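The aliasing effect of a large stride can be illustrated with a 1D toy signal. This is only a sketch: the frequencies 0.2 and 0.05 cycles per pixel are made-up values chosen so the arithmetic is exact, and a plain cosine stands in for image content.

```python
import math


def sample(freq, stride, count):
    """Sample cos(2*pi*freq*t) at t = 0, stride, 2*stride, ..."""
    return [math.cos(2 * math.pi * freq * stride * m) for m in range(count)]


def same(a, b, tol=1e-9):
    """True if two sample sequences are numerically indistinguishable."""
    return all(abs(p - q) < tol for p, q in zip(a, b))


high, low = 0.2, 0.05  # cycles per pixel (illustrative values)

# With stride 4 the highest representable frequency is 1 / (2 * 4) = 0.125,
# so the 0.2 signal cannot be told apart from the 0.05 signal.
aliased_at_4 = same(sample(high, 4, 50), sample(low, 4, 50))

# With stride 2 the limit is 1 / (2 * 2) = 0.25, so 0.2 is still resolved.
aliased_at_2 = same(sample(high, 2, 50), sample(low, 2, 50))
```

Here `aliased_at_4` is true and `aliased_at_2` is false: at stride 4 the higher-frequency pattern is indistinguishable from a lower-frequency one, while stride 2 keeps them apart.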

Layer 3.

Let us observe 3 more layers.

Layer 3 starts to learn some general patterns, such as mesh patterns and text patterns.

Layer 4 and Layer 5.

Layer 4 shows significant variation, and is more class-specific, such as dogs’ faces and birds’ legs.
Layer 5 shows entire objects with significant pose variation, such as keyboards and dogs.

Modifications of AlexNet Based on Visualization Results

ZFNet.

ZFNet is redrawn in the same style as AlexNet for ease of comparison. To solve the two problems observed in layer 1 and layer 2, ZFNet makes two changes. (To read the AlexNet review, please visit [4].)

Reduced the 1st layer filter size from 11x11 to 7x7.
Made the 1st layer stride of the convolution 2, rather than 4.

Layer 1: (a) More mid-frequencies in ZFNet, (b) Extremely low and high frequencies in AlexNet.

Layer 2: (c) Aliasing artifacts in AlexNet and (d) much cleaner features in ZFNet.

Experimental Results

The Modified ZFNet based on Ablation Study

Ablation Study.

The Modified ZFNet based on Ablation Study.

There is also an ablation study on removing or adjusting layers. The modified ZFNet obtains a top-5 validation error of 16.0%.

Comparison with State-of-the-art Approaches

Error Rate (%).

With AlexNet, the top-5 validation error rate is 18.1%.
With ZFNet, the top-5 validation error rate is 16.5%. We can conclude that the modifications based on the visualization are effective.
By combining 5 ZFNets from (a) and 1 modified ZFNet from (b), the top-5 validation error rate is 14.7%. This is again a kind of ensemble technique, in the same spirit as the boosting and model averaging already used in LeNet and AlexNet. (Please visit [5] and [4] for more about these techniques.)
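A rough sketch of how several networks can be combined is given below. It assumes the models' class probabilities are simply averaged, which matches the "averaged together" wording quoted earlier but is not confirmed here as the exact procedure; the probability values are made up for illustration.

```python
def average_probabilities(per_model_probs):
    """Average the class-probability vectors produced by several models."""
    n_models = len(per_model_probs)
    return [sum(col) / n_models for col in zip(*per_model_probs)]


def top_k(probs, k=5):
    """Indices of the k most probable classes, best first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


# Made-up outputs of three hypothetical models over four classes.
model_probs = [
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
]
ensemble = average_probabilities(model_probs)
prediction = top_k(ensemble, k=1)[0]
```

The first model alone would pick class 0, but the averaged prediction is class 1. A top-5 error counts a sample as wrong only if the true label is missing from `top_k(ensemble, k=5)`.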

Other relatively small datasets are also tested

Caltech 101 (83.8 to 86.5 mean accuracy).

Caltech 256 (65.7 to 74.2 mean accuracy).

PASCAL 2012 (79.0 mean accuracy).

From the above tables, we can see that the accuracy is low when ZFNet is trained from scratch, i.e. without pre-training on ImageNet images. With training (fine-tuning) on top of the pre-trained ZFNet, the accuracy is much higher. That means

the trained filters generalize to different images, not just to ImageNet images.

Particularly for the Caltech 101 and Caltech 256 datasets, ZFNet achieves overwhelming results.

For PASCAL 2012, the images can contain multiple objects and are quite different in nature from those in ImageNet. Thus, the accuracy is a bit lower but still competitive with state-of-the-art approaches.

Conclusions

While previously only shallow layer features could be observed, this paper provides an interesting approach to observing deep features in the pixel domain.

By visualizing the convolutional network layer by layer, ZFNet adjusts layer hyperparameters of AlexNet, such as the filter size and stride, and successfully reduces the error rates.

References

[1] [2014 ECCV] [ZFNet] Visualizing and Understanding Convolutional Networks.
[2] [2012 NIPS] [AlexNet] ImageNet Classification with Deep Convolutional Neural Networks.
[3] ILSVRC 2013 Ranking. http://www.image-net.org/challenges/LSVRC/2013/results.php#cls.
[4] Review of AlexNet, CaffeNet -- Winner of ILSVRC 2012 (Image Classification).
[5] Review of LeNet-1, LeNet-4, LeNet-5, Boosted LeNet-4 (Image Classification).
