In this story, ZFNet [1] is reviewed. ZFNet is regarded as the winner of ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2013, an image classification competition, and it improves significantly over AlexNet [2], the winner of ILSVRC 2012.
ILSVRC Ranking.
Some people and articles claim that ZFNet is not the winner. This conclusion probably comes from the ILSVRC ranking shown above. However, Clarifai is the company founded by Zeiler, an author of ZFNet. In addition, "ImageNet Large Scale Visual Recognition Challenge" mentions:
The reference (Zeiler and Fergus, 2013) cited in the above passage is ZFNet. Thus, ZFNet is officially recognized as the winner!
This is a 2014 ECCV paper with more than 4000 citations at the time of writing this story.
ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, there are roughly 1.3 million training images, 50,000 validation images and 100,000 testing images.
15 million images.
ILSVRC2013 Ranking [3].
In 2013, ZFNet was invented by Dr. Rob Fergus and his PhD student at the time, Dr. Matthew D. Zeiler, at NYU. (Prof. Yann LeCun, the inventor of LeNet, is also from NYU, and the authors thank him for discussions in the acknowledgements of the paper.) That’s why it is called ZFNet, after their surnames, Zeiler and Fergus. The paper, published at ECCV 2014, is titled “Visualizing and Understanding Convolutional Networks” [1]. Strictly speaking, ZFNet is not the winner of ILSVRC 2013. Instead, Clarifai, a new start-up company at the time, is the winner of ILSVRC 2013 for image classification. Zeiler is also the founder and CEO of Clarifai.
As in the figure above,
Clarifai has only a small improvement over ZFNet. (For more details about the ranking, please go to [3].) Nevertheless, when we talk about the deep learning network behind the winner of ILSVRC 2013, we usually mean ZFNet [1].
How and why convolutional networks perform so well has long been a mystery. Most of the time, we can only reason through intuitive explanation or empirical experiment. In this story, I will cover how ZFNet visualizes the convolutional network. Guided by these visualizations, ZFNet became the winner of ILSVRC 2013 in image classification by fine-tuning the AlexNet architecture introduced in 2012. Hence, the sections to be covered:
As we should know, a standard step in a deep learning framework is to apply a series of convolution, rectification (activation function) and pooling operations.
To visualize a deep layer feature, we need a set of deconvnet techniques to reverse the above actions so that we can visualize the feature in the pixel domain.
Unpooling.
The max pooling operation is non-invertible. However, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region, as in the figure above.
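The idea of recording the maxima locations can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `max_pool_with_switches` and `unpool` are hypothetical, pooling regions are assumed to be non-overlapping, and the input height and width are assumed to be divisible by the pooling size.

```python
def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling on a 2-D list.

    Returns the pooled map and, for each region, the (row, col)
    position of its maximum (the "switch").
    """
    rows, cols = len(x), len(x[0])
    pooled, switches = [], []
    for r in range(0, rows, size):
        pooled_row, switch_row = [], []
        for c in range(0, cols, size):
            # Coordinates of every cell inside this pooling region.
            region = [(i, j) for i in range(r, r + size)
                      for j in range(c, c + size)]
            best = max(region, key=lambda ij: x[ij[0]][ij[1]])
            pooled_row.append(x[best[0]][best[1]])
            switch_row.append(best)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, shape):
    """Approximate inverse: put each pooled value back at its recorded
    position and leave every other position at zero."""
    rows, cols = shape
    out = [[0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, switches = max_pool_with_switches(feature_map)
restored = unpool(pooled, switches, (4, 4))
print(pooled)    # [[4, 5], [6, 7]]
print(restored)  # only the four maxima survive, at their original positions
```

The restored map is sparse: everything that was not a regional maximum is lost, which is why the inverse is only approximate.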
ReLU is used as the activation function. It keeps positive values and sets negative values to zero. In the reverse operation, we simply apply ReLU again.
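As a tiny illustration of this step, the same rectification can be applied to a reconstructed signal on the way back down. The values in `reconstruction` are made up for the example.

```python
def relu(values):
    """Keep positive entries and clamp negative ones to zero."""
    return [max(0.0, v) for v in values]


# Hypothetical reconstructed signal arriving from the layer above.
reconstruction = [0.5, -1.2, 0.0, 2.0]
rectified = relu(reconstruction)
print(rectified)  # [0.5, 0.0, 0.0, 2.0]
```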
Conv (Blue is input, cyan is output).
Deconv (Blue is input, cyan is output).
The deconv operation is simply a transposed version of conv.
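To make "transposed version" concrete, here is a 1-D sketch under simplifying assumptions: a single channel, stride 1, no padding, and the sliding-window (cross-correlation) form of convolution used by most deep learning libraries. The convolution is written as a matrix `C`, and the deconv is a multiplication by its transpose. The helper names are hypothetical.

```python
def conv_matrix(kernel, input_len):
    """Matrix whose product with an input vector equals a stride-1,
    unpadded sliding-window convolution with `kernel`."""
    k = len(kernel)
    out_len = input_len - k + 1
    return [[kernel[j - i] if 0 <= j - i < k else 0
             for j in range(input_len)]
            for i in range(out_len)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


kernel = [1, 2, 3]
x = [1, 0, 2, 1, 0]

C = conv_matrix(kernel, len(x))   # shape 3 x 5
y = matvec(C, x)                  # forward conv: length 5 -> 3
back = matvec(transpose(C), y)    # transposed conv: length 3 -> 5

print(y)     # [7, 7, 4]
print(back)  # [7, 21, 39, 29, 12]
```

Note that `back` has the same length as `x` but is not equal to it: the transpose maps a feature map back into input space using the same filter weights, it does not undo the convolution exactly.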
Layer 1 and Layer 2.
Using deconv techniques, the top 9 activated patterns in randomly selected feature maps are shown for each layer. Two problems are observed in layer 1 and layer 2.
Layer 3.
Let us observe 3 more layers.
Layer 4 and Layer 5.
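The "top 9" selection behind these visualizations can be sketched as a simple ranking. This is a minimal sketch with made-up numbers: `activations` is assumed to hold the strongest response of one feature map for each of 12 images, and `top_k_indices` is a hypothetical helper.

```python
def top_k_indices(activations, k=9):
    """Indices of the k largest activations, strongest first."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    return order[:k]


# Hypothetical peak activation of a single feature map on 12 images.
activations = [0.1, 2.5, 0.0, 1.7, 3.2, 0.4, 0.9, 2.1, 1.1, 0.3, 2.8, 0.6]
top9 = top_k_indices(activations)
print(top9)  # [4, 10, 1, 7, 3, 8, 6, 11, 5]
```

Each selected image would then be passed through the deconvnet to project that activation back to the pixel domain.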
ZFNet.
ZFNet is redrawn in the same style as AlexNet for ease of comparison. To solve the two problems observed in layer 1 and layer 2, ZFNet makes two changes. (To read the AlexNet review, please visit [4].)
Layer 1: (a) More mid-frequencies in ZFNet, (b) Extremely low and high frequencies in AlexNet.
Layer 2: (c) Aliasing artifacts in AlexNet and (d) much cleaner features in ZFNet.
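In the ZFNet paper [1], the two changes are a smaller first-layer filter (7×7 instead of 11×11) and a smaller first-layer stride (2 instead of 4). Their effect on the feature map size can be checked with the usual output-size formula. The input sizes and padding below are assumptions, chosen so that the results match the commonly reported 55×55 (AlexNet) and 110×110 (ZFNet) first-layer maps.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1


# Assumed settings: 227-pixel input without padding for AlexNet,
# 224-pixel input with 1 pixel of padding for ZFNet.
alexnet_layer1 = conv_output_size(227, kernel=11, stride=4)
zfnet_layer1 = conv_output_size(224, kernel=7, stride=2, padding=1)

print(alexnet_layer1)  # 55
print(zfnet_layer1)    # 110
```

The smaller stride samples the input twice as densely, so the first-layer feature map is about twice as large in each dimension.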
Ablation Study.
The Modified ZFNet based on Ablation Study.
There is also an ablation study on removing or adjusting layers. The modified ZFNet obtains a top-5 validation error of 16.0%.
Error Rate (%).
Caltech 101 (83.8 to 86.5 mean accuracy).
Caltech 256 (65.7 to 74.2 mean accuracy).
PASCAL 2012 (79.0 mean accuracy).
From the above tables, we can see that the accuracy without pre-training ZFNet on ImageNet images, i.e. training ZFNet from scratch, is low. With training (fine-tuning) on top of the pre-trained ZFNet, the accuracy is much higher. That means
Particularly for Caltech 101 and Caltech 256 datasets, ZFNet has overwhelming results.
For PASCAL 2012, the images can contain multiple objects and are quite different in nature from those in ImageNet. Thus, the accuracy is a bit lower but still competitive with state-of-the-art approaches.
While only shallow layer features could be observed previously, this paper provides an interesting approach to observing deep features in the pixel domain.
By visualizing the convolutional network layer by layer, ZFNet adjusts layer hyperparameters of AlexNet, such as filter size and stride, and successfully reduces the error rates.
Sik-Ho Tang. Review: ZFNet — Winner of ILSVRC 2013 (Image Classification).