In this story, Maxout Network, by Université de Montréal, is briefly reviewed. Ian J. Goodfellow, the first author, is also the inventor of the Generative Adversarial Network (GAN). And Yoshua Bengio, the last author, just received the Turing Award this year (2019), which is the “Nobel Prize of computing”. These two authors, together with the second-to-last author Aaron Courville, also published the book “Deep Learning” through MIT Press in 2016. The Maxout Network paper itself was published at ICML 2013 and has over 1500 citations.
Figure: An MLP containing two maxout units.
Given an input $x\in\mathbb{R}^{d}$, each maxout hidden unit first computes $k$ affine pre-activations, with learned parameters $W\in\mathbb{R}^{d\times m\times k}$ and $b\in\mathbb{R}^{m\times k}$:

$$ z_{ij}=x^{T}W_{\cdots ij}+b_{ij}, $$

and then outputs the maximum over them:

$$ h_{i}(x)=\max_{j\in[1,k]}z_{ij}. $$

A network of two maxout units, as in the MLP above, computes their difference:

$$ g(v)=h_{1}(v)-h_{2}(v). $$
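As a concrete sketch of such a layer (a minimal PyTorch implementation written for this review; the class name `Maxout` and its arguments are illustrative, not from the paper’s released code):

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """A maxout layer with m output units, each taking the max over k affine maps."""

    def __init__(self, in_features: int, out_features: int, k: int):
        super().__init__()
        self.out_features = out_features
        self.k = k
        # A single affine map produces all m*k pre-activations z_ij at once,
        # playing the role of W in R^{d x m x k} and b in R^{m x k}.
        self.affine = nn.Linear(in_features, out_features * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.affine(x)                          # z_ij, shape (batch, m*k)
        z = z.view(-1, self.out_features, self.k)   # shape (batch, m, k)
        return z.max(dim=-1).values                 # h_i(x) = max_j z_ij
```

Note that ReLU is the special case $k=2$ with one of the two affine maps fixed to zero, so maxout learns its activation function rather than fixing it in advance.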
The philosophy behind this is that maxout networks are universal approximators: any continuous function can be approximated arbitrarily well by the difference of two convex piecewise-linear functions, and $g(v)=h_{1}(v)-h_{2}(v)$, the difference of two maxout units, is exactly such a difference.
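To make this concrete, here is a small NumPy check (an illustrative sketch written for this review, not code from the paper) showing that the difference of two convex maxout units reproduces a non-convex piecewise-linear “hat” function exactly:

```python
import numpy as np

v = np.linspace(-2.0, 2.0, 401)

# Each maxout unit is a max over affine functions of v, hence convex.
h1 = np.max(np.stack([np.zeros_like(v), v + 1.0, 2.0 * v]), axis=0)
h2 = np.max(np.stack([np.zeros_like(v), 2.0 * v]), axis=0)

# Their difference g(v) = h1(v) - h2(v) is non-convex: it equals the
# hat function max(0, 1 - |v|) everywhere on the grid.
g = h1 - h2
assert np.allclose(g, np.maximum(0.0, 1.0 - np.abs(v)))
```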
Tables: Test error on permutation-invariant MNIST, MNIST, CIFAR-10, CIFAR-100, and SVHN, where the paper reports state-of-the-art performance at the time of publication.
It is interesting to read through the paper to see how the authors prove the propositions above. There are also ablation studies at the end of the paper comparing the Maxout Network against other activation functions such as tanh and ReLU.
Sik-Ho Tsang. Review: Maxout Network (Image Classification).