Exemplar-CNN: Trained on Unlabeled Data Using Surrogate Classes Generated by Data Transformation. Surrogate classes are created from unlabeled data by applying data transformations.
In this story, Discriminative Unsupervised Feature Learning with Convolutional Neural Networks (Exemplar-CNN), by the University of Freiburg, is reviewed. This is a 2014 NIPS paper with over 600 citations. In this paper:
Random transformations are applied to patches. All transformed patches from the same original “seed” image share the same surrogate class as that “seed” image.
If there are 8000 “seed” images, then there are 8000 surrogate classes.
Data Augmentation in SimCLR???
$N\in [50, 32000]$ patches of size $32\times 32$ pixels are randomly sampled from different images at varying positions and scales, forming the initial training set $X=\{x_{1}, \dots, x_{N}\}$.
We are interested in patches containing objects or parts of objects, hence we sample only from regions containing considerable gradients (a kind of prior!!!).
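As a rough illustration, here is a minimal NumPy sketch of such gradient-biased patch sampling. The threshold `grad_thresh` and the rejection-sampling scheme are assumptions for illustration; the paper only states that patches are sampled from regions with considerable gradients (scale variation is omitted for brevity).

```python
import numpy as np

def sample_patch(image, patch_size=32, grad_thresh=10.0, max_tries=100):
    """Sample a patch from a region with considerable image gradients.

    Illustrative sketch: reject candidate positions until the mean
    gradient magnitude inside the window exceeds `grad_thresh`.
    """
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    gy, gx = np.gradient(gray)
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    h, w = gray.shape
    for _ in range(max_tries):
        y = np.random.randint(0, h - patch_size + 1)
        x = np.random.randint(0, w - patch_size + 1)
        if grad_mag[y:y + patch_size, x:x + patch_size].mean() > grad_thresh:
            break
    return image[y:y + patch_size, x:x + patch_size]
```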
A family of transformations $\{T_{\alpha} \mid \alpha\in A\}$ is defined, parameterized by vectors $\alpha\in A$, where $A$ is the set of all possible parameter vectors. Each transformation $T_{\alpha}$ is a composition of elementary transformations: scaling, rotation, translation, color variation, and contrast variation.
For each initial patch $x_{i}\in X$, $K\in [1, 300]$ random parameter vectors $\{\alpha_{i}^{1}, \dots, \alpha_{i}^{K}\}$ are sampled,
and the corresponding transformations $\mathcal{T}_{i}=\{T_{\alpha_{i}^{1}}, \dots, T_{\alpha_{i}^{K}}\}$ are applied to the patch $x_{i}$ (in brief, random transformations are applied to each patch).
This yields the set of its transformed versions $S_{x_{i}}=\mathcal{T}_{i}x_{i}=\{Tx_{i} \mid T\in \mathcal{T}_{i}\}$.
Afterwards, the mean of each pixel over the whole resulting dataset is subtracted; no other preprocessing is applied.
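A minimal PyTorch/torchvision sketch of the surrogate-data construction described above. The elementary transformations are approximated with `RandomAffine` (scaling, rotation, translation) and `ColorJitter` (color and contrast variation); all parameter ranges here are assumptions, not the paper's values.

```python
import torch
from torchvision import transforms

# Random composition of elementary transformations (parameter ranges
# are illustrative assumptions, not the paper's exact magnitudes).
random_transform = transforms.Compose([
    transforms.RandomAffine(degrees=20, translate=(0.2, 0.2), scale=(0.7, 1.4)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

def make_surrogate_dataset(seed_patches, K=100):
    """Apply K random transformations to each seed patch x_i.

    All K transformed versions of patch i receive surrogate label i,
    so N seed patches yield N surrogate classes.
    """
    samples, labels = [], []
    for i, patch in enumerate(seed_patches):   # patch: 32x32 PIL image
        for _ in range(K):
            samples.append(random_transform(patch))
            labels.append(i)
    data = torch.stack(samples)
    # Subtract the per-pixel mean over the whole resulting dataset.
    data = data - data.mean(dim=0, keepdim=True)
    return data, torch.tensor(labels)
```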
Exemplary patches sampled from the STL unlabeled dataset which are later augmented by various transformations to obtain surrogate data for the CNN training.
Several random transformations applied to one of the patches extracted from the STL unlabeled dataset. The original (’seed’) patch is in the top left corner.
Just like SimCLR, the augmentation techniques are pre-defined; these augmentations are assumed not to change the true semantic content of the patch.
This supports multi-view!!!
With the surrogate classes generated, a CNN can be trained.
Each of these sets is declared to be a class by assigning label $i$ to the set $S_{x_{i}}$. Formally, the following loss function is minimized:

$$L(X)=\sum_{x_{i}\in X}\sum_{T\in \mathcal{T}_{i}}l(i, Tx_{i}),$$

where $l(i, Tx_{i})$ is the loss on the transformed sample $Tx_{i}$ with (surrogate) true label $i$.
Intuitively, the classification problem described above serves to ensure that different input patches can be distinguished from one another, while at the same time enforcing invariance to the applied transformations.
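Since the surrogate labels turn this into an ordinary $N$-way classification problem, training reduces to standard softmax cross-entropy. A minimal sketch, assuming `model` maps a batch of transformed patches to $N$ logits:

```python
import torch
import torch.nn.functional as F

def train_exemplar_cnn(model, loader, epochs=10, lr=0.01):
    """Minimize L(X) = sum_i sum_{T in T_i} l(i, T x_i):
    plain cross-entropy with the seed index i as surrogate label."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:   # x: transformed patches, y: seed indices i
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```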
After training the CNN on the unlabeled dataset, the CNN features are pooled and used to train a linear SVM on the target dataset, as described in more detail below.
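A sketch of this evaluation protocol. The global max pooling, the `features` submodule (see the architecture sketch below), and `target_train_loader` are illustrative assumptions; scikit-learn's `LinearSVC` stands in for the linear SVM.

```python
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def extract_pooled_features(model, loader):
    """Pool the frozen CNN feature maps into fixed-length vectors.
    Global max pooling is an illustrative choice of pooling scheme."""
    feats, labels = [], []
    for x, y in loader:
        fmap = model.features(x)             # (B, C, H, W) feature maps
        feats.append(fmap.amax(dim=(2, 3)))  # (B, C) pooled vectors
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Train a linear SVM on the labeled target dataset using frozen features
# (`model` and `target_train_loader` are assumed to exist).
X_train, y_train = extract_pooled_features(model, target_train_loader)
svm = LinearSVC(C=1.0).fit(X_train, y_train)
```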
Two networks are used: one small and one large.
All convolutions use $5\times 5$ filters; $2\times 2$ max pooling is applied after the first and second convolutions, and dropout is applied to the fully connected layer.
Really shallow convolutional network!!! Just like the architecture we designed in naive_cnn.
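A sketch of the small variant under this description (two $5\times 5$ conv layers, each followed by $2\times 2$ max pooling, then a fully connected layer with dropout). The channel and unit counts (64, 64, 128) follow the paper's small "64c5-64c5-128f" configuration; for $32\times 32$ inputs the flattened feature map is $64\times 5\times 5$.

```python
import torch.nn as nn

class SmallExemplarCNN(nn.Module):
    """Small network: 5x5 convs with 2x2 max pooling after conv1 and
    conv2, then a fully connected layer with dropout."""
    def __init__(self, num_surrogate_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, num_surrogate_classes),  # N surrogate classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```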
Classification accuracies on several datasets
The features extracted from the larger network match or outperform the best prior result on all datasets.
This is despite the fact that the network is trained purely on unlabeled data, without any manual annotation.
Influence of the number of surrogate training classes.
The number $N$ of surrogate classes is varied between 50 and 32000.
The classification accuracy increases with the number of surrogate classes until it reaches an optimum at about 8000 surrogate classes, after which it plateaus or even decreases.
Classification performance on STL for different numbers of samples per class.
The performance improves with more samples per surrogate class and saturates at around 100 samples.
Influence of removing groups of transformations during generation of the surrogate training data.
The value “0” corresponds to applying random compositions of all elementary transformations: scaling, rotation, translation, color variation, and contrast variation.
Different columns of the plot show the change in classification accuracy when certain types of elementary transformations are discarded.
Sik-Ho Tang. Review — Exemplar-CNN: Discriminative Unsupervised Feature Learning with Convolutional Neural Networks (Self-Supervised Learning).