Each image is treated as a class and projected onto a hypersphere.
Unsupervised Feature Learning via Non-Parametric Instance Discrimination. Instance Discrimination, by UC Berkeley / ICSI, Chinese University of Hong Kong, and Amazon Rekognition. 2018 CVPR, Over 1100 Citations. Unsupervised Learning, Deep Metric Learning, Self-Supervised Learning, Semi-Supervised Learning, Image Classification, Object Detection.
The pipeline of the unsupervised feature learning approach.
The goal is to learn an embedding function $v=f_{\theta}(x)$ without supervision. $f_{\theta}$ is a deep neural network with parameters $\theta$, mapping image $x$ to feature $v$.
A metric is induced over the image space for instances $x$ and $y$:
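$$d_{\theta}(x,y)=\|f_{\theta}(x)-f_{\theta}(y)\|$$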
A good embedding should map visually similar images closer to each other.
Each image instance is treated as a distinct class of its own and a classifier is trained to distinguish between individual instance classes.
If we have $n$ images/instances, we have $n$ classes.
Under the conventional parametric softmax formulation, for image $x$ with feature $v=f_{\theta}(x)$, the probability of it being recognized as the $i$-th example is:
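$$P(i|v)=\frac{\exp(w_{i}^{T}v)}{\sum_{j=1}^{n}\exp(w_{j}^{T}v)}$$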
where $w_{j}$ is a weight vector for class $j$, and $w_{j}^{T}v$ measures how well $v$ matches the $j$-th class, i.e. instance.
A non-parametric variant of the above softmax equation replaces $w_{j}^{T}v$ with $v_{j}^{T}v$, and $\|v\|=1$ is enforced via an L2-normalization layer.
Then the probability $P(i|v)$ becomes:
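$$P(i|v)=\frac{\exp(v_{i}^{T}v/\tau)}{\sum_{j=1}^{n}\exp(v_{j}^{T}v/\tau)}$$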
where $\tau$ is a temperature parameter that controls the concentration level of the distribution (Please feel free to read Distillation for more details about temperature $\tau$). $\tau$ is important for supervised feature learning [43], and also necessary for tuning the concentration of $v$ on the unit sphere.
The learning objective is then to maximize the joint probability:
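$$\prod_{i=1}^{n}P(i|f_{\theta}(x_{i}))$$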
or equivalently to minimize the negative log-likelihood over the training set:
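$$J(\theta)=-\sum_{i=1}^{n}\log P(i|f_{\theta}(x_{i}))$$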
Getting rid of these weight vectors is important, because the learning objective focuses entirely on the feature representation and its induced metric, which can be applied everywhere in the space and to any new instances at test time.
Also, it eliminates the need for computing and storing the gradients for $\{w_{j}\}$, making it more scalable for big data applications.
This makes it suitable for scenarios where the number of classes is large. (Is the non-parametric classifier a kind of contrastive learning?)
To compute the probability $P(i|v)$, $\{v_{j}\}$ for all the images are needed. Instead of exhaustively computing these representations every time, a feature memory bank $V$ is maintained for storing them.
Separate notations are introduced for the memory bank and features forwarded from the network. Let $V=\{v_{j}\}$ be the memory bank and $f_{i}=f_{\theta}(x_{i})$ be the feature of $x_{i}$.
During each learning iteration, the representation $f_{i}$ as well as the network parameters $\theta$ are optimized via stochastic gradient descent.
Then $f_{i}$ is updated to $V$ at the corresponding instance entry $f_{i}\to v_{i}$.
All the representations in the memory bank $V$ are initialized as unit random vectors.
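A minimal sketch of this memory-bank bookkeeping, assuming PyTorch; the class `MemoryBank` and all variable names are illustrative, not taken from the paper's released code:

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Stores one L2-normalized feature v_j per training instance."""

    def __init__(self, n_instances, dim=128):
        # All representations are initialized as unit random vectors.
        self.V = F.normalize(torch.randn(n_instances, dim), dim=1)

    def similarities(self, f):
        # v_j^T f for every stored entry j; f: (batch, dim) -> (batch, n_instances).
        return f @ self.V.t()

    def update(self, f, idx):
        # After the SGD step, write the fresh features back: f_i -> v_i.
        self.V[idx] = F.normalize(f.detach(), dim=1)

# Usage inside a training iteration (f stands in for f_theta(x_i)):
bank = MemoryBank(n_instances=1000, dim=128)
f = F.normalize(torch.randn(32, 128), dim=1)   # batch of features from the backbone
idx = torch.randint(0, 1000, (32,))            # memory-bank indices of these images
logits = bank.similarities(f) / 0.07           # v_j^T f / tau, fed to the softmax / NCE loss
bank.update(f, idx)
```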
Noise-Contrastive Estimation (NCE) is used to approximate the full softmax.
The basic idea is to cast the multi-class classification problem into a set of binary classification problems, where the binary classification task is to discriminate between data samples and noise samples.
(NCE is originally used in NLP. Please feel free to read NCE if interested.)
Specifically, the probability that feature representation $v$ in the memory bank corresponds to the $i$-th example under the model is:
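$$P(i|v)=\frac{\exp(v^{T}f_{i}/\tau)}{Z_{i}},\qquad Z_{i}=\sum_{j=1}^{n}\exp(v_{j}^{T}f_{i}/\tau)$$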
where $Z_{i}$ is the normalizing constant. The noise distribution is formalized as a uniform distribution: $P_{n}=\frac{1}{n}$.
Noise samples are assumed to be $m$ times more frequent than data samples. The posterior probability of sample $i$ with feature $v$ being from the data distribution (denoted by $D=1$) is:
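$$h(i,v):=P(D=1|i,v)=\frac{P(i|v)}{P(i|v)+mP_{n}(i)}$$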
The approximated training objective is to minimize the negative log-posterior distribution of data and noise samples:
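$$J_{NCE}(\theta)=-\mathbb{E}_{P_{d}}\left[\log h(i,v)\right]-m\,\mathbb{E}_{P_{n}}\left[\log\left(1-h(i,v')\right)\right]$$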
Here, $P_{d}$ denotes the actual data distribution. For $P_{d}$, $v$ is the feature corresponding to $x_{i}$; whereas for $P_{n}$, $v'$ is the feature from another image, randomly sampled according to the noise distribution $P_{n}$.
Both $v$ and $v'$ are sampled from the non-parametric memory bank $V$.
Computing the normalizing constant $Z_{i}$ is expensive, so a Monte Carlo approximation is used:
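$$Z_{i}\approx n\,\mathbb{E}_{j}\left[\exp(v_{j}^{T}f_{i}/\tau)\right]=\frac{n}{m}\sum_{k=1}^{m}\exp(v_{j_{k}}^{T}f_{i}/\tau)$$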
where $\{j_{k}\}$ is a random subset of indices. Empirically, the approximation derived from initial batches is sufficient to work well in practice.
NCE reduces the computational complexity from $O(n)$ to $O(1)$ per sample.
The effect of proximal regularization.
An additional term is added to encourage smoothness.
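For each positive instance, the augmented loss is:

$$-\log h(i,v_{i}^{(t-1)})+\lambda\|v_{i}^{(t)}-v_{i}^{(t-1)}\|_{2}^{2}$$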
As learning converges, the difference between iterations, i.e. $v_{i}^{(t)}-v_{i}^{(t-1)}$, gradually vanishes, and the augmented loss is reduced to the original one.
With proximal regularization, the final objective becomes:
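$$J_{NCE}(\theta)=-\mathbb{E}_{P_{d}}\left[\log h(i,v_{i}^{(t-1)})-\lambda\|v_{i}^{(t)}-v_{i}^{(t-1)}\|_{2}^{2}\right]-m\,\mathbb{E}_{P_{n}}\left[\log\left(1-h(i,v'^{(t-1)})\right)\right]$$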
The above figure shows that, empirically, proximal regularization helps stabilize training, speed up convergence, and improve the learned representation, with negligible extra cost.
That is, the representation changes smoothly!
To classify a test image $\hat{x}$, we first compute its feature $\hat{f}=f_{\theta}(\hat{x})$, and then compare it against the embeddings of all the images in the memory bank using the cosine similarity $s_{i}=\cos(v_{i},\hat{f})$.
The top k nearest neighbors, denoted by $N_{k}$, would then be used to make the prediction via weighted voting.
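Concretely, each neighbor $i\in N_{k}$ contributes a weight $\alpha_{i}=\exp(s_{i}/\tau)$, and class $c$ receives the total vote:

$$w_{c}=\sum_{i\in N_{k}}\alpha_{i}\cdot 1(c_{i}=c)$$

The predicted class is the one with the largest $w_{c}$.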
In the experiments, $\tau=0.07$ and $k=200$ are used.
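A sketch of this weighted k-NN prediction under the same assumptions (PyTorch; `knn_predict` and its signature are hypothetical):

```python
import torch

def knn_predict(f_hat, bank_V, bank_labels, n_classes, k=200, tau=0.07):
    """Weighted k-NN voting over the memory bank, as described above.

    f_hat:       (dim,) L2-normalized feature of the test image.
    bank_V:      (n, dim) L2-normalized memory-bank features.
    bank_labels: (n,) class label of each training instance.
    """
    sims = bank_V @ f_hat                      # cosine similarities s_i = cos(v_i, f_hat)
    topk_sims, topk_idx = sims.topk(k)         # the k nearest neighbors N_k
    weights = (topk_sims / tau).exp()          # alpha_i = exp(s_i / tau)
    votes = torch.zeros(n_classes)
    votes.index_add_(0, bank_labels[topk_idx], weights)   # accumulate weighted votes per class
    return votes.argmax().item()
```

Note that the instance labels are needed only at evaluation time; training itself remains unsupervised.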
Sik-Ho Tsang. Review — Unsupervised Feature Learning via Non-Parametric Instance Discrimination.