NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review -- Unsupervised Feature Learning via Non-Parametric Instance Discrimination. #134

Open NorbertZheng opened 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — Unsupervised Feature Learning via Non-Parametric Instance Discrimination.

NorbertZheng commented 1 year ago

Overview

Figure: Each image is treated as a class of its own and projected onto the unit hypersphere.

Unsupervised Feature Learning via Non-Parametric Instance Discrimination. Instance Discrimination, by UC Berkeley / ICSI, Chinese University of Hong Kong, and Amazon Rekognition. 2018 CVPR, Over 1100 Citations. Unsupervised Learning, Deep Metric Learning, Self-Supervised Learning, Semi-Supervised Learning, Image Classification, Object Detection.

NorbertZheng commented 1 year ago

Unsupervised Feature Learning via Non-Parametric Instance Discrimination

Figure: The pipeline of the unsupervised feature learning approach.

Goal

The goal is to learn an embedding function $v=f_{\theta}(x)$ without supervision. $f_{\theta}$ is a deep neural network with parameters $\theta$, mapping image $x$ to feature $v$.

A metric is induced over the image space for instances $x$ and $y$:

$$d_{\theta}(x,y)=\|f_{\theta}(x)-f_{\theta}(y)\|$$

A good embedding should map visually similar images closer to each other.

Each image instance is treated as a distinct class of its own and a classifier is trained to distinguish between individual instance classes.
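
As a rough illustration, here is a minimal PyTorch sketch of such an embedding function. The ResNet-18 backbone and the 128-d output dimension are assumptions for illustration; the point is simply a network whose output is L2-normalized onto the unit hypersphere.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class EmbeddingNet(nn.Module):
    """f_theta: maps an image x to a feature v on the unit hypersphere."""

    def __init__(self, dim=128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Replace the classification head with a low-dimensional projection.
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)
        self.backbone = backbone

    def forward(self, x):
        v = self.backbone(x)
        return F.normalize(v, dim=1)  # enforce ||v|| = 1
```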

NorbertZheng commented 1 year ago

Parametric Classifier: Conventional Softmax

If we have $n$ images/instances, then we have $n$ classes.

Under the conventional parametric softmax formulation, for image $x$ with feature $v=f_{\theta}(x)$, the probability of it being recognized as the $i$-th example is:

$$P(i|v)=\frac{\exp(w_{i}^{T}v)}{\sum_{j=1}^{n}\exp(w_{j}^{T}v)}$$

where $w_{j}$ is a weight vector for class $j$, and $w_{j}^{T}v$ measures how well $v$ matches the $j$-th class, i.e. instance.
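
A hedged sketch of this formulation (sizes and variable names are illustrative, not from the paper): each of the $n$ instance classes keeps its own weight vector $w_{j}$, and the softmax is taken over all $n$ inner products.

```python
import torch
import torch.nn.functional as F

n, d = 10000, 128                        # n instance classes, feature dim (illustrative)
W = torch.randn(n, d)                    # one weight vector w_j per instance class
v = F.normalize(torch.randn(d), dim=0)   # feature v = f_theta(x)

logits = W @ v                           # w_j^T v for every class j
P = F.softmax(logits, dim=0)             # P(i|v) under the parametric softmax
```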

NorbertZheng commented 1 year ago

Proposed Non-Parametric Softmax Classifier

A non-parametric variant of the above softmax equation is to replace $w_{j}^{T}v$ with $v_{j}^{T}v$, and $||v||=1$ is enforced via an L2-normalization layer.

Then the probability $P(i|v)$ becomes:

$$P(i|v)=\frac{\exp(v_{i}^{T}v/\tau)}{\sum_{j=1}^{n}\exp(v_{j}^{T}v/\tau)}$$

where $\tau$ is a temperature parameter that controls the concentration level of the distribution (Please feel free to read Distillation for more details about temperature $\tau$). $\tau$ is important for supervised feature learning [43], and also necessary for tuning the concentration of $v$ on the unit sphere.

The learning objective is then to maximize the joint probability:

$$\prod_{i=1}^{n}P(i|f_{\theta}(x_{i}))$$

or equivalently to minimize the negative log-likelihood over the training set:

$$J(\theta)=-\sum_{i=1}^{n}\log P(i|f_{\theta}(x_{i}))$$
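
A minimal sketch of the non-parametric version, assuming the stored features $v_{j}$ and the query feature are already L2-normalized (names and sizes are illustrative):

```python
import torch
import torch.nn.functional as F

tau = 0.07
n, d = 10000, 128
V = F.normalize(torch.randn(n, d), dim=1)   # stored features v_j, all unit-norm
v = F.normalize(torch.randn(d), dim=0)      # feature of the current image
i = 42                                      # index of the current instance

logits = V @ v / tau                        # v_j^T v / tau replaces w_j^T v
loss = -F.log_softmax(logits, dim=0)[i]     # negative log-likelihood for instance i
```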

NorbertZheng commented 1 year ago

Getting rid of these weight vectors is important, because the learning objective focuses entirely on the feature representation and its induced metric, which can be applied everywhere in the space and to any new instances at test time.

Also, it eliminates the need for computing and storing the gradients for $\{w_{j}\}$, making it more scalable for big data applications.

NorbertZheng commented 1 year ago

This is suitable for scenarios where the number of classes is large. Is the non-parametric classifier a kind of contrastive learning???

NorbertZheng commented 1 year ago

Learning with A Memory Bank and NCE

Memory Bank

To compute the probability $P(i|v)$, $\{v_{j}\}$ for all the images are needed. Instead of exhaustively computing these representations every time, a feature memory bank $V$ is maintained for storing them.

Separate notations are introduced for the memory bank and features forwarded from the network. Let $V=\{v_{j}\}$ be the memory bank and $f_{i}=f_{\theta}(x_{i})$ be the feature of $x_{i}$.

During each learning iteration, the representation $f_{i}$ as well as the network parameters $\theta$ are optimized via stochastic gradient descent.

Then $f_{i}$ is written back to $V$ at the corresponding instance entry: $f_{i}\to v_{i}$.

All the representations in the memory bank $V$ are initialized as unit random vectors.
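
A minimal sketch of the memory bank bookkeeping. The direct overwrite follows the update described above; released implementations may instead blend the old and new entries, which is omitted here.

```python
import torch
import torch.nn.functional as F

n, d = 10000, 128
# All entries are initialized as unit random vectors.
memory = F.normalize(torch.randn(n, d), dim=1)


def update_memory(memory, idx, f):
    """Write the freshly computed features back to their entries: f_i -> v_i."""
    with torch.no_grad():
        memory[idx] = F.normalize(f, dim=1)
```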

NorbertZheng commented 1 year ago

Noise-Contrastive Estimation (NCE)

Noise-Contrastive Estimation (NCE) is used to approximate full Softmax.

The basic idea is to cast the multi-class classification problem into a set of binary classification problems, where each binary classification task is to discriminate between data samples and noise samples.

(NCE is originally used in NLP. Please feel free to read NCE if interested.)

Specifically, the probability that feature representation $v$ in the memory bank corresponds to the $i$-th example under the model is:

$$P(i|v)=\frac{\exp(v^{T}f_{i}/\tau)}{Z_{i}},\qquad Z_{i}=\sum_{j=1}^{n}\exp(v_{j}^{T}f_{i}/\tau)$$

where $Z_{i}$ is the normalizing constant. The noise distribution is formalized as a uniform distribution: $P_{n}=\frac{1}{n}$.

Noise samples are assumed to be $m$ times more frequent than data samples. The posterior probability of sample $i$ with feature $v$ being from the data distribution (denoted by $D=1$) is:

$$h(i,v):=P(D=1|i,v)=\frac{P(i|v)}{P(i|v)+mP_{n}(i)}$$

The approximated training objective is to minimize the negative log-posterior distribution of data and noise samples:

$$J_{NCE}(\theta)=-\mathbb{E}_{P_{d}}\left[\log h(i,v)\right]-m\,\mathbb{E}_{P_{n}}\left[\log(1-h(i,v'))\right]$$

Here, $P_{d}$ denotes the actual data distribution. For $P_{d}$, $v$ is the feature corresponding to $x_{i}$; whereas for $P_{n}$, $v'$ is the feature from another image, randomly sampled according to the noise distribution $P_{n}$.

Both $v$ and $v'$ are sampled from the non-parametric memory bank $V$.

Computing the normalizing constant $Z_{i}$ is expensive, so a Monte Carlo approximation is used:

$$Z\simeq Z_{i}\simeq n\,\mathbb{E}_{j}\left[\exp(v_{j}^{T}f_{i}/\tau)\right]=\frac{n}{m}\sum_{k=1}^{m}\exp(v_{j_{k}}^{T}f_{i}/\tau)$$

where $\{j_{k}\}$ is a random subset of indices. Empirically, the approximation derived from initial batches is sufficient to work well in practice.

NCE reduces the computational complexity from O(n) to O(1) per sample.
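
Putting the pieces together, here is a hedged sketch of the NCE computation. The sizes, the single shared $Z$ estimate, and the uniform noise sampler are illustrative simplifications; only the formulas above are from the paper.

```python
import torch
import torch.nn.functional as F

tau, m = 0.07, 4096                   # temperature and number of noise samples (illustrative)
n, d = 10000, 128
memory = F.normalize(torch.randn(n, d), dim=1)

# Z is estimated once by Monte Carlo from a random subset of memory entries
# and then treated as a constant during training.
with torch.no_grad():
    j = torch.randint(n, (128,))
    Z = n * torch.exp(memory[j] @ memory[0] / tau).mean()


def nce_loss(f, idx):
    """Binary NCE loss for feature f of instance idx against m uniform noise samples."""
    noise_idx = torch.randint(n, (m,))
    p_pos = torch.exp(memory[idx] @ f / tau) / Z        # ~ P(i|v) for the data sample
    p_neg = torch.exp(memory[noise_idx] @ f / tau) / Z  # ~ P(i|v') for noise samples
    Pn = 1.0 / n                                        # uniform noise distribution
    h_pos = p_pos / (p_pos + m * Pn)                    # posterior P(D=1 | i, v)
    h_neg = p_neg / (p_neg + m * Pn)
    return -(torch.log(h_pos) + torch.log(1.0 - h_neg).sum())
```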

NorbertZheng commented 1 year ago

Proximal Regularization

Figure: The effect of proximal regularization.

An additional term is added to encourage smoothness of the representation across iterations.

As learning converges, the difference between iterations, i.e. $v^{(t)}_{i}-v^{(t-1)}_{i}$, gradually vanishes, and the augmented loss is reduced to the original one.

With proximal regularization, the final objective becomes:

$$-\log h(i,v^{(t-1)}_{i})+\lambda\|v^{(t)}_{i}-v^{(t-1)}_{i}\|^{2}_{2}$$

The above figure shows that, empirically, proximal regularization helps stabilize training, speed up convergence, and improve the learned representation, with negligible extra cost.
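
Continuing the sketch above, the proximal term simply adds a squared distance between the freshly computed feature $v^{(t)}_{i}=f_{\theta}(x_{i})$ and the stored entry $v^{(t-1)}_{i}$; the $\lambda$ value below is illustrative, not taken from the paper.

```python
lam = 10.0  # proximal strength lambda (illustrative value)


def proximal_objective(nce_term, v_t, v_prev):
    """-log h(i, v_i^(t-1)) + lambda * ||v_i^(t) - v_i^(t-1)||_2^2,
    where v_t is the freshly computed feature and v_prev is the memory entry."""
    return nce_term + lam * (v_t - v_prev).pow(2).sum()
```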

NorbertZheng commented 1 year ago

Change the representation smoothly!!!

NorbertZheng commented 1 year ago

Weighted K-Nearest Neighbor Classifier

To classify test image $\hat{x}$, we first compute its feature $\hat{f}=f_{\theta}(\hat{x})$, and then compare it against the embeddings of all the images in the memory bank, using the cosine similarity $s_{i}=\cos(v_{i},\hat{f})$.

The top $k$ nearest neighbors, denoted by $N_{k}$, are then used to make the prediction via weighted voting.

Class $c$ then gets a total vote

$$w_{c}=\sum_{i\in N_{k}}\alpha_{i}\cdot 1(c_{i}=c),\qquad \alpha_{i}=\exp(s_{i}/\tau),$$

and the class with the largest vote is the prediction.

$\tau=0.07$ and $k=200$ are used.
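
A minimal sketch of the weighted kNN vote, assuming unit-normalized memory features and integer class labels for the training images (the function and variable names are illustrative):

```python
import torch

tau, k = 0.07, 200


def knn_predict(f_hat, memory, labels, num_classes):
    """Weighted kNN over the memory bank: each neighbor votes with weight exp(s_i / tau)."""
    sims = memory @ f_hat                          # cosine similarities s_i (unit-norm features)
    top_sims, top_idx = sims.topk(k)               # the k nearest neighbors N_k
    weights = torch.exp(top_sims / tau)            # alpha_i = exp(s_i / tau)
    votes = torch.zeros(num_classes)
    votes.index_add_(0, labels[top_idx], weights)  # w_c = sum of alpha_i over neighbors of class c
    return votes.argmax()
```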