Each image is treated as a class and projected onto a hypersphere.
Unsupervised Feature Learning via Non-Parametric Instance Discrimination. Instance Discrimination, by UC Berkeley / ICSI, Chinese University of Hong Kong, and Amazon Rekognition. 2018 CVPR, Over 1100 Citations. Unsupervised Learning, Deep Metric Learning, Self-Supervised Learning, Semi-Supervised Learning, Image Classification, Object Detection.
The pipeline of the unsupervised feature learning approach.
The goal is to learn an embedding function $v=f_{\theta}(x)$ without supervision. $f_{\theta}$ is a deep neural network with parameters $\theta$, mapping image $x$ to feature $v$.
A metric is induced over the image space for instances $x$ and $y$:
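$$d_{\theta}(x,y)=\|f_{\theta}(x)-f_{\theta}(y)\|$$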
A good embedding should map visually similar images closer to each other.
Each image instance is treated as a distinct class of its own and a classifier is trained to distinguish between individual instance classes.
If we have $n$ images/instances, we have $n$ classes.
Under the conventional parametric softmax formulation, for image $x$ with feature $v=f_{\theta}(x)$, the probability of it being recognized as the $i$-th example is:
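$$P(i|v)=\frac{\exp(w_{i}^{T}v)}{\sum_{j=1}^{n}\exp(w_{j}^{T}v)}$$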
where $w_{j}$ is a weight vector for class $j$, and $w_{j}^{T}v$ measures how well $v$ matches the $j$-th class, i.e. instance.
A non-parametric variant of the above softmax equation replaces $w_{j}^{T}v$ with $v_{j}^{T}v$, and $\|v\|=1$ is enforced via an L2-normalization layer.
Then the probability $P(i|v)$ becomes:
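$$P(i|v)=\frac{\exp(v_{i}^{T}v/\tau)}{\sum_{j=1}^{n}\exp(v_{j}^{T}v/\tau)}$$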
where $\tau$ is a temperature parameter that controls the concentration level of the distribution (Please feel free to read Distillation for more details about temperature $\tau$). $\tau$ is important for supervised feature learning [43], and also necessary for tuning the concentration of $v$ on the unit sphere.
The learning objective is then to maximize the joint probability:
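$$\prod_{i=1}^{n}P(i|f_{\theta}(x_{i}))$$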
or equivalently to minimize the negative log-likelihood over the training set:
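$$J(\theta)=-\sum_{i=1}^{n}\log P(i|f_{\theta}(x_{i}))$$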
Getting rid of these weight vectors is important, because the learning objective focuses entirely on the feature representation and its induced metric, which can be applied everywhere in the space and to any new instances at test time.
Also, it eliminates the need for computing and storing the gradients for $\{w_{j}\}$, making it more scalable for big data applications.
This makes it suitable for scenarios where the number of classes is large. (Is the non-parametric classifier a kind of contrastive learning?)
To compute the probability $P(i|v)$, $\{v_{j}\}$ for all the images are needed. Instead of exhaustively computing these representations every time, a feature memory bank $V$ is maintained for storing them.
Separate notations are introduced for the memory bank and features forwarded from the network. Let $V=\{v_{j}\}$ be the memory bank and $f_{i}=f_{\theta}(x_{i})$ be the feature of $x_{i}$.
During each learning iteration, the representation $f_{i}$ as well as the network parameters $\theta$ are optimized via stochastic gradient descent.
Then $f_{i}$ is updated to $V$ at the corresponding instance entry $f_{i}\to v_{i}$.
All the representations in the memory bank $V$ are initialized as unit random vectors.
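A minimal sketch of this memory-bank bookkeeping, assuming PyTorch; the class `MemoryBank` and all variable names are illustrative, not taken from the paper's released code:

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Stores one L2-normalized feature v_j per training instance."""

    def __init__(self, n_instances, dim=128):
        # All representations are initialized as unit random vectors.
        self.V = F.normalize(torch.randn(n_instances, dim), dim=1)

    def similarities(self, f):
        # v_j^T f for every stored entry j; f: (batch, dim) -> (batch, n_instances).
        return f @ self.V.t()

    def update(self, f, idx):
        # After the SGD step, write the fresh features back: f_i -> v_i.
        self.V[idx] = F.normalize(f.detach(), dim=1)

# Usage inside a training iteration (f stands in for f_theta(x_i)):
bank = MemoryBank(n_instances=1000, dim=128)
f = F.normalize(torch.randn(32, 128), dim=1)   # batch of features from the backbone
idx = torch.randint(0, 1000, (32,))            # memory-bank indices of these images
logits = bank.similarities(f) / 0.07           # v_j^T f / tau, fed to the softmax / NCE loss
bank.update(f, idx)
```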
Noise-Contrastive Estimation (NCE) is used to approximate the full softmax.
The basic idea is to cast the multi-class classification problem into a set of binary classification problems, where the binary classification task is to discriminate between data samples and noise samples.
(NCE is originally used in NLP. Please feel free to read NCE if interested.)
Specifically, the probability that feature representation $v$ in the memory bank corresponds to the $i$-th example under the model is:
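$$P(i|v)=\frac{\exp(v^{T}f_{i}/\tau)}{Z_{i}},\qquad Z_{i}=\sum_{j=1}^{n}\exp(v_{j}^{T}f_{i}/\tau)$$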
where $Z_{i}$ is the normalizing constant. The noise distribution is formalized as a uniform distribution: $P_{n}=\frac{1}{n}$.
Noise samples are assumed to be $m$ times more frequent than data samples. The posterior probability of sample $i$ with feature $v$ being from the data distribution (denoted by $D=1$) is:
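$$h(i,v):=P(D=1|i,v)=\frac{P(i|v)}{P(i|v)+mP_{n}(i)}$$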
The approximated training objective is to minimize the negative log-posterior distribution of data and noise samples:
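$$J_{NCE}(\theta)=-\mathbb{E}_{P_{d}}\left[\log h(i,v)\right]-m\,\mathbb{E}_{P_{n}}\left[\log\left(1-h(i,v')\right)\right]$$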
Here, $P_{d}$ denotes the actual data distribution. For $P_{d}$, $v$ is the feature corresponding to $x_{i}$; whereas for $P_{n}$, $v'$ is the feature from another image, randomly sampled according to the noise distribution $P_{n}$.
Both $v$ and $v'$ are sampled from the non-parametric memory bank $V$.
Computing the normalizing constant $Z_{i}$ is expensive, so a Monte Carlo approximation is used:
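$$Z_{i}\approx n\,\mathbb{E}_{j}\left[\exp(v_{j}^{T}f_{i}/\tau)\right]=\frac{n}{m}\sum_{k=1}^{m}\exp(v_{j_{k}}^{T}f_{i}/\tau)$$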
where $\{j_{k}\}$ is a random subset of indices. Empirically, the approximation derived from initial batches is sufficient to work well in practice.
NCE reduces the computational complexity from $O(n)$ to $O(1)$ per sample.
The effect of proximal regularization.
An additional term is added to encourage smoothness.
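For each positive instance, the augmented loss is:

$$-\log h(i,v_{i}^{(t-1)})+\lambda\|v_{i}^{(t)}-v_{i}^{(t-1)}\|_{2}^{2}$$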
As learning converges, the difference between iterations, i.e. $v_{i}^{(t)}-v_{i}^{(t-1)}$, gradually vanishes, and the augmented loss is reduced to the original one.
With proximal regularization, the final objective becomes:
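$$J_{NCE}(\theta)=-\mathbb{E}_{P_{d}}\left[\log h(i,v_{i}^{(t-1)})-\lambda\|v_{i}^{(t)}-v_{i}^{(t-1)}\|_{2}^{2}\right]-m\,\mathbb{E}_{P_{n}}\left[\log\left(1-h(i,v'^{(t-1)})\right)\right]$$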
The above figure shows that, empirically, proximal regularization helps stabilize training, speed up convergence, and improve the learned representation, with negligible extra cost.
That is, the representation changes smoothly!
To classify a test image $\hat{x}$, we first compute its feature $\hat{f}=f_{\theta}(\hat{x})$, and then compare it against the embeddings of all the images in the memory bank using the cosine similarity $s_{i}=\cos(v_{i},\hat{f})$.
The top k nearest neighbors, denoted by $N_{k}$, would then be used to make the prediction via weighted voting.
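Concretely, each neighbor $i\in N_{k}$ contributes a weight $\alpha_{i}=\exp(s_{i}/\tau)$, and class $c$ receives the total vote:

$$w_{c}=\sum_{i\in N_{k}}\alpha_{i}\cdot 1(c_{i}=c)$$

The predicted class is the one with the largest $w_{c}$.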
In the experiments, $\tau=0.07$ and $k=200$ are used.
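A sketch of this weighted k-NN prediction under the same assumptions (PyTorch; `knn_predict` and its signature are hypothetical):

```python
import torch

def knn_predict(f_hat, bank_V, bank_labels, n_classes, k=200, tau=0.07):
    """Weighted k-NN voting over the memory bank, as described above.

    f_hat:       (dim,) L2-normalized feature of the test image.
    bank_V:      (n, dim) L2-normalized memory-bank features.
    bank_labels: (n,) class label of each training instance.
    """
    sims = bank_V @ f_hat                      # cosine similarities s_i = cos(v_i, f_hat)
    topk_sims, topk_idx = sims.topk(k)         # the k nearest neighbors N_k
    weights = (topk_sims / tau).exp()          # alpha_i = exp(s_i / tau)
    votes = torch.zeros(n_classes)
    votes.index_add_(0, bank_labels[topk_idx], weights)   # accumulate weighted votes per class
    return votes.argmax().item()
```

Note that the instance labels are needed only at evaluation time; training itself remains unsupervised.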
Sik-Ho Tsang. Review — Unsupervised Feature Learning via Non-Parametric Instance Discrimination.