Representation Learning Using InfoNCE Loss. In this story, Representation Learning with Contrastive Predictive Coding (CPC/CPCv1), by DeepMind, is reviewed. This is a 2018 arXiv paper with over 1800 citations. In this paper, NCE and Negative Sampling from NLP are used for representation learning/self-supervised learning.
Unimodal losses such as mean squared error and cross-entropy are not very useful, and modeling $p(x|c)$ directly may not be optimal, where $x$ is the target (future) and $c$ is the context (present).
For example, images may contain thousands of bits of information while the high-level latent variables such as the class label contain much less information (10 bits for 1,024 categories).
However, unsupervised learning has yet to see a similar breakthrough.
Predictive coding has been used for a long time in data compression.
The main intuition behind the model is to learn representations that encode the underlying shared information between different parts of the (high-dimensional) signal, while discarding low-level information and noise that are more local.
When predicting future information, CPC instead encodes the target $x$ (future) and context $c$ (present) into compact distributed vector representations in a way that maximally preserves the mutual information of the original signals $x$ and $c$, defined as:
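$$I(x; c) = \sum_{x,c} p(x, c)\,\log\frac{p(x|c)}{p(x)}$$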
As argued in the previous section, we do not predict future observations $x_{t+k}$ directly with a generative model $p_{k}(x_{t+k}|c_{t})$. Instead, we model a density ratio which preserves the mutual information between $x_{t+k}$ and $c_{t}$:
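$$f_{k}(x_{t+k}, c_{t}) \propto \frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}$$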
A linear transformation $W_{k}^{T}c_{t}$ is used for the prediction, with a different $W_{k}$ for every step $k$.
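In the paper, a simple log-bilinear model is used for this score:

$$f_{k}(x_{t+k}, c_{t}) = \exp\left(z_{t+k}^{T} W_{k}\, c_{t}\right)$$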
In the proposed model, either $z_{t}$ or $c_{t}$ could be used as the representation for downstream tasks.
Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE, called InfoNCE.
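To make the setup concrete, here is a minimal PyTorch-style sketch of the two components. Layer sizes are chosen for illustration only (the paper's audio model uses a deeper strided convolutional encoder and a 256-dimensional GRU), and the class and variable names are mine, not from any released code.

```python
import torch
import torch.nn as nn

class CPCAudioModel(nn.Module):
    """Simplified sketch of the CPC audio setup (illustrative layer sizes)."""

    def __init__(self, z_dim=64, c_dim=64, max_k=12):
        super().__init__()
        # g_enc: strided convolutions mapping raw audio x_t to latents z_t
        self.encoder = nn.Sequential(
            nn.Conv1d(1, z_dim, kernel_size=10, stride=5, padding=3), nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # g_ar: autoregressive model summarizing z_{<=t} into the context c_t
        self.gru = nn.GRU(z_dim, c_dim, batch_first=True)
        # one prediction matrix W_k per future step k
        self.W = nn.ModuleList(
            [nn.Linear(c_dim, z_dim, bias=False) for _ in range(max_k)]
        )

    def forward(self, x):
        # x: (B, 1, T) raw waveform
        z = self.encoder(x).transpose(1, 2)  # (B, T', z_dim) latents z_t
        c, _ = self.gru(z)                   # (B, T', c_dim) contexts c_t
        return z, c                          # either z_t or c_t can feed downstream tasks
```

Either `z` or `c` from the forward pass can then be fed to a simple (e.g., linear) classifier for the downstream evaluations described below.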
Given a set $X=\{x_{1},...,x_{N}\}$ of $N$ random samples containing one positive sample from $p(x_{t+k}|c_{t})$ and $N-1$ negative samples from the "proposal" distribution $p(x_{t+k})$, the following InfoNCE loss is optimized:
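$$\mathcal{L}_{N} = -\mathbb{E}_{X}\left[\log \frac{f_{k}(x_{t+k}, c_{t})}{\sum_{x_{j} \in X} f_{k}(x_{j}, c_{t})}\right]$$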
Recall from the previous section that $f_{k}(x_{t+k}, c_{t})$ models the density ratio $p(x_{t+k}|c_{t})/p(x_{t+k})$.
Optimizing this loss will result in $f_{k}(x_{t+k},c_{t})$ estimating the density ratio mentioned in the previous section.
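Below is a minimal sketch of how this loss is typically computed in practice, assuming the log-bilinear score above and using the other samples in a mini-batch as negatives; the function name, shapes, and variable names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_future, c_t, W_k):
    """Minimal InfoNCE sketch (illustrative names and shapes).

    z_future: (B, z_dim) latents z_{t+k}, one positive per sequence in the batch
    c_t:      (B, c_dim) context vectors from the autoregressive model
    W_k:      (z_dim, c_dim) prediction matrix for step k

    For each context, its own future latent is the positive; the other B-1
    futures in the batch act as negatives from the proposal distribution.
    """
    pred = c_t @ W_k.T                      # (B, z_dim): W_k c_t, prediction of z_{t+k}
    logits = pred @ z_future.T              # (B, B): log f_k up to a constant
    labels = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)  # categorical cross-entropy over the N samples
```

Here the negatives come from other sequences in the mini-batch, which is one common choice; negatives can also be drawn from elsewhere in the same sequence, and this choice affects what the representation captures.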
The optimal probability for this loss is written as $p(d=i|X,c_{t})$, with $[d=i]$ being the indicator that sample $x_{i}$ is the "positive" sample.
The probability that sample $x_{i}$ was drawn from the conditional distribution $p(x_{t+k}|c_{t})$ rather than the proposal distribution $p(x_{t+k})$ can be derived as follows:
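$$p(d=i|X, c_{t}) = \frac{p(x_{i}|c_{t})\prod_{l \neq i} p(x_{l})}{\sum_{j=1}^{N} p(x_{j}|c_{t})\prod_{l \neq j} p(x_{l})} = \frac{\frac{p(x_{i}|c_{t})}{p(x_{i})}}{\sum_{j=1}^{N}\frac{p(x_{j}|c_{t})}{p(x_{j})}}$$

This is simply the softmax of the density ratios $\frac{p(x_{i}|c_{t})}{p(x_{i})}$ over the $N$ samples.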
As seen, the optimal value for $f_{k}(x_{t+k},c_{t})$ is independent of the choice of the number of negative samples $N-1$.
Minimizing the InfoNCE loss $\mathcal{L}_{N}$ is actually maximizing a lower bound on the mutual information $I(x_{t+k}, c_{t})$:
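$$I(x_{t+k}; c_{t}) \geq \log(N) - \mathcal{L}_{N}$$

The bound becomes tighter as $N$ increases.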
The proof (by splitting $X$ into the positive example and the negative examples $X_{neg}$):
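A compact version of the argument, following the paper's appendix, is sketched below; the approximation becomes more accurate as $N$ grows and uses $\mathbb{E}_{x_{j}\sim p(x)}\left[\frac{p(x_{j}|c_{t})}{p(x_{j})}\right] = 1$ for the negatives:

$$
\begin{aligned}
\mathcal{L}_{N}^{\text{opt}} &= -\mathbb{E}_{X} \log\left[\frac{\frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}}{\frac{p(x_{t+k}|c_{t})}{p(x_{t+k})} + \sum_{x_{j}\in X_{neg}}\frac{p(x_{j}|c_{t})}{p(x_{j})}}\right] \\
&= \mathbb{E}_{X} \log\left[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_{t})}\sum_{x_{j}\in X_{neg}}\frac{p(x_{j}|c_{t})}{p(x_{j})}\right] \\
&\approx \mathbb{E}_{X} \log\left[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_{t})}(N-1)\right] \\
&\geq \mathbb{E}_{X} \log\left[\frac{p(x_{t+k})}{p(x_{t+k}|c_{t})}\,N\right] \\
&= -I(x_{t+k}; c_{t}) + \log(N)
\end{aligned}
$$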
(InfoNCE loss is highly related to NCE & Negative Sampling used in NLP. Please feel free to read it if interested.)
LibriSpeech phone and speaker classification results. For phone classification there are 41 possible classes and for speaker classification 251.
For phone classification, CPC with a linear classifier obtains 64.6% accuracy. When a single hidden layer is used instead, the accuracy increases from 64.6% to 72.5%, which is closer to the accuracy of the fully supervised model.
Interestingly, CPCs capture both speaker identity and speech contents, as demonstrated by the good accuracies attained with a simple linear classifier, which also gets close to the oracle, fully supervised networks.
LibriSpeech phone classification ablation experiments.
t-SNE visualization of audio (speech) representations for a subset of 10 speakers (out of 251).
Visualization of Contrastive Predictive Coding for images.
Every row shows image patches that activate a certain neuron in the CPC architecture.
ImageNet top-1 unsupervised classification results.
ImageNet top-5 unsupervised classification results.
Classification accuracy on five common NLP benchmarks ([40] is Doc2Vec).
The performance of CPC is very similar to the Skip-Thought vector model [26], with the advantage that it does not require a powerful LSTM as a word-level decoder and is therefore much faster to train.
Reinforcement learning results for 5 DeepMind Lab tasks (black: batched A2C baseline, red: with auxiliary contrastive loss). Five reinforcement learning tasks in 3D environments of DeepMind Lab [51] are tested: rooms_watermaze, explore_goal_locations_small, seekavoid_arena_01, lasertag_three_opponents_small, and rooms_keys_doors_puzzle. The standard batched A2C [52] agent is used as the base model.
Later on, CPCv2 was published in 2020 ICLR; I hope to have time to review it in the near future.
[2018 arXiv] [CPC/CPCv1] Representation Learning with Contrastive Predictive Coding
Self-Supervised Learning 2008–2010 [Stacked Denoising Autoencoders] 2014 [Exemplar-CNN] 2015 [Context Prediction] 2016 [Context Encoders] [Colorization] [Jigsaw Puzzles] 2017 [L³-Net] [Split-Brain Auto] [Mean Teacher] 2018 [RotNet/Image Rotations] [DeepCluster] [CPC/CPCv1]
Sik-Ho Tsang. Review: Representation Learning with Contrastive Predictive Coding (CPC/CPCv1).