NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review: Representation Learning with Contrastive Predictive Coding (CPC/CPCv1). #133

Closed NorbertZheng closed 10 months ago

NorbertZheng commented 10 months ago

Sik-Ho Tang. Review: Representation Learning with Contrastive Predictive Coding (CPC/CPCv1).

NorbertZheng commented 10 months ago

Overview

Representation Learning Using InfoNCE Loss.


In this story, Representation Learning with Contrastive Predictive Coding (CPC/CPCv1), by DeepMind, is reviewed.

This is a 2018 arXiv paper with over 1800 citations. It makes use of NCE and Negative Sampling from NLP for representation learning/self-supervised learning.

NorbertZheng commented 10 months ago

Motivation and Intuition of Contrastive Predictive Coding (CPC)

Unimodal losses such as mean squared error and cross-entropy are not very useful.

For example, images may contain thousands of bits of information while the high-level latent variables such as the class label contain much less information (10 bits for 1,024 categories).

However, unsupervised learning has yet to see such a breakthrough.

The main intuition of the model is to learn representations that encode the underlying shared information between different parts of the (high-dimensional) signal, while discarding low-level information and noise that is more local.

When predicting future information, CPC instead encodes the target $x$ (future) and context $c$ (present) into compact distributed vector representations in a way that maximally preserves the mutual information of the original signals $x$ and $c$, defined as:

$$I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x|c)}{p(x)}$$

NorbertZheng commented 10 months ago

Contrastive Predictive Coding (CPC): Overview

Figure: Contrastive Predictive Coding (CPC) overview. (Although this figure shows audio as input, the same setup is used for images, text and reinforcement learning.)

Model:

As argued in the previous section, we do not predict future observations $x_{t+k}$ directly with a generative model $p_{k}(x_{t+k}|c_{t})$. Instead we model a density ratio which preserves the mutual information between $x_{t+k}$ and $c_{t}$:

$$f_{k}(x_{t+k}, c_{t}) \propto \frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}$$

Score-based Model???

A simple log-bilinear model can be used:

$$f_{k}(x_{t+k}, c_{t}) = \exp\left(z_{t+k}^{T} W_{k} c_{t}\right)$$

A linear transformation $W_{k}^{T} c_{t}$ is used for the prediction, with a different $W_{k}$ for every step $k$.

In the proposed model, either of $z_{t}$ and $c_{t}$ can be used as the representation for downstream tasks.
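
To make this concrete, here is a minimal PyTorch-style sketch of the pieces named above: the encoder $g_{enc}$ producing $z_{t}$, the autoregressive model $g_{ar}$ producing $c_{t}$, and one linear head $W_{k}$ per prediction step. The strides/kernel sizes follow the paper's audio configuration, but the padding, class layout and default sizes are my own simplifications, not the authors' code.

```python
import torch
import torch.nn as nn

class CPCModel(nn.Module):
    """Minimal CPC sketch: g_enc -> z_t, GRU g_ar -> c_t,
    and one linear head W_k per future step k."""
    def __init__(self, z_dim=512, c_dim=256, n_prediction_steps=12):
        super().__init__()
        # Strided 1D convolutions (the audio encoder uses strides 5,4,2,2,2,
        # i.e. a total downsampling factor of 160 = one z_t per 10 ms at 16 kHz).
        strides, kernels = [5, 4, 2, 2, 2], [10, 8, 4, 4, 4]
        layers, in_ch = [], 1
        for s, k in zip(strides, kernels):
            layers += [nn.Conv1d(in_ch, z_dim, kernel_size=k, stride=s, padding=k // 2),
                       nn.ReLU()]
            in_ch = z_dim
        self.g_enc = nn.Sequential(*layers)
        self.g_ar = nn.GRU(z_dim, c_dim, batch_first=True)
        # A different W_k for every step k; the log-bilinear score is
        # exp(z_{t+k}^T W_k c_t).
        self.W = nn.ModuleList([nn.Linear(c_dim, z_dim, bias=False)
                                for _ in range(n_prediction_steps)])

    def forward(self, x):                     # x: (batch, 1, n_samples)
        z = self.g_enc(x).transpose(1, 2)     # (batch, T, z_dim)
        c, _ = self.g_ar(z)                   # (batch, T, c_dim)
        return z, c
```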

NorbertZheng commented 10 months ago

InfoNCE Loss and Mutual Information Estimation

Both the encoder and autoregressive model are trained to jointly optimize a loss based on NCE, called InfoNCE.

Given a set $X=\{x_{1}, \dots, x_{N}\}$ of $N$ random samples containing one positive sample from $p(x_{t+k}|c_{t})$ and $N-1$ negative samples from the 'proposal' distribution $p(x_{t+k})$, the following loss is optimized:

$$\mathcal{L}_{N} = -\mathbb{E}_{X}\left[\log \frac{f_{k}(x_{t+k}, c_{t})}{\sum_{x_{j} \in X} f_{k}(x_{j}, c_{t})}\right]$$

Recall from the previous section that:

$$f_{k}(x_{t+k}, c_{t}) \propto \frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}$$

Optimizing this loss will result in $f_{k}(x_{t+k}, c_{t})$ estimating the density ratio mentioned in the previous section.
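
A hedged sketch of how this loss can be computed in practice with the log-bilinear score, treating the other examples in the minibatch as the $N-1$ negatives. The cross-entropy formulation is an equivalent rewriting of the InfoNCE expression above, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_future, c_t, W_k):
    """z_future: (batch, z_dim) true z_{t+k} for each sequence,
    c_t: (batch, c_dim) context at time t, W_k: linear head for step k.
    Row i's own z_{t+k} is its positive; the other rows serve as negatives."""
    pred = W_k(c_t)                        # W_k c_t, shape (batch, z_dim)
    scores = pred @ z_future.t()           # scores[i, j] = pred_i . z_j = log f_k
    labels = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy of picking the positive (the diagonal) == InfoNCE.
    return F.cross_entropy(scores, labels)
```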

The probability that sample $x_{i}$ was drawn from the conditional distribution $p(x_{t+k}|c_{t})$ rather than the proposal distribution $p(x_{t+k})$ can be derived as follows:

$$p(d=i|X, c_{t}) = \frac{p(x_{i}|c_{t})\prod_{l \neq i} p(x_{l})}{\sum_{j=1}^{N} p(x_{j}|c_{t})\prod_{l \neq j} p(x_{l})} = \frac{\frac{p(x_{i}|c_{t})}{p(x_{i})}}{\sum_{j=1}^{N} \frac{p(x_{j}|c_{t})}{p(x_{j})}}$$

As seen, the optimal value for $f(x_{t+k}, c_{t})$ is independent of the choice of the number of negative samples $N-1$, i.e. it does not depend on the denominator.

Minimizing the InfoNCE loss $\mathcal{L}_{N}$ is actually maximizing a lower bound on the mutual information $I(x_{t+k}, c_{t})$:

$$I(x_{t+k}, c_{t}) \geq \log(N) - \mathcal{L}_{N}$$

The proof (by splitting $X$ into the positive example and the negative examples $X_{neg}$):

$$\begin{aligned}
\mathcal{L}_{N}^{\text{opt}} &= -\mathbb{E}_{X} \log\left[\frac{\frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}}{\frac{p(x_{t+k}|c_{t})}{p(x_{t+k})} + \sum_{x_{j} \in X_{neg}} \frac{p(x_{j}|c_{t})}{p(x_{j})}}\right] \\
&= \mathbb{E}_{X} \log\left[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_{t})} \sum_{x_{j} \in X_{neg}} \frac{p(x_{j}|c_{t})}{p(x_{j})}\right] \\
&\approx \mathbb{E}_{X} \log\left[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_{t})} (N-1)\right] \\
&\geq \mathbb{E}_{X} \log\left[\frac{p(x_{t+k})}{p(x_{t+k}|c_{t})} N\right] \\
&= -I(x_{t+k}, c_{t}) + \log(N)
\end{aligned}$$
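
One implication worth spelling out (my own note, consistent with the paper's observation that the bound becomes tighter for larger $N$): since $\mathcal{L}_{N} \geq 0$, the bound can never certify more than $\log N$ nats of mutual information, so using more negative samples allows a larger estimate. Taking the audio setup's minibatch of 8 as an example:

$$I(x_{t+k}, c_{t}) \;\geq\; \log(8) - \mathcal{L}_{8}, \qquad \log(8) \approx 2.08 \text{ nats}.$$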

(InfoNCE loss is highly related to NCE & Negative Sampling used in NLP. Please feel free to read it if interested.)

NorbertZheng commented 10 months ago

Experiments for Audio

Pretext Task

A 100-hour subset of the publicly available LibriSpeech dataset is used, which does not have labels, only raw text.

The authors have made the aligned phone labels and their train/test split available. The dataset contains speech from 251 different speakers.

Model:

There is a feature vector for every 10ms of speech.

A minibatch of 8 examples is used, from which the negative samples in the contrastive loss are drawn.

12 timesteps in the future are predicted using the contrastive loss.

Figure: Average accuracy of predicting the positive sample in the contrastive loss, for 1 to 20 latent steps in the future of a speech waveform.

The prediction task becomes harder as the target is further away.
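
The figure above reports the average accuracy of picking the positive sample; below is a minimal sketch of how such a per-step accuracy can be computed from the log-bilinear scores, following the conventions of the earlier sketches (not the authors' evaluation code).

```python
import torch

def positive_sample_accuracy(z_future, c_t, W_k):
    """Fraction of minibatch rows where the true z_{t+k} (the positive)
    receives the highest score among all entries in the batch."""
    scores = W_k(c_t) @ z_future.t()          # (batch, batch)
    predicted = scores.argmax(dim=1)
    targets = torch.arange(scores.size(0), device=scores.device)
    return (predicted == targets).float().mean()
```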

Downstream Task

Table: LibriSpeech phone and speaker classification results. For phone classification there are 41 possible classes and for speaker classification 251.

To understand the representations extracted by CPC, the phone prediction performance is measured with a linear classifier trained on top of these features.
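
A minimal sketch of such a linear probe on frozen CPC features (the feature tensor, optimizer, and number of epochs here are assumptions for illustration; the paper uses a multi-class linear logistic-regression classifier, and the encoder is not fine-tuned):

```python
import torch
import torch.nn as nn

def train_linear_probe(features, labels, n_classes=41, epochs=20, lr=1e-3):
    """features: (n_frames, feat_dim) frozen CPC representations (e.g. c_t),
    labels: (n_frames,) integer phone labels. Only the linear layer is trained."""
    probe = nn.Linear(features.size(1), n_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(probe(features), labels)
        loss.backward()
        optimizer.step()
    return probe
```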

Model:

Results:

Table: LibriSpeech phone classification ablation experiments.

Some other ablation experiments are also performed, e.g. on how many steps to predict; predicting 12 steps obtains the best CPC representations.

Also, where the negative samples are drawn from is tested (in this part, all models predict 12 steps; "excl." means excluding negative samples from the current sequence).

Figure: t-SNE visualization of audio (speech) representations for a subset of 10 speakers (out of 251).

NorbertZheng commented 10 months ago

Just like EEG & MEG across different subjects???

NorbertZheng commented 10 months ago

Experiments for Vision

Pretext Task

Figure: Visualization of Contrastive Predictive Coding for images.

Model:

Figure: Every row shows image patches that activate a certain neuron in the CPC architecture.
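
For reference, in the paper's vision setup each 256×256 image is divided into a 7×7 grid of 64×64 patches with 32 pixels overlap; each patch is encoded to a vector, and the patch representations of the rows below are predicted from the rows above with the contrastive loss. A small sketch of the patch extraction step (function name and tensor layout are my own):

```python
import torch

def extract_patches(images, patch_size=64, stride=32):
    """Cut a batch of 256x256 images into a 7x7 grid of 64x64 patches
    with 32-pixel (50%) overlap."""
    # images: (batch, channels, 256, 256)
    patches = images.unfold(2, patch_size, stride).unfold(3, patch_size, stride)
    # -> (batch, channels, 7, 7, 64, 64); bring the grid dimensions forward.
    return patches.permute(0, 2, 3, 1, 4, 5).contiguous()

x = torch.randn(2, 3, 256, 256)
print(extract_patches(x).shape)  # torch.Size([2, 7, 7, 3, 64, 64])
```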

NorbertZheng commented 10 months ago

Downstream Task

Table: ImageNet top-1 unsupervised classification results.

Table: ImageNet top-5 unsupervised classification results.

CPC outperforms Context Prediction, Colorization, Jigsaw Puzzles, BiGAN, etc.

Despite being relatively domain agnostic (after all, the model was originally proposed for sequential data), CPC improves upon the state of the art by 9% absolute in top-1 accuracy and 4% absolute in top-5 accuracy.

NorbertZheng commented 10 months ago

Experiments for Natural Language

Pretext Task

Downstream Task

Table: Classification accuracy on five common NLP benchmarks ([40] is Doc2Vec).

For the classification tasks, the following datasets are used: movie review sentiment (MR) [43], customer product reviews (CR) [44], subjectivity/objectivity [45], opinion polarity (MPQA) [46] and question-type classification (TREC) [47].

A logistic regression classifier is trained.

The performance of CPC is very similar to the Skip-Thought vector model [26], with the advantage that it does not require a powerful LSTM as a word-level decoder and is therefore much faster to train.
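
A minimal sketch of this evaluation protocol (the embeddings and labels below are random placeholders standing in for frozen CPC sentence features, and the feature dimension is an arbitrary choice, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for frozen CPC sentence embeddings + labels.
rng = np.random.default_rng(0)
train_X, train_y = rng.standard_normal((1000, 512)), rng.integers(0, 2, 1000)
test_X, test_y = rng.standard_normal((200, 512)), rng.integers(0, 2, 200)

# Only the logistic regression classifier is trained on top of the embeddings.
clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)
print("accuracy:", clf.score(test_X, test_y))
```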

NorbertZheng commented 10 months ago

Experiments for Reinforcement Learning

Figure: Reinforcement learning results for 5 DeepMind Lab tasks. Black: batched A2C baseline; red: with auxiliary contrastive loss.

5 reinforcement learning tasks in the 3D environments of DeepMind Lab [51] are tested: rooms_watermaze, explore_goal_locations_small, seekavoid_arena_01, lasertag_three_opponents_small and rooms_keys_doors_puzzle.

The standard batched A2C [52] agent is used as the base model.

The unroll length for the A2C is 100 steps, and up to 30 steps in the future are predicted to derive the contrastive loss.

For 4 out of the 5 games, performance of the agent improves significantly with the contrastive loss after training on 1 billion frames.

NorbertZheng commented 10 months ago

Later on, CPCv2 was published at ICLR 2020; I hope I have time to review it in the future.

NorbertZheng commented 10 months ago

Reference