Deep InfoMax (DIM) with Global & Local Objectives.
Learning Deep Representations by Mutual Information Estimation and Maximization, Deep InfoMax (DIM), by Microsoft Research, 2019 ICLR, Over 1800 Citations. Self-Supervised Learning, Contrastive Learning, Image Classification.
Self-supervised representation learning is based on maximizing mutual information between features extracted from multiple views of a shared context.
While multiple views could be produced by, for example, different augmentations or sensory modalities of the same context, DIM uses the global summary vector and the local feature map of a single image as its views.
This is a paper from Prof. Bengio's research group.
The encoder $E_{\psi}$ should be trained such that the mutual information between the input and its representation is maximized:

$$\max_{\psi} I(X; E_{\psi}(X))$$
Depending on the end-goal, this maximization can be done over the complete input, $X$, or some structured or “local” subset.
Deep InfoMax (DIM) with a global $MI(X; Y)$ objective.
One of the approaches follows Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018), which uses a lower bound on the MI based on the Donsker-Varadhan representation (DV, Donsker & Varadhan, 1983) of the KL-divergence:

$$I(X; Y) \geq \widehat{I}^{(DV)}_{\omega}(X; Y) := \mathbb{E}_{\mathbb{J}}[T_{\omega}(x, y)] - \log \mathbb{E}_{\mathbb{M}}[e^{T_{\omega}(x, y)}],$$

where $\mathbb{J}$ is the joint distribution, $\mathbb{M}$ is the product of marginals, and $T_{\omega}: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a discriminator function modeled by a neural network with parameters $\omega$.
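As a concrete illustration, here is a minimal PyTorch sketch of the DV bound, assuming the discriminator has already been evaluated on paired (joint) and mismatched (marginal) samples; the function name and tensor shapes are my own, not from the paper:

```python
import math

import torch

def dv_mi_lower_bound(t_joint: torch.Tensor, t_marginal: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan bound: E_J[T] - log E_M[exp(T)].

    t_joint:    discriminator scores T_w(x, E(x)) on paired samples, shape (N,).
    t_marginal: scores T_w(x', E(x)) on mismatched samples, shape (N,).
    """
    # Stable log-mean-exp over the marginal scores.
    log_mean_exp = torch.logsumexp(t_marginal, dim=0) - math.log(t_marginal.numel())
    return t_joint.mean() - log_mean_exp
```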
At a high level, the encoder $E_{\psi}$ is optimized by simultaneously estimating and maximizing $I(X; E_{\psi}(X))$:

$$(\hat{\omega}, \hat{\psi})_{G} = \arg\max_{\omega, \psi} \widehat{I}_{\omega}(X; E_{\psi}(X)),$$

where the subscript $G$ denotes “global”.
Since the exact MI value is not needed for representation learning, DIM can instead use a Jensen-Shannon MI estimator (following the formulation of Nowozin et al., 2016):

$$\widehat{I}^{(JSD)}_{\omega,\psi}(X; E_{\psi}(X)) := \mathbb{E}_{\mathbb{P}}\big[-\mathrm{sp}(-T_{\psi,\omega}(x, E_{\psi}(x)))\big] - \mathbb{E}_{\mathbb{P} \times \tilde{\mathbb{P}}}\big[\mathrm{sp}(T_{\psi,\omega}(x', E_{\psi}(x)))\big],$$

where $x$ is an input sample, $x'$ is an input sampled from $\tilde{\mathbb{P}} = \mathbb{P}$, and $\mathrm{sp}(z) = \log(1 + e^{z})$ is the softplus function.
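A matching sketch of the JSD estimator, under the same assumed score tensors as above:

```python
import torch
import torch.nn.functional as F

def jsd_mi_estimate(t_joint: torch.Tensor, t_marginal: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon MI estimator: E_P[-sp(-T)] - E_{P x P~}[sp(T)]."""
    # F.softplus(z) computes sp(z) = log(1 + e^z).
    return (-F.softplus(-t_joint)).mean() - F.softplus(t_marginal).mean()
```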
Similar to NCE or InfoNCE in CPC, a noise-contrastive loss can also be used with DIM by maximizing:

$$\widehat{I}^{(infoNCE)}_{\omega,\psi}(X; E_{\psi}(X)) := \mathbb{E}_{\mathbb{P}}\left[T_{\psi,\omega}(x, E_{\psi}(x)) - \mathbb{E}_{\tilde{\mathbb{P}}}\left[\log \sum_{x'} e^{T_{\psi,\omega}(x', E_{\psi}(x))}\right]\right].$$
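In batch form this reduces to a softmax over candidate pairs; a sketch assuming an $(N, N)$ score matrix whose diagonal holds the positive pairs (again, my own naming):

```python
import torch

def infonce_objective(scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE objective to maximize, from an (N, N) score matrix.

    scores[i, j] = T(x_j, E(x_i)); the diagonal holds positive pairs and
    each row's off-diagonal entries act as the negatives x'.
    """
    positive = scores.diagonal()               # T(x, E(x))
    log_norm = torch.logsumexp(scores, dim=1)  # log sum_{x'} exp(T(x', E(x)))
    return (positive - log_norm).mean()
```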
Maximizing mutual information between local features and global features: instead of the full input $X$, the global summary $E_{\psi}(X)$ is encouraged to have high MI with every entry $C^{(i)}_{\psi}(X)$ of an $M \times M$ local feature map:

$$(\hat{\omega}, \hat{\psi})_{L} = \arg\max_{\omega, \psi} \frac{1}{M^{2}} \sum_{i=1}^{M^{2}} \widehat{I}_{\omega,\psi}(C^{(i)}_{\psi}(X); E_{\psi}(X)).$$
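A minimal sketch of how local-global pairs can be scored with a dot-product discriminator, in the spirit of the paper's encode-and-dot-product architecture; the class name, projection sizes, and einsum layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LocalDIMScore(nn.Module):
    """Scores every (local location, global vector) pair via a dot product."""

    def __init__(self, local_dim: int, global_dim: int, embed_dim: int = 128):
        super().__init__()
        self.local_proj = nn.Conv2d(local_dim, embed_dim, kernel_size=1)
        self.global_proj = nn.Linear(global_dim, embed_dim)

    def forward(self, local_map: torch.Tensor, global_vec: torch.Tensor) -> torch.Tensor:
        # local_map: (B, C, M, M) feature maps; global_vec: (B, D) summaries.
        l = self.local_proj(local_map)    # (B, E, M, M)
        g = self.global_proj(global_vec)  # (B, E)
        # scores[k, b, i, j] = <g[k], l[b, :, i, j]>; entries with k == b are
        # positive (same-image) pairs, while k != b supplies the negatives.
        return torch.einsum("ke,beij->kbij", g, l)
```

The resulting scores can then be fed to any of the estimators above (DV, JSD, or InfoNCE).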
All three objectives, global MI maximization, local MI maximization, and prior matching (which adversarially pushes the encoder's output distribution toward a prior $\mathbb{V}$, as in adversarial autoencoders), can be used together as the complete objective for Deep InfoMax (DIM):

$$\max_{\omega_{1}, \omega_{2}, \psi} \left( \alpha \widehat{I}_{\omega_{1},\psi}(X; E_{\psi}(X)) + \frac{\beta}{M^{2}} \sum_{i=1}^{M^{2}} \widehat{I}_{\omega_{2},\psi}(X^{(i)}; E_{\psi}(X)) \right) + \gamma \min_{\psi} \max_{\phi} \widehat{D}_{\phi}(\mathbb{V} \,\|\, \mathbb{U}_{\psi,\mathbb{P}}),$$

where $\omega_{1}$ and $\omega_{2}$ are the discriminator parameters of the global and local terms, $\mathbb{U}_{\psi,\mathbb{P}}$ is the distribution of encoder outputs, and $\alpha$, $\beta$, $\gamma$ weight the three terms.
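Putting it together, a sketch of the weighted combination; the weight values below are illustrative placeholders (in the paper, DIM(G) and DIM(L) correspond to different $(\alpha, \beta, \gamma)$ settings):

```python
import torch

def dim_total_objective(global_mi: torch.Tensor,
                        local_mi_per_location: torch.Tensor,
                        prior_term: torch.Tensor,
                        alpha: float = 0.5, beta: float = 1.0,
                        gamma: float = 0.1) -> torch.Tensor:
    """Weighted sum of the three DIM terms (weights are illustrative).

    global_mi:             scalar global MI estimate.
    local_mi_per_location: per-location MI estimates, shape (M*M,).
    prior_term:            scalar prior-matching term (the encoder's side
                           of the adversarial game).
    """
    return alpha * global_mi + beta * local_mi_per_location.mean() + gamma * prior_term
```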
Classification accuracy (top 1) results on CIFAR10 and CIFAR100.
Classification accuracy (top 1) results on Tiny ImageNet and STL-10.
In general, DIM with the local objective, DIM(L), outperformed all models presented here by a significant margin on all datasets.
Among DV, JSD & InfoNCE, InfoNCE tends to perform best.
Comparisons of DIM with Contrastive Predictive Coding (CPC).
DIM(L) is competitive with CPC using InfoNCE.
This is an early paper on self-supervised learning; many ideas are explored here.
[2019 ICLR] [Deep InfoMax (DIM)] Learning Deep Representations by Mutual Information Estimation and Maximization
Sik-Ho Tsang. Brief Review — Learning Deep Representations by Mutual Information Estimation and Maximization.