ganler / ResearchReading

General system research material (not limited to papers) reading notes.
GNU General Public License v3.0

UC Berkeley -- CS294-158 20Spring: Deep Unsupervised Learning #3

Closed ganler closed 3 years ago

ganler commented 3 years ago

If you ask me why a sys guy would care about math stuff... I would say that unsupervised learning is a trend and the future of industry (money-saving...). Learning DUL provides ideas and challenges for building future big data systems.

Course Site: https://sites.google.com/view/berkeley-cs294-158-sp20/home

ganler commented 3 years ago

Lecture 7: Self-Supervised Learning

YOUTUBE

Reconstruct From A Corrupted Version

Denoising Autoencoder

image

Loss Func:

image

Stacked Denoising Autoencoder: Add noise to the internal feature vector.
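The denoising objective can be sketched in a few lines of numpy (a toy, assuming additive Gaussian corruption and an MSE reconstruction loss; `corrupt` and `dae_loss` are illustrative names, not from the course):

```python
import numpy as np

def corrupt(x, noise_std=0.3, rng=None):
    """Corrupt the clean input with additive Gaussian noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return x + rng.normal(0.0, noise_std, size=x.shape)

def dae_loss(x_clean, x_recon):
    """Denoising objective: reconstruct the CLEAN input, not the corrupted one."""
    return np.mean((x_clean - x_recon) ** 2)

x = np.ones((4, 8))       # toy batch of "images"
x_tilde = corrupt(x)      # corrupted view that would be fed to the encoder f
# a real model would compute x_hat = g(f(x_tilde)); here we just score x_tilde itself
loss = dae_loss(x, x_tilde)
```

The key point the code makes explicit: the loss compares the reconstruction against the clean `x`, so the network cannot solve the task by copying its input.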

Context Encoder

Mask out a rectangular region from an image. => Reconstruct the actual image.

image

image

image
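The masking step itself is simple bookkeeping; a numpy sketch (hypothetical helper, not the paper's code) showing how the masked image and the region-restricted reconstruction loss fit together:

```python
import numpy as np

def mask_region(img, top, left, h, w, fill=0.0):
    """Mask out a rectangular region; return (masked image, boolean mask)."""
    masked = img.copy()
    mask = np.zeros_like(img, dtype=bool)
    mask[top:top + h, left:left + w] = True
    masked[mask] = fill
    return masked, mask

img = np.arange(36, dtype=float).reshape(6, 6)
masked, mask = mask_region(img, 2, 2, 2, 2)
# the context encoder is trained to in-paint img[mask] given the rest of `masked`
recon_loss = np.mean((img[mask] - masked[mask]) ** 2)
```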

Predicting One View From Another

L => ab: predict the ab color channels from the L (lightness) channel of a Lab image, i.e., colorization.

image

Visual Common Sense Tasks

Relative Position of Image Patches

Center patch + Other Patch => The relative position of the "other patch".

image
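A numpy sketch of how training pairs could be built for this pretext task (3x3 grid; `relative_position_example` is an illustrative name; real implementations also add gaps/jitter between patches so the network can't exploit trivial boundary cues):

```python
import numpy as np

def patch_grid(img, patch):
    """Split a square image into a 3x3 grid of patch x patch tiles."""
    return [img[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            for r in range(3) for c in range(3)]

def relative_position_example(img, patch, neighbor_idx):
    """Return (center patch, neighbor patch, label) for the pretext task.
    neighbor_idx in 0..7 indexes the 8 cells around the center; the label
    IS the relative position the network must predict."""
    patches = patch_grid(img, patch)
    cells = [i for i in range(9) if i != 4]   # the 8 non-center cells
    return patches[4], patches[cells[neighbor_idx]], neighbor_idx

img = np.arange(81, dtype=float).reshape(9, 9)
center, other, label = relative_position_example(img, 3, 0)  # 0 = top-left
```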

Solving Jigsaw Puzzles.

image

Rotation
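The rotation pretext task: rotate each image by 0/90/180/270 degrees and train a classifier to predict which rotation was applied. Generating the self-supervised labels is a one-liner in numpy:

```python
import numpy as np

def rotation_examples(img):
    """The 4 rotated views and their labels (0: 0°, 1: 90°, 2: 180°, 3: 270°)."""
    return [(np.rot90(img, k), k) for k in range(4)]

img = np.arange(16, dtype=float).reshape(4, 4)
views = rotation_examples(img)   # 4 (view, label) training examples per image
```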

Predict neighboring context

Word2Vec.
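Word2Vec's skip-gram view of "predict neighboring context" boils down to generating (center, context) training pairs from a sliding window; a minimal sketch:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs for word2vec-style context prediction."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "mat"], window=1)
```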

Contrastive Learning

CPC: Contrastive Predictive Coding

Associated data sequence (audio, image, whatever) => predict the future data.

image

We have:

- Positive samples: grabbed from the raw data (maybe a crop from the raw image).
- Negative samples: unrelated data from the dataset (a crop from some other image in the dataset).

For each input (say, an image), we have (N-1) negative samples & 1 positive sample.

CPC uses an RNN encoder to encode the input sequence into a context vector (c_t) as the high-level feature (also called the slow feature).

image

The goal is to maximize the mutual information between c_t and z_positive while minimizing that between c_t and z_negative.
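This objective is usually written as the InfoNCE loss: a softmax cross-entropy where the positive's similarity score must beat the negatives'. A numpy sketch with toy 2-D features (illustrative names, dot-product similarity, scaled by a temperature):

```python
import numpy as np

def info_nce(c, z_pos, z_negs, temperature=0.1):
    """InfoNCE: cross-entropy of the positive against the negatives,
    with similarity = dot product / temperature. Positive sits at index 0."""
    z = np.vstack([z_pos[None, :], z_negs])
    logits = z @ c / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

c = np.array([1.0, 0.0])                        # context vector c_t
# loss is small when the positive aligns with c_t...
loss_easy = info_nce(c, np.array([1.0, 0.0]),
                     np.array([[0.0, 1.0], [-1.0, 0.0]]))
# ...and large when a negative aligns with c_t instead
loss_hard = info_nce(c, np.array([0.0, 1.0]),
                     np.array([[1.0, 0.0], [-1.0, 0.0]]))
```

Minimizing this loss is what maximizes a lower bound on the mutual information between c_t and the positive code.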

Instance Discrimination

Do classification at the instance level. (Every image is a class.)

MoCo & SimCLR.

Memory Bank(2018)

Challenge: it is impossible to give every image its own weight vector in a parametric softmax at real-world scale. (You don't have that much memory.)

But we can have a memory bank to store the feature vectors.

Pipeline: a batch of images -> feature map -> L2-normalized 128-dim vector (living on the 128-D unit sphere) -> non-parametric softmax classifier -> the probability of matching each instance; the per-image feature vectors are stored in the memory bank.
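The non-parametric softmax replaces class weights with the stored instance features themselves; a numpy sketch (`nonparam_softmax` is an illustrative name; the temperature value is an assumption of this toy):

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    """Project features onto the unit sphere."""
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def nonparam_softmax(feature, bank, temperature=0.07):
    """P(instance i | v) ∝ exp(v_i · v / τ), over all vectors in the memory bank."""
    logits = bank @ feature / temperature
    logits -= logits.max()                      # numerical stability
    e = np.exp(logits)
    return e / e.sum()

# a toy memory bank of 100 instances with 128-D unit-norm features
bank = l2_normalize(np.random.default_rng(0).normal(size=(100, 128)))
f = l2_normalize(bank[7] + 0.01)                # a query close to instance 7
probs = nonparam_softmax(f, bank)               # should peak at instance 7
```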

Problem: the stored features go stale — each was computed by the encoder at a different training step, so the bank's keys are not consistent with the current encoder.

MoCo(2020)

The task in training: Which key is responsible for the query?

image

The authors argue that building the dictionary rests on two requirements: 1. large — the dictionary must be big enough to cover the high-dimensional, continuous space well; 2. consistent — the keys must be encoded by the same or a similar encoder, so that query-key distances are comparable and meaningful.

E2E (the keys in the dictionary are encoded by the same up-to-date encoder as the query): good consistency, but the dictionary size is tied to the mini-batch size, so it cannot scale. (consistent but not large) Memory bank: (large but not consistent)

image

Task: Can we have both scalability + consistency?

2 networks: a query encoder (updated by backprop) and a momentum/key encoder (a slowly moving average of the query encoder).

At the start:

There are K keys (negative samples) in the momentum queue.

Pipeline:

- Augmentation: x => (q, k_+), two augmented views of the same image, where q is the query (maybe batched) and k_+ is its positive key.
- Compare the query with k_+ and the K keys in the queue using the InfoNCE (N-pair style) contrastive loss.
- Back prop: update the query encoder only; the momentum encoder is updated as a moving average of it, not by backprop.
- Finally, enqueue this batch of keys into the queue (and dequeue the oldest batch).
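The queue update itself is plain FIFO bookkeeping over a fixed-size dictionary of K keys; a numpy sketch (illustrative helper name):

```python
import numpy as np

def dequeue_and_enqueue(queue, keys):
    """FIFO dictionary update: drop the oldest batch, append the newest keys."""
    return np.concatenate([queue[len(keys):], keys], axis=0)

K, dim, batch = 8, 4, 2
queue = np.zeros((K, dim))          # K keys from previous (momentum-encoded) batches
new_keys = np.ones((batch, dim))    # keys from the current batch
queue = dequeue_and_enqueue(queue, new_keys)
```

Because the queue decouples the dictionary size K from the mini-batch size, the dictionary can be large while each key is still cheap to produce.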

For the momentum encoder: its parameters θ_k are not updated by backprop. Instead, θ_k ← m·θ_k + (1 − m)·θ_q, an exponential moving average of the query encoder's θ_q, with m close to 1 (e.g., 0.999) so that the keys in the queue stay consistent.
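A numpy sketch of the momentum (EMA) update θ_k ← m·θ_k + (1 − m)·θ_q, applied per parameter tensor:

```python
import numpy as np

def momentum_update(theta_k, theta_q, m=0.999):
    """EMA update of the momentum encoder's parameters toward the query encoder's."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

theta_q = [np.ones((2, 2))]                     # query encoder parameters
theta_k = [np.zeros((2, 2))]                    # momentum encoder parameters
theta_k = momentum_update(theta_k, theta_q, m=0.9)   # m lowered for the toy
```

With m near 1, θ_k drifts slowly, which is exactly what keeps the queued keys' encodings mutually consistent.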