Semantic Inpainting results on held-out images for context encoder trained using reconstruction and adversarial loss.
In this story, Context Encoders: Feature Learning by Inpainting (Context Encoders), by University of California, Berkeley, is reviewed.
This is a paper in 2016 CVPR with over 3000 citations.
Context Encoders for Image Generation.
The overall architecture is a simple encoder-decoder pipeline:
It is found to be important to connect the encoder and the decoder through a channel-wise fully-connected layer, which allows each unit in the decoder to reason about the entire image content.
The channel-wise fully-connected layer or feed-forward layer is always important for domain adaptation!!!
The encoder is derived from the AlexNet architecture.
In contrast to AlexNet, the proposed model is not trained for ImageNet classification; rather, the network is trained for context prediction “from scratch” with randomly initialized weights.
The latent feature dimension is $6\times 6\times 256 = 9216$ for both encoder and decoder. Fully connecting the encoder and decoder would result in an explosion in the number of parameters (over 100M!), to the extent that efficient training on current GPUs would be difficult.
Take the number of parameters into account!!!
If the input layer has $m$ feature maps of size $n\times n$, this layer will output $m$ feature maps of dimension $n\times n$.
However, unlike a fully-connected layer, it has no parameters connecting different feature maps and only propagates information within feature maps.
Thus, the number of parameters in this channel-wise fully-connected layer is $mn^{4}$, compared to $m^{2}n^{4}$ parameters in a fully-connected layer (ignoring the bias term).
This is followed by a stride 1 convolution to propagate information across channels.
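Below is a minimal PyTorch sketch of such a channel-wise fully-connected layer (the class name, shapes, and initialization are my own assumptions; the paper only specifies the idea and the $mn^{4}$ parameter count):

```python
import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    """Fully-connected layer applied independently to each feature map.

    Each of the m channels gets its own (n*n x n*n) weight matrix, so the
    parameter count is m * n^4 instead of m^2 * n^4 for a dense
    fully-connected layer across all channels (ignoring bias).
    """
    def __init__(self, channels: int, spatial: int):
        super().__init__()
        self.weight = nn.Parameter(
            0.01 * torch.randn(channels, spatial * spatial, spatial * spatial)
        )
        self.bias = nn.Parameter(torch.zeros(channels, spatial * spatial))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        flat = x.reshape(b, c, h * w)                          # (B, m, n*n)
        out = torch.einsum("bci,cij->bcj", flat, self.weight)  # per-channel FC
        out = out + self.bias
        return out.reshape(b, c, h, w)

# For the paper's 6x6x256 bottleneck this is 256 * 6^4 ≈ 0.33M parameters,
# orders of magnitude fewer than a dense fully-connected layer over all channels.
layer = ChannelWiseFC(channels=256, spatial=6)
print(layer(torch.randn(1, 256, 6, 6)).shape)  # torch.Size([1, 256, 6, 6])
```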
Low-Rank RNN???
The channel-wise fully-connected layer is followed by a series of five up-convolutional layers, each with a rectified linear unit (ReLU) activation function.
An up-convolution is simply a convolution that results in a higher-resolution image. It can be understood as upsampling followed by convolution, or as convolution with fractional stride, i.e. deconvolution.
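A short PyTorch sketch of the two equivalent readings of an up-convolution (the channel counts and kernel sizes below are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Reading 1: upsampling followed by an ordinary convolution.
upconv_a = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(256, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

# Reading 2: convolution with fractional stride (transposed convolution).
upconv_b = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 8, 8)
print(upconv_a(x).shape, upconv_b(x).shape)  # both: (1, 128, 16, 16)
```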
There are two losses: a reconstruction loss and an adversarial loss.
The reconstruction (L2) loss is responsible for capturing the overall structure of the missing region and its coherence with the context, but tends to average together the multiple modes in predictions.
For each ground truth image $x$, the proposed context encoder $F$ produces an output $F(x)$.
Let $\hat{M}$ be a binary mask corresponding to the dropped image region with a value of 1 wherever a pixel was dropped and 0 for input pixels.
The reconstruction loss is a normalized masked L2 distance:
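$$L_{rec}(x) = \left\lVert \hat{M} \odot \big( x - F\big((1 - \hat{M}) \odot x\big) \big) \right\rVert_{2}^{2}$$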
where $\odot$ is the element-wise product operation.
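One way this normalized masked L2 loss might be implemented (the paper does not spell out the normalization, so dividing by the number of dropped pixels is an assumption):

```python
import torch

def reconstruction_loss(x, x_hat, mask):
    """Masked L2 loss between the ground truth x and the prediction x_hat.

    x     : ground-truth image, shape (B, C, H, W)
    x_hat : context encoder output F((1 - M) * x), same shape as x
    mask  : binary mask M_hat (1 = dropped pixel, 0 = input pixel), broadcastable to x
    """
    diff = mask * (x - x_hat)
    # Normalize by the number of masked entries so the loss scale is
    # independent of how much of the image was dropped.
    return diff.pow(2).sum() / mask.expand_as(x).sum().clamp(min=1)
```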
There is no significant difference between L1 and L2 losses; both often fail to capture high-frequency details and prefer a blurry solution over highly accurate textures.
Finally, the L2 loss is used; it prefers to predict the mean of the distribution, which minimizes the mean pixel-wise error but results in a blurry averaged image.
The reconstruction often prefers low-frequency parts, which are easier to learn.
The adversarial loss tries to make the prediction look real and has the effect of picking a particular mode from the distribution.
Only the generator (not the discriminator) is conditioned on context when training with GANs. The adversarial loss for context encoders, $L_{adv}$, is:
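$$L_{adv} = \max_{D}\; \mathbb{E}_{x \in \mathcal{X}} \Big[ \log D(x) + \log\big( 1 - D\big( F\big((1 - \hat{M}) \odot x\big) \big) \big) \Big]$$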
Both $F$ and $D$ are optimized jointly using alternating SGD.
The overall loss function is:
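$$L = \lambda_{rec} L_{rec} + \lambda_{adv} L_{adv}$$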
Currently, an adversarial loss is used only for the inpainting experiments, as AlexNet training did not converge with the adversarial loss.
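A minimal PyTorch-style sketch of one alternating SGD step over the joint loss (the function names, a discriminator that outputs probabilities, and the λ values are my assumptions for illustration):

```python
import torch

lambda_rec, lambda_adv = 0.999, 0.001   # illustrative weighting
bce = torch.nn.BCELoss()                # D is assumed to end with a sigmoid

def train_step(F, D, opt_F, opt_D, x, mask):
    context = (1 - mask) * x            # image with the region dropped
    x_hat = F(context)                  # context encoder prediction

    # --- discriminator step: real vs. generated ---
    opt_D.zero_grad()
    d_real = D(x)
    d_fake = D(x_hat.detach())
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # --- generator step: reconstruction + adversarial ---
    opt_F.zero_grad()
    l_rec = (mask * (x - x_hat)).pow(2).sum() / mask.sum().clamp(min=1)
    l_adv = bce(D(x_hat), torch.ones_like(d_fake))
    (lambda_rec * l_rec + lambda_adv * l_adv).backward()
    opt_F.step()
```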
Three different strategies are proposed: central region, random block, and random region.
The simplest such shape is the central square patch in the image.
A number of smaller, possibly overlapping blocks, together covering up to 1/4 of the image, are removed.
Arbitrary shapes are removed from images.
Random region dropout is used for all the feature-based experiments.
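A small NumPy sketch of how the central-region and random-block masks might be generated (the image size, block size, and exact sampling scheme are assumptions; the paper only fixes the masked fraction at roughly 1/4):

```python
import numpy as np

def central_mask(size=128, frac=0.25):
    """Central square patch covering roughly `frac` of the image area."""
    m = np.zeros((size, size), dtype=np.float32)
    side = int(size * np.sqrt(frac))
    start = (size - side) // 2
    m[start:start + side, start:start + side] = 1.0
    return m

def random_block_mask(size=128, max_frac=0.25, block=32, rng=np.random):
    """Drop smaller, possibly overlapping blocks until roughly `max_frac`
    of the pixels are masked."""
    m = np.zeros((size, size), dtype=np.float32)
    while m.mean() < max_frac:
        y, x = rng.randint(0, size - block, size=2)
        m[y:y + block, x:x + block] = 1.0
    return m
```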
Context Encoder for Inpainting.
Context Encoder for Feature Learning.
We cannot let the model quickly fall into local optima. We should carefully select unsupervised learning tasks that are suitable for the model to learn useful representations, and the task difficulty should not be too great, or learning will simply be spoiled!!!
Comparison with Content-Aware Fill (Photoshop feature based on [2]) on held-out images.
The proposed inpainting generally performs well, as shown in the first figure of this story.
If a region can be filled with low-level textures, texture synthesis methods, such as [2, 11], can often perform better.
Semantic Inpainting using different methods on held-out images.
Semantic Inpainting accuracy for Paris StreetView dataset on held-out images.
Quantitative comparison for classification, detection and semantic segmentation.
AlexNet trained with the reconstruction loss is used for feature learning.
For semantic segmentation, using the proposed context encoder for pretraining (30.0%) outperforms a randomly initialized network (19.8%) as well as a plain autoencoder (25.2%), which is trained simply to reconstruct its full input.
Better than AutoEncoder!!!
Sik-Ho Tsang. Review — Context Encoders: Feature Learning by Inpainting.