NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review -- Context Encoders: Feature Learning by Inpainting. #122

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — Context Encoders: Feature Learning by Inpainting.

NorbertZheng commented 1 year ago

Overview

Figure: Semantic Inpainting results on held-out images for the context encoder trained using reconstruction and adversarial loss.

In this story, Context Encoders: Feature Learning by Inpainting (Context Encoders), by University of California, Berkeley, is reviewed.

This is a paper in 2016 CVPR with over 3000 citations.

NorbertZheng commented 1 year ago

Context Encoders for Image Generation

Figure: Context Encoders for Image Generation.

Pipeline

The overall architecture is a simple encoder-decoder pipeline: the encoder captures the context of an image into a compact latent feature representation, and the decoder uses that representation to produce the missing image content.

It is found to be important to connect the encoder and the decoder through a channel-wise fully-connected layer, which allows each unit in the decoder to reason about the entire image content.
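As a rough illustration, a minimal PyTorch-style skeleton of this pipeline might look as follows. This is a sketch under assumptions: the layer configuration mimics the AlexNet-style encoder and five up-convolutional decoder described below, not the exact published architecture, and the `bottleneck` slot stands in for the channel-wise fully-connected layer sketched later.

```python
import torch
import torch.nn as nn

class ContextEncoderPipeline(nn.Module):
    """Encoder-decoder sketch: masked image -> latent feature -> missing content."""

    def __init__(self, bottleneck: nn.Module = nn.Identity()):
        super().__init__()
        # Encoder: AlexNet-style conv stack, 227x227x3 -> 6x6x256 latent feature.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, stride=2),
        )
        # Bottleneck: placeholder for the channel-wise fully-connected layer.
        self.bottleneck = bottleneck
        # Decoder: five up-convolutions, 6x6x256 -> 192x192x3.
        dims = [256, 128, 64, 32, 16, 3]
        layers = []
        for cin, cout in zip(dims[:-1], dims[1:]):
            layers += [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.ReLU()]
        self.decoder = nn.Sequential(*layers[:-1])  # drop the final ReLU

    def forward(self, x):
        return self.decoder(self.bottleneck(self.encoder(x)))

net = ContextEncoderPipeline()
out = net(torch.randn(1, 3, 227, 227))  # -> torch.Size([1, 3, 192, 192])
```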

NorbertZheng commented 1 year ago

A channel-wise fully-connected layer or feed-forward layer is always important for domain adaptation!!!

NorbertZheng commented 1 year ago

Encoder

The encoder is derived from AlexNet (its first five convolutional layers and the following pooling layer).

In contrast to AlexNet, the proposed model is not trained for ImageNet classification; rather, the network is trained for context prediction “from scratch” with randomly initialized weights.

The latent feature dimension is $6\times6\times256 = 9216$ for both encoder and decoder. Fully connecting the encoder and decoder would result in an explosion in the number of parameters (over 100M!), to the extent that efficient training on current GPUs would be difficult.

NorbertZheng commented 1 year ago

Take the number of parameters into account!!!

NorbertZheng commented 1 year ago

Channel-Wise Fully-Connected Layer

If the input layer has $m$ feature maps of size $n\times n$, this layer will output $m$ feature maps of dimension $n\times n$.

However, unlike a fully-connected layer, it has no parameters connecting different feature maps; it only propagates information within each feature map.

Thus, the number of parameters in this channel-wise fully-connected layer is $mn^{4}$, compared to $m^{2}n^{4}$ parameters in a fully-connected layer (ignoring the bias term).

This is followed by a stride 1 convolution to propagate information across channels.
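A minimal sketch of such a layer, assuming a PyTorch implementation (the weight initialization and the 1x1 kernel for the cross-channel convolution are my assumptions):

```python
import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    """Channel-wise fully-connected layer: each of the m feature maps gets its
    own (n^2 x n^2) linear map, so there are no parameters connecting different
    feature maps. A stride-1 convolution then propagates information across
    channels."""

    def __init__(self, m: int, n: int):
        super().__init__()
        # m weight matrices of size n^2 x n^2 -> m * n^4 parameters in total,
        # versus m^2 * n^4 for a full fully-connected layer.
        self.weight = nn.Parameter(torch.randn(m, n * n, n * n) * (n * n) ** -0.5)
        self.bias = nn.Parameter(torch.zeros(m, n * n))
        self.cross = nn.Conv2d(m, m, kernel_size=1, stride=1)  # mix channels

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)  # (B, m, n^2)
        out = torch.einsum('bci,cij->bcj', flat, self.weight) + self.bias
        return self.cross(out.view(b, c, h, w))

# The paper's bottleneck sizes: m = 256 maps of size n = 6, i.e.
# 256 * 6^4 = 331,776 channel-wise parameters vs. 256^2 * 6^4 ~ 84.9M full-FC.
layer = ChannelWiseFC(m=256, n=6)
y = layer(torch.randn(2, 256, 6, 6))  # -> torch.Size([2, 256, 6, 6])
```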

NorbertZheng commented 1 year ago

Low-Rank RNN???

NorbertZheng commented 1 year ago

Decoder

The channel-wise fully-connected layer is followed by a series of five up-convolutional layers, each with a rectified linear unit (ReLU) activation function.

An up-convolution is simply a convolution that results in a higher-resolution image. It can be understood as upsampling followed by convolution, or as convolution with fractional stride, i.e. deconvolution.
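The two readings of an up-convolution can be checked with a quick PyTorch sketch (the kernel sizes and channel counts here are arbitrary illustrations, not the paper's values):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 6, 6)

# (a) Convolution with fractional stride, i.e. a transposed convolution.
upconv = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
print(upconv(x).shape)  # torch.Size([1, 128, 12, 12])

# (b) Upsampling followed by an ordinary convolution.
up_then_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(256, 128, kernel_size=3, padding=1),
)
print(up_then_conv(x).shape)  # torch.Size([1, 128, 12, 12])
```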

NorbertZheng commented 1 year ago

Loss Function

There are two losses: a reconstruction loss and an adversarial loss.

Reconstruction Loss

The reconstruction (L2) loss is responsible for capturing the overall structure of the missing region and its coherence with the surrounding context, but tends to average together the multiple modes in predictions.

For each ground truth image $x$, the proposed context encoder $F$ produces an output $F(x)$.

Let $\hat{M}$ be a binary mask corresponding to the dropped image region with a value of 1 wherever a pixel was dropped and 0 for input pixels.

The reconstruction loss is a normalized masked L2 distance:

$$L_{rec}(x) = \left\| \hat{M} \odot \left( x - F\left( (1 - \hat{M}) \odot x \right) \right) \right\|_{2}^{2},$$

where $\odot$ is the element-wise product operation.

There is no significant difference between L1 and L2 losses: both often fail to capture high-frequency details and prefer a blurry solution over highly accurate textures.

Finally, the L2 loss is used. It prefers to predict the mean of the distribution, since this minimizes the mean pixel-wise error, but the result is a blurry averaged image.
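A minimal sketch of this masked L2 loss (the normalization by the number of dropped pixels is one assumed reading of "normalized"; tensor shapes are assumptions):

```python
import torch

def reconstruction_loss(x, f_out, mask):
    """Masked L2 loss.

    x:     ground-truth images, shape (B, C, H, W)
    f_out: network output F((1 - mask) * x), same shape as x
    mask:  M-hat, 1 on dropped pixels and 0 on input pixels, shape (B, 1, H, W)
    """
    diff = mask * (x - f_out)
    # Normalize by the number of dropped pixels (assumption).
    return diff.pow(2).sum() / mask.sum().clamp(min=1.0)

# Usage: drop the masked region from the input, then score the prediction.
x = torch.rand(4, 3, 128, 128)
mask = torch.zeros(4, 1, 128, 128)
mask[:, :, 32:96, 32:96] = 1.0   # central region dropped
f_out = torch.rand_like(x)       # stand-in for F((1 - mask) * x)
loss = reconstruction_loss(x, f_out, mask)
```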

NorbertZheng commented 1 year ago

The reconstruction loss often prefers the low-frequency parts, which are easier to learn.

NorbertZheng commented 1 year ago

The Adversarial Loss

The adversarial loss tries to make the prediction look real and has the effect of picking a particular mode from the distribution.

Only the generator (not the discriminator) is conditioned on the context when training with a GAN. The adversarial loss for context encoders, $L_{adv}$, is:

$$L_{adv} = \max_{D} \; \mathbb{E}_{x \in \mathcal{X}} \left[ \log D(x) + \log\left( 1 - D\left( F\left( (1 - \hat{M}) \odot x \right) \right) \right) \right],$$

Both $F$ and $D$ are optimized jointly using alternating SGD.
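A hedged sketch of one alternating-SGD step under this objective (the optimizer choice, the `eps` for numerical stability, and $D$ producing sigmoid probabilities are all assumptions):

```python
import torch

def adversarial_step(net_F, net_D, opt_F, opt_D, x, mask, eps=1e-8):
    """One alternating update; mask is 1 on dropped pixels, 0 elsewhere."""
    fake = net_F((1 - mask) * x)  # only the generator is conditioned on context

    # Discriminator step: tell real images from inpainted ones.
    opt_D.zero_grad()
    d_loss = -(torch.log(net_D(x) + eps).mean()
               + torch.log(1 - net_D(fake.detach()) + eps).mean())
    d_loss.backward()
    opt_D.step()

    # Generator step: push D toward calling the prediction real.
    opt_F.zero_grad()
    g_loss = -torch.log(net_D(fake) + eps).mean()
    g_loss.backward()
    opt_F.step()
    return d_loss.item(), g_loss.item()
```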

NorbertZheng commented 1 year ago

Joint Loss

The overall loss function is:

$$L = \lambda_{rec} L_{rec} + \lambda_{adv} L_{adv}.$$

Currently, an adversarial loss is used only for the inpainting experiments, as AlexNet training with the adversarial loss did not converge.
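As a worked instance, the joint objective is just a weighted sum of the two terms (the $\lambda$ values below are assumptions for illustration, not taken from the text above):

```python
# Weighted sum of reconstruction and adversarial terms (lambda values assumed).
lambda_rec, lambda_adv = 0.999, 0.001
loss = lambda_rec * rec_loss + lambda_adv * adv_loss
loss.backward()
```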

NorbertZheng commented 1 year ago

Region Masks

Figure: Three different strategies are proposed: central region, random block, and random region.

Central Region

The simplest such shape is the central square patch in the image.

Random Block

A number of smaller, possibly overlapping masks, covering up to 1/4 of the image, are removed.

Random Region

Arbitrary shapes are removed from images.

The random region dropout is used for all the feature-based experiments.
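Hypothetical generators for two of the three strategies (the image side length, block sizes, and block counts are assumptions; the random-region strategy needs a source of arbitrary shapes and is only stubbed in a comment):

```python
import torch

def central_mask(s: int) -> torch.Tensor:
    """Central square patch covering 1/4 of an s x s image (1 = dropped)."""
    m = torch.zeros(1, s, s)
    q = s // 4
    m[:, q:s - q, q:s - q] = 1.0  # (s/2)^2 = s^2 / 4 pixels dropped
    return m

def random_block_mask(s: int, n_blocks: int = 5, max_frac: float = 0.25) -> torch.Tensor:
    """Smaller, possibly overlapping blocks covering up to max_frac of the image."""
    m = torch.zeros(1, s, s)
    for _ in range(n_blocks):
        if m.sum() >= max_frac * s * s:
            break
        bs = int(torch.randint(s // 8, s // 4, (1,)))
        y = int(torch.randint(0, s - bs, (1,)))
        x = int(torch.randint(0, s - bs, (1,)))
        m[:, y:y + bs, x:x + bs] = 1.0
    return m

# Random region: drop arbitrary shapes instead of rectangles, e.g. sampled
# from a pool of segmentation-like silhouettes (not sketched here).
```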

NorbertZheng commented 1 year ago

Two CNN Architectures

CNN for Inpainting

Figure: Context Encoder for Inpainting.

CNN for Feature Learning

Figure: Context Encoder for Feature Learning.

NorbertZheng commented 1 year ago

We cannot let the model quickly fall into local optima. We should carefully select unsupervised learning tasks that are suitable for the model to learn useful representations, and the difficulty should not be too great, or the tasks will simply be spoiled!!!

NorbertZheng commented 1 year ago

Inpainting Results

Figure: Comparison with Content-Aware Fill (Photoshop feature based on [2]) on held-out images.

The proposed inpainting generally performs well, as shown in the first figure of this story.

If a region can be filled with low-level textures, texture synthesis methods, such as [2, 11], can often perform better.

Figure: Semantic Inpainting using different methods on held-out images.

Figure: Semantic Inpainting accuracy for the Paris StreetView dataset on held-out images.

NorbertZheng commented 1 year ago

Feature Learning Results

Figure: Quantitative comparison for classification, detection, and semantic segmentation.

AlexNet trained with the reconstruction loss is used for feature learning.

For semantic segmentation, using the proposed context encoders for pretraining (30.0%) outperforms a randomly initialized network (19.8%) as well as a plain autoencoder (25.2%), which is trained simply to reconstruct its full input.

NorbertZheng commented 1 year ago

Better than AutoEncoder!!!

NorbertZheng commented 1 year ago

Reference

Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, Alexei A. Efros. "Context Encoders: Feature Learning by Inpainting." CVPR 2016.