Semantic Inpainting results on held-out images for context encoder trained using reconstruction and adversarial loss.
In this story, Context Encoders: Feature Learning by Inpainting (Context Encoders), by University of California, Berkeley, is reviewed.
This is a paper in 2016 CVPR with over 3000 citations.
Context Encoders for Image Generation.
The overall architecture is a simple encoder-decoder pipeline:
It is found to be important to connect the encoder and the decoder through a channel-wise fully-connected layer, which allows each unit in the decoder to reason about the entire image content.
The channel-wise fully-connected layer or feed-forward layer is always important for domain adaptation!!!
The encoder is derived from the AlexNet architecture.
In contrast to AlexNet, the proposed model is not trained for ImageNet classification; rather, the network is trained for context prediction “from scratch” with randomly initialized weights.
The latent feature dimension is $6\times 6\times 256 = 9216$ for both encoder and decoder. Fully connecting the encoder and decoder would result in an explosion in the number of parameters (over 100M!), to the extent that efficient training on current GPUs would be difficult.
Take the number of parameters into account!!!
If the input layer has $m$ feature maps of size $n\times n$, this layer will output $m$ feature maps of dimension $n\times n$.
However, unlike a fully-connected layer, it has no parameters connecting different feature maps and only propagates information within feature maps.
Thus, the number of parameters in this channel-wise fully-connected layer is $mn^{4}$, compared to $m^{2}n^{4}$ parameters in a fully-connected layer (ignoring the bias term).
This is followed by a stride 1 convolution to propagate information across channels.
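Below is a minimal PyTorch sketch of such a channel-wise fully-connected layer (the class name, shapes, and initialization are my own assumptions; the paper only specifies the idea and the $mn^{4}$ parameter count):

```python
import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    """Fully-connected layer applied independently to each feature map.

    Each of the m channels gets its own (n*n x n*n) weight matrix, so the
    parameter count is m * n^4 instead of m^2 * n^4 for a dense
    fully-connected layer across all channels (ignoring bias).
    """
    def __init__(self, channels: int, spatial: int):
        super().__init__()
        self.weight = nn.Parameter(
            0.01 * torch.randn(channels, spatial * spatial, spatial * spatial)
        )
        self.bias = nn.Parameter(torch.zeros(channels, spatial * spatial))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        flat = x.reshape(b, c, h * w)                          # (B, m, n*n)
        out = torch.einsum("bci,cij->bcj", flat, self.weight)  # per-channel FC
        out = out + self.bias
        return out.reshape(b, c, h, w)

# For the paper's 6x6x256 bottleneck this is 256 * 6^4 ≈ 0.33M parameters,
# orders of magnitude fewer than a dense fully-connected layer over all channels.
layer = ChannelWiseFC(channels=256, spatial=6)
print(layer(torch.randn(1, 256, 6, 6)).shape)  # torch.Size([1, 256, 6, 6])
```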
Low-Rank RNN???
The channel-wise fully-connected layer is followed by a series of five up-convolutional layers, each with a rectified linear unit (ReLU) activation function.
An up-convolution is simply a convolution that results in a higher-resolution image. It can be understood as upsampling followed by convolution, or as convolution with fractional stride, i.e. deconvolution.
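A short PyTorch sketch of the two equivalent readings of an up-convolution (the channel counts and kernel sizes below are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Reading 1: upsampling followed by an ordinary convolution.
upconv_a = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(256, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

# Reading 2: convolution with fractional stride (transposed convolution).
upconv_b = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 8, 8)
print(upconv_a(x).shape, upconv_b(x).shape)  # both: (1, 128, 16, 16)
```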
There are two losses: a reconstruction loss and an adversarial loss.
The reconstruction (L2) loss is responsible for capturing the overall structure of the missing region and its coherence with the context, but tends to average together the multiple modes in predictions.
For each ground truth image $x$, the proposed context encoder $F$ produces an output $F(x)$.
Let $\hat{M}$ be a binary mask corresponding to the dropped image region with a value of 1 wherever a pixel was dropped and 0 for input pixels.
The reconstruction loss is a normalized masked L2 distance:
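$$L_{rec}(x) = \left\lVert \hat{M} \odot \big( x - F\big((1 - \hat{M}) \odot x\big) \big) \right\rVert_{2}^{2}$$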
where $\odot$ is the element-wise product operation.
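One way this normalized masked L2 loss might be implemented (the paper does not spell out the normalization, so dividing by the number of dropped pixels is an assumption):

```python
import torch

def reconstruction_loss(x, x_hat, mask):
    """Masked L2 loss between the ground truth x and the prediction x_hat.

    x     : ground-truth image, shape (B, C, H, W)
    x_hat : context encoder output F((1 - M) * x), same shape as x
    mask  : binary mask M_hat (1 = dropped pixel, 0 = input pixel), broadcastable to x
    """
    diff = mask * (x - x_hat)
    # Normalize by the number of masked entries so the loss scale is
    # independent of how much of the image was dropped.
    return diff.pow(2).sum() / mask.expand_as(x).sum().clamp(min=1)
```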
There is no significant difference between L1 and L2 losses; both often fail to capture high-frequency details and prefer a blurry solution over highly accurate textures.
Finally, the L2 loss is used; it prefers to predict the mean of the distribution, which minimizes the mean pixel-wise error but results in a blurry averaged image.
The reconstruction often prefers low-frequency parts, which are easier to learn.
The adversarial loss tries to make the prediction look real and has the effect of picking a particular mode from the distribution.
Only the generator (not the discriminator) is conditioned on context when training with GANs. The adversarial loss for context encoders, $L_{adv}$, is:
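$$L_{adv} = \max_{D}\; \mathbb{E}_{x \in \mathcal{X}} \Big[ \log D(x) + \log\big( 1 - D\big( F\big((1 - \hat{M}) \odot x\big) \big) \big) \Big]$$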
Both $F$ and $D$ are optimized jointly using alternating SGD.
The overall loss function is:
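$$L = \lambda_{rec} L_{rec} + \lambda_{adv} L_{adv}$$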
Currently, an adversarial loss is used only for the inpainting experiments, as AlexNet training did not converge with the adversarial loss.
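A minimal PyTorch-style sketch of one alternating SGD step over the joint loss (the function names, a discriminator that outputs probabilities, and the λ values are my assumptions for illustration):

```python
import torch

lambda_rec, lambda_adv = 0.999, 0.001   # illustrative weighting
bce = torch.nn.BCELoss()                # D is assumed to end with a sigmoid

def train_step(F, D, opt_F, opt_D, x, mask):
    context = (1 - mask) * x            # image with the region dropped
    x_hat = F(context)                  # context encoder prediction

    # --- discriminator step: real vs. generated ---
    opt_D.zero_grad()
    d_real = D(x)
    d_fake = D(x_hat.detach())
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # --- generator step: reconstruction + adversarial ---
    opt_F.zero_grad()
    l_rec = (mask * (x - x_hat)).pow(2).sum() / mask.sum().clamp(min=1)
    l_adv = bce(D(x_hat), torch.ones_like(d_fake))
    (lambda_rec * l_rec + lambda_adv * l_adv).backward()
    opt_F.step()
```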
Three different strategies are proposed: central region, random block, and random region.
The simplest such shape is the central square patch in the image.
A number of smaller, possibly overlapping blocks, together covering up to 1/4 of the image, are removed.
Arbitrary shapes are removed from images.
Random region dropout is used for all the feature-based experiments.
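A small NumPy sketch of how the central-region and random-block masks might be generated (the image size, block size, and exact sampling scheme are assumptions; the paper only fixes the masked fraction at roughly 1/4):

```python
import numpy as np

def central_mask(size=128, frac=0.25):
    """Central square patch covering roughly `frac` of the image area."""
    m = np.zeros((size, size), dtype=np.float32)
    side = int(size * np.sqrt(frac))
    start = (size - side) // 2
    m[start:start + side, start:start + side] = 1.0
    return m

def random_block_mask(size=128, max_frac=0.25, block=32, rng=np.random):
    """Drop smaller, possibly overlapping blocks until roughly `max_frac`
    of the pixels are masked."""
    m = np.zeros((size, size), dtype=np.float32)
    while m.mean() < max_frac:
        y, x = rng.randint(0, size - block, size=2)
        m[y:y + block, x:x + block] = 1.0
    return m
```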
Context Encoder for Inpainting.
Context Encoder for Feature Learning.
We cannot let the model quickly fall into local optima. We should carefully select unsupervised learning tasks that are suitable for the model to learn useful representations, and the task difficulty should not be too great, or learning will simply be spoiled!!!
Comparison with Content-Aware Fill (Photoshop feature based on [2]) on held-out images.
The proposed inpainting generally performs well, as shown in the first figure of this story.
If a region can be filled with low-level textures, texture synthesis methods, such as [2, 11], can often perform better.
Semantic Inpainting using different methods on held-out images.
Semantic Inpainting accuracy for Paris StreetView dataset on held-out images.
Quantitative comparison for classification, detection and semantic segmentation.
AlexNet trained with the reconstruction loss is used for feature learning.
For semantic segmentation, using the proposed context encoder for pretraining (30.0%) outperforms a randomly initialized network (19.8%) as well as a plain autoencoder (25.2%), which is trained simply to reconstruct its full input.
Better than AutoEncoder!!!
Sik-Ho Tsang. Review — Context Encoders: Feature Learning by Inpainting.