Colorization as Pretext Task in Self-Supervised Learning, Outperforms Context Prediction & Context Encoders.
Example Input Grayscale Photos and Output Colorizations.
In this story, Colorful Image Colorization, by University of California, Berkeley, is reviewed. It is a paper in 2016 ECCV with over 1900 citations.
Multi-Modality!!!
Colorful Image Colorization: Network Architecture.
Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm layer. The net has no pool layers; only spatial downsampling or upsampling is used between conv blocks when needed.
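As a concrete illustration, here is a minimal PyTorch sketch of one such conv block, assuming 3x3 convolutions and stride-2 downsampling on the last conv of a block (a sketch of the described structure, not the authors' released code):

```python
# Minimal sketch of one "conv layer" block as described above:
# 2-3 repeated conv+ReLU pairs, then a single BatchNorm; strided
# convolution stands in for pooling when downsampling is needed.
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs=2, stride=1, dilation=1):
    layers = []
    for i in range(n_convs):
        s = stride if i == n_convs - 1 else 1  # downsample only on the last conv
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3,
                      stride=s, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.BatchNorm2d(out_ch))
    return nn.Sequential(*layers)

# e.g. a VGG-style trunk over the grayscale L channel:
# block1 = conv_block(1, 64, stride=2)
# block2 = conv_block(64, 128, stride=2)
```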
Using the naïve L2 loss shown below is not robust to the inherent ambiguity and multimodal nature of the colorization problem:

$$L_2(\hat{Y}, Y) = \frac{1}{2}\sum_{h,w}\left\|Y_{h,w} - \hat{Y}_{h,w}\right\|_2^2$$

For example, if an object can take on several plausible colors, the optimal solution under L2 is their mean; multiple possible solutions lead to grayish, desaturated results.
The problem is instead treated as a multinomial classification.
Quantized ab color space with a grid size of 10.
The ab output space is quantized into bins with grid size 10, keeping the $Q=313$ values that are in-gamut, as shown above.
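For intuition, here is a rough sketch of how the in-gamut bins can be recovered (assuming `skimage`; the exact count depends on the gamut test used, so treat this as illustrative rather than the paper's exact procedure):

```python
# Lay a grid of size 10 over ab space and keep the bins that survive a
# Lab -> RGB -> Lab round trip for at least one lightness value L.
import numpy as np
from skimage import color

ab_grid = np.arange(-110, 120, 10)                  # candidate bin centers
bins = np.array([(a, b) for a in ab_grid for b in ab_grid], dtype=float)

in_gamut = np.zeros(len(bins), dtype=bool)
for L in range(0, 101, 10):                         # test several lightness values
    lab = np.concatenate([np.full((len(bins), 1), float(L)), bins], axis=1)[None]
    rgb = color.lab2rgb(lab)                        # clips out-of-gamut colors
    back = color.rgb2lab(rgb)
    in_gamut |= np.abs(back - lab).max(axis=2)[0] < 1.0  # loose round-trip tolerance

quantized_ab = bins[in_gamut]                       # roughly the Q = 313 bins
```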
For a given input $X$, we learn a mapping $\hat{Z}=G(X)$ to a probability distribution over possible colors, $\hat{Z} \in [0,1]^{H \times W \times Q}$, where $Q$ is the number of quantized ab values.
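The output head of $G$ can be sketched in a few lines of PyTorch: a 1x1 convolution maps the final feature map to $Q$ logits per pixel, followed by a per-pixel softmax (the 256 input channels here are an assumption, not a figure from the paper):

```python
import torch.nn as nn

Q = 313
head = nn.Sequential(
    nn.Conv2d(256, Q, kernel_size=1),  # feature channels -> Q logits per pixel
    nn.Softmax(dim=1),                 # per-pixel distribution over the ab bins
)
```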
$Z$ is the vector converted from the ground truth color $Y$, using a soft-encoding scheme (i.e. not exactly one-hot, easier to learn meaningful embeddings!!!):
The 5-nearest neighbors to $Y_{h,w}$ in the output space are selected and weighted proportionally to their distance from the ground truth using a Gaussian kernel with $\sigma=5$.
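A numpy sketch of this soft-encoding step, reusing `quantized_ab` (the $(Q, 2)$ array of bin centers) from the sketch above:

```python
# Encode each ground-truth ab pixel over its 5 nearest bin centers,
# weighted by a Gaussian kernel (sigma = 5) and normalized to sum to 1.
import numpy as np

def soft_encode(ab_pixels, quantized_ab, k=5, sigma=5.0):
    # ab_pixels: (N, 2) ground-truth ab values -> (N, Q) soft labels Z
    d2 = ((ab_pixels[:, None, :] - quantized_ab[None, :, :]) ** 2).sum(-1)
    nn_idx = np.argsort(d2, axis=1)[:, :k]            # k nearest bins per pixel
    w = np.exp(-np.take_along_axis(d2, nn_idx, axis=1) / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)                 # normalize per pixel
    Z = np.zeros((len(ab_pixels), len(quantized_ab)))
    np.put_along_axis(Z, nn_idx, w, axis=1)
    return Z
```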
Thus, the multinomial cross-entropy loss is:

$$L_{cl}(\hat{Z}, Z) = -\sum_{h,w} v(Z_{h,w}) \sum_q Z_{h,w,q} \log \hat{Z}_{h,w,q}$$

where $v(\cdot)$ is a weighting term used to rebalance the loss based on color-class rarity. Each pixel is weighted by a factor $w$ based on its closest ab bin:

$$v(Z_{h,w}) = w_{q^*}, \quad q^* = \arg\max_q Z_{h,w,q}, \quad w \propto \left((1-\lambda)\tilde{p} + \frac{\lambda}{Q}\right)^{-1}, \quad \mathbb{E}_{\tilde{p}}[w] = 1$$

with $\lambda = 1/2$ and $\tilde{p}$ the smoothed empirical distribution of ab bins over the training set.
In brief, the class-imbalance problem is addressed by reweighting the loss of each pixel at train time based on the rarity of its color, as above.
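A numpy sketch of the rebalancing weights and the weighted loss; `p_smoothed` stands for the smoothed empirical bin distribution (illustrative, not the authors' implementation):

```python
import numpy as np

def rebalancing_weights(p_smoothed, lam=0.5):
    # w proportional to ((1 - lambda) * p~ + lambda / Q)^-1, with E_p~[w] = 1
    Q = len(p_smoothed)
    w = 1.0 / ((1 - lam) * p_smoothed + lam / Q)
    return w / (p_smoothed * w).sum()

def rebalanced_cross_entropy(Z, Z_hat, w):
    # Z, Z_hat: (N, Q) soft labels and predicted distributions; w: (Q,)
    v = w[Z.argmax(axis=1)]             # weight of each pixel's closest bin q*
    return -(v * (Z * np.log(Z_hat + 1e-10)).sum(axis=1)).mean()
```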
Finally, we map the probability distribution $\hat{Z}$ to color values $\hat{Y}$ with a function $\hat{Y} = H(\hat{Z})$. This mapping $H$ is described in Section 1.3 below.
Use classification loss to reduce the difficulty of the pretext task???
Effect of temperature parameter T on the annealed-mean output.
$H$ is defined to map the predicted distribution $\hat{Z}$ to a point estimate $\hat{Y}$ in ab space by taking the annealed mean:

$$H(Z_{h,w}) = \mathbb{E}\left[f_T(Z_{h,w})\right], \quad f_T(z) = \frac{\exp(\log z / T)}{\sum_q \exp(\log z_q / T)}$$
The temperature $T=0.38$, shown in the middle column of the above figure, captures the vibrancy of the mode while maintaining the spatial coherence of the mean.
(The use of a temperature $T$ in the Softmax can be traced back to Model Distillation.)
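A small numpy sketch of the annealed-mean decoding (illustrative):

```python
import numpy as np

def annealed_mean(Z_hat, quantized_ab, T=0.38):
    # Z_hat: (N, Q) predicted distributions -> (N, 2) ab point estimates
    logits = np.log(Z_hat + 1e-10) / T           # re-soften at temperature T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    f_T = np.exp(logits)
    f_T /= f_T.sum(axis=1, keepdims=True)        # f_T(z) = softmax(log(z) / T)
    return f_T @ quantized_ab                    # expectation over bin centers

# T -> 0 approaches the per-pixel mode; T = 1 recovers the distribution mean.
```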
Hence, the final system $F$ is the composition of CNN $G$, which produces a predicted distribution over all pixels, and the annealed-mean operation $H$, which produces a final prediction.
NOT end-to-end trainable!!!
Colorization results on 10k images in the ImageNet validation set.
The 1.3M images from the ImageNet training set are used for training, the first 10k images of the ImageNet validation set for validation, and a separate 10k validation images for testing.
A real vs. fake two-alternative forced choice experiment is run on Amazon Mechanical Turk (AMT): 40 participants were asked to click on the photo they believed contained fake colors.
The proposed full algorithm fooled participants on 32% of trials, significantly more often than all compared algorithms. These results validate the effectiveness of using both the classification loss and class rebalancing.
Class rebalancing is just like an adversarial loss with an inductive bias towards colorization!!!
Colorization quality is also tested by feeding the fake colorized images to a VGG classifier: if the classifier performs well, the colorizations are accurate.
Classifier performance drops from 68.3% to 52.7% after ablating colors from the input. After re-colorizing using our full method, the performance is improved to 56.0%.
Use machines, instead of humans, to check the generative performance.
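A hedged sketch of this machine check (torchvision assumed; the batch names are illustrative placeholders):

```python
# Measure top-1 accuracy of an off-the-shelf ImageNet classifier on the
# original, grayscale, and re-colorized versions of the same images.
import torch
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def top1_accuracy(images, labels):
    # images: (N, 3, 224, 224) ImageNet-normalized batch; labels: (N,)
    return (vgg(images).argmax(dim=1) == labels).float().mean().item()

# top1_accuracy(color_batch, labels)        # original colors
# top1_accuracy(gray3_batch, labels)        # grayscale stacked to 3 channels
# top1_accuracy(recolorized_batch, labels)  # output of the colorization model
```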
Applying the proposed method to legacy black and white photos.
The proposed model is still able to produce good colorizations, even though the low-level image statistics of the legacy photographs are quite different from those of the modern-day photos.
Zero-shot!!!
More examples and results are in the appendix of the paper.
The colorization approach serves as a pretext task for representation learning.
The network model is akin to an autoencoder, except that the input and output are different image channels, suggesting the term cross-channel encoder.
Left: ImageNet Linear Classification, Right: PASCAL Tests.
The pre-trained networks are frozen, and linear classifiers are learnt on top of each convolutional layer for ImageNet classification.
Pretraining is performed without semantic label information.
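A minimal PyTorch sketch of the linear-probe protocol; the trunk below is a stand-in for the pretrained colorization network, not the paper's exact setup:

```python
import torch.nn as nn

trunk = nn.Sequential(                  # stand-in for the pretrained features
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(64),
)
for p in trunk.parameters():
    p.requires_grad = False             # freeze the self-supervised weights

probe = nn.Sequential(                  # the only trainable part
    nn.AdaptiveAvgPool2d(2),            # pool conv features to a fixed size
    nn.Flatten(),
    nn.Linear(64 * 2 * 2, 1000),        # 1000 ImageNet classes
)
# Train `probe` alone with cross-entropy; repeat for each conv layer's features.
```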
AlexNet directly trained on ImageNet classification achieves the highest performance, and serves as the ceiling for this test.
The proposed method outperforms the Gaussian-initialization and k-means-initialization baselines, as well as Context Encoders [10].
The network is trained by freezing the representation up to certain points, and fine-tuning the remainder.
The network is effectively only able to interpret grayscale images, since it was pretrained with only the grayscale L channel as input.
The Fast R-CNN framework is used for the PASCAL detection test.
The proposed method outperforms k-means. However, it is inferior to Context Prediction [14].
The FCN architecture is used for the PASCAL segmentation test.
The proposed grayscale fine-tuned network achieves performance of 35.0%, approximately equal to Donahue et al. [16], and adding in color information increases performance to 35.6% (how???).
By learning colorization as a pretext task without ground-truth labels, useful features are learnt, which can be used for downstream tasks such as image classification, detection, and segmentation.
Sik-Ho Tsang. Review — Colorful Image Colorization (Self-Supervised Learning).