NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review -- Colorful Image Colorization (Self-Supervised Learning). #123

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — Colorful Image Colorization (Self-Supervised Learning).

NorbertZheng commented 1 year ago

Overview

Colorization as Pretext Task in Self-Supervised Learning, Outperforms Context Prediction & Context Encoders.

image Example Input Grayscale Photos and Output Colorizations.

In this story, Colorful Image Colorization, by University of California, Berkeley, is reviewed. It is a 2016 ECCV paper with over 1900 citations.

NorbertZheng commented 1 year ago

Multi-Modality!!!

NorbertZheng commented 1 year ago

Colorful Image Colorization

image Colorful Image Colorization: Network Architecture.

Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm layer. The net has no pool layers; only spatial downsampling or upsampling is used between conv blocks where needed.
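A minimal PyTorch sketch of such a block (the channel sizes and the placement of the stride-2 downsampling are illustrative assumptions, not the authors' released code):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, num_convs=2, downsample=False):
    """A block of repeated 3x3 conv + ReLU layers, ending with BatchNorm.

    There is no pooling; spatial downsampling, when needed, is done with a
    stride-2 convolution at the end of the block.
    """
    layers = []
    for i in range(num_convs):
        stride = 2 if (downsample and i == num_convs - 1) else 1
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                      kernel_size=3, stride=stride, padding=1),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.BatchNorm2d(out_ch))
    return nn.Sequential(*layers)

# e.g. the first two blocks, taking the grayscale L channel as input
conv1 = conv_block(1, 64, num_convs=2, downsample=True)
conv2 = conv_block(64, 128, num_convs=2, downsample=True)
```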

L2 Loss is NOT Robust

Using the naïve L2 loss $L_2(\hat{Y}, Y) = \frac{1}{2}\sum_{h,w}\|Y_{h,w} - \hat{Y}_{h,w}\|_2^2$ is not robust to the inherent ambiguity and multimodal nature of the colorization problem. For example:

NorbertZheng commented 1 year ago

Multiple possible solutions lead to averaging: if an object can take on a set of distinct ab values, the optimum of the L2 loss is the mean of that set, producing grayish, desaturated results.
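A toy numeric illustration of this averaging effect (the ab values here are hypothetical):

```python
import numpy as np

# suppose an object is red in half the training images and blue in the other half
red_ab  = np.array([ 60.0,  40.0])   # hypothetical ab value of the red mode
blue_ab = np.array([-20.0, -60.0])   # hypothetical ab value of the blue mode

# the minimizer of the expected L2 loss is the mean of the modes, not a mode
l2_optimal = (red_ab + blue_ab) / 2
print(l2_optimal)                    # [ 20. -10.]
print(np.linalg.norm(l2_optimal))    # ~22, far less chromatic than either mode (~72, ~63)
```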

NorbertZheng commented 1 year ago

Multinomial Classification Loss

The problem is treated as multinomial classification. image Quantized ab color space with a grid size of 10.

The ab output space is quantized into bins with grid size 10, and only the $Q=313$ values that are in-gamut are kept, as above.

For a given input $X$, we learn a mapping $\hat{Z}=G(X)$ to a probability distribution over possible colors, $\hat{Z} \in [0,1]^{H \times W \times Q}$, where $Q$ is the number of quantized ab values.

$Z$ is the vector converted from the ground truth color $Y$, $Z = H_{gt}^{-1}(Y)$, using a soft-encoding scheme (i.e. not exactly one-hot, which makes it easier to learn meaningful embeddings!!!):

The 5-nearest neighbors to $Y_{h,w}$ in the output space are selected and weighted proportionally to their distance from the ground truth using a Gaussian kernel with $\sigma=5$.
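A NumPy sketch of this soft-encoding, assuming `bin_centers` is the $(Q, 2)$ array of in-gamut ab bin centers (the variable names are mine, not the authors'):

```python
import numpy as np

def soft_encode(ab, bin_centers, k=5, sigma=5.0):
    """Soft-encode ground-truth ab values over the Q quantized bins.

    ab:          (N, 2) ground-truth ab values, one row per pixel
    bin_centers: (Q, 2) in-gamut ab bin centers (Q = 313)
    Returns Z:   (N, Q) soft targets, each row summing to 1
    """
    # squared distance from every pixel to every bin center
    d2 = ((ab[:, None, :] - bin_centers[None, :, :]) ** 2).sum(-1)  # (N, Q)
    nn_idx = np.argsort(d2, axis=1)[:, :k]           # k nearest bins per pixel
    nn_d2 = np.take_along_axis(d2, nn_idx, axis=1)
    w = np.exp(-nn_d2 / (2 * sigma ** 2))            # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True)                # normalize to a distribution
    Z = np.zeros((ab.shape[0], bin_centers.shape[0]))
    np.put_along_axis(Z, nn_idx, w, axis=1)
    return Z
```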

Thus, the multinomial cross-entropy loss is $L_{cl}(\hat{Z}, Z) = -\sum_{h,w} v(Z_{h,w}) \sum_{q} Z_{h,w,q} \log \hat{Z}_{h,w,q}$, where $v(\cdot)$ is a weighting term that can be used to rebalance the loss based on color-class rarity: $v(Z_{h,w}) = w_{q^*}$ with $q^* = \arg\max_q Z_{h,w,q}$, and $w \propto \left((1-\lambda)\tilde{p} + \frac{\lambda}{Q}\right)^{-1}$ normalized so that $\mathbb{E}_{\tilde{p}}[w] = 1$. Each pixel is weighted by the factor $w_{q^*}$ of its closest ab bin, $\tilde{p}$ is the smoothed empirical distribution of ab bins, and $\lambda=1/2$.

In brief, class rebalancing is achieved by reweighting the loss of each pixel at train time based on the rarity of its color, as above.
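A sketch of how these weights could be computed, assuming `p_tilde` is the smoothed empirical distribution of ab bins over the training set:

```python
import numpy as np

def rebalance_weights(p_tilde, lam=0.5):
    """w ∝ ((1 - λ) p̃ + λ/Q)^(-1), normalized so that E_{p̃}[w] = 1.

    p_tilde: (Q,) smoothed empirical bin distribution (sums to 1)
    """
    Q = p_tilde.shape[0]
    w = 1.0 / ((1 - lam) * p_tilde + lam / Q)  # mix with uniform, then invert
    w /= (p_tilde * w).sum()                   # enforce E_{p̃}[w] = 1
    return w

# the loss at pixel (h, w) is then scaled by w[q*], q* = argmax_q Z_{h,w,q}
```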

Finally, we map the probability distribution $\hat{Z}$ to color values with $\hat{Y} = H(\hat{Z})$. This mapping $H$ is described below.

NorbertZheng commented 1 year ago

Use classification loss to reduce the difficulty of the pretext task???

NorbertZheng commented 1 year ago

Class Probabilities to Point Estimates

image Effect of temperature parameter T on the annealed-mean output.

$H$ is defined to map the predicted distribution $\hat{Z}$ to point estimate $\hat{Y}$ in ab space: $H(Z_{h,w}) = \mathbb{E}[f_T(Z_{h,w})]$, where $f_T(z) = \frac{\exp(\log(z)/T)}{\sum_q \exp(\log(z_q)/T)}$ is a re-normalized softmax with temperature $T$.

The temperature $T=0.38$, shown in the middle column of the above figure, captures the vibrancy of the mode while maintaining the spatial coherence of the mean.

(The introduction of temperature $T$ into the softmax follows the same idea as in Model Distillation.)

Hence, the final system $F$ is the composition of CNN $G$, which produces a predicted distribution over all pixels, and the annealed-mean operation $H$, which produces a final prediction. (The system is NOT quite end-to-end trainable.)
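A NumPy sketch of the annealed-mean readout (again assuming `bin_centers` holds the $(Q, 2)$ ab bin centers):

```python
import numpy as np

def annealed_mean(Z_hat, bin_centers, T=0.38, eps=1e-8):
    """Map predicted per-pixel distributions to ab point estimates.

    Z_hat:       (N, Q) predicted probabilities over ab bins
    bin_centers: (Q, 2) ab bin centers
    T:           temperature; T -> 0 approaches the mode, T = 1 gives the mean
    """
    logits = np.log(Z_hat + eps) / T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    f_T = np.exp(logits)
    f_T /= f_T.sum(axis=1, keepdims=True)         # re-normalized softmax
    return f_T @ bin_centers                      # expectation over the bins
```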

NorbertZheng commented 1 year ago

NOT end-to-end trainable!!!

NorbertZheng commented 1 year ago

Colorization Results

image Colorization results on 10k images in the ImageNet validation set.

The 1.3M images from the ImageNet training set are used for training, the first 10k images in the ImageNet validation set are used for validation, and a separate 10k images in the validation set are used for testing.

Perceptual realism (AMT)

A real vs. fake two-alternative forced choice experiment is run on Amazon Mechanical Turk (AMT). In each session of 40 test pairs, participants were asked to click on the photo they believed contained fake colors.

The proposed full algorithm fooled participants on 32% of trials, significantly more often than all compared algorithms. These results validate the effectiveness of using both the classification loss and class rebalancing.

NorbertZheng commented 1 year ago

Class rebalancing is just like an adversarial loss with an inductive bias towards colorization!!!

NorbertZheng commented 1 year ago

Semantic Interpretability (VGG Classification)

It is tested by feeding the colorized (fake) images to a VGG classifier. If the classifier performs well, the colorizations are accurate enough to convey the image semantics.

Classifier performance drops from 68.3% to 52.7% after ablating colors from the input. After re-colorizing with the proposed full method, performance improves to 56.0%.
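A minimal sketch of such an evaluation with torchvision's pretrained VGG-16 (the exact VGG variant and preprocessing used in the paper are assumptions here):

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

weights = VGG16_Weights.IMAGENET1K_V1
classifier = vgg16(weights=weights).eval()
preprocess = weights.transforms()          # standard ImageNet preprocessing

@torch.no_grad()
def top1_accuracy(images, labels):
    """images: list of PIL RGB images (e.g. re-colorized outputs)."""
    batch = torch.stack([preprocess(im) for im in images])
    preds = classifier(batch).argmax(dim=1)
    return (preds == labels).float().mean().item()
```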

NorbertZheng commented 1 year ago

Use machines, instead of humans, to check the generative performance.

NorbertZheng commented 1 year ago

Legacy Black and White Photos

image Applying the proposed method to legacy black and white photos.

The proposed model is still able to produce good colorizations, even though the low-level image statistics of legacy photographs are quite different from those of modern-day photos.

NorbertZheng commented 1 year ago

Zero-shot!!!

NorbertZheng commented 1 year ago

More Examples

image More examples and results are in the appendix of the paper.

NorbertZheng commented 1 year ago

Self-Supervised Learning Results

The colorization approach serves as a pretext task for representation learning.

The network model is akin to an autoencoder, except that the input and output are different image channels, suggesting the term cross-channel encoder.

image Left: ImageNet Linear Classification, Right: PASCAL Tests.

ImageNet Classification

The pre-trained networks are frozen and linear classifiers are learnt on top of each convolutional layer for ImageNet classification.
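A PyTorch sketch of such a linear probe (the feature extraction and dimensions are assumptions):

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes=1000):
    """Freeze the pre-trained backbone and train only a linear head."""
    for p in backbone.parameters():
        p.requires_grad = False               # representation stays fixed
    backbone.eval()
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)
    return head, optimizer

# per batch: feats = backbone(x).flatten(1)  # activations of a chosen conv layer
#            loss = nn.functional.cross_entropy(head(feats), labels)
```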

Pre-training is performed without semantic label information.

AlexNet directly trained on ImageNet classification achieves the highest performance, and serves as the ceiling for this test.

The proposed method outperforms Gaussian, k-means, and Context Encoders [10].

NorbertZheng commented 1 year ago

PASCAL Classification

The network is trained by freezing the representation up to certain points, and fine-tuning the remainder.

Since pre-training only ever sees the lightness channel, the network is effectively only able to interpret grayscale images.
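A sketch of this freeze-up-to-a-layer setup in PyTorch (the module indexing is a hypothetical convention):

```python
import torch
import torch.nn as nn

def freeze_up_to(model: nn.Module, k: int):
    """Freeze the first k child modules; fine-tune everything after them."""
    for i, child in enumerate(model.children()):
        if i < k:
            for p in child.parameters():
                p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

# trainable = freeze_up_to(backbone, k=3)    # e.g. keep conv1..conv3 fixed
# optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```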

PASCAL Detection

The Fast R-CNN framework is used.

The proposed method outperforms k-means. However, it is inferior to Context Prediction [14].

PASCAL Segmentation

The FCN architecture is used.

The proposed grayscale fine-tuned network achieves a performance of 35.0% mIoU, approximately equal to Donahue et al. [16], and adding in color information increases performance to 35.6% (how???).

NorbertZheng commented 1 year ago

By learning colorization as a pretext task without ground-truth labels, useful features are learnt that can be used for downstream tasks such as image classification, detection, and segmentation.

NorbertZheng commented 1 year ago
