Colorization as Pretext Task in Self-Supervised Learning, Outperforms Context Prediction & Context Encoders.
Example Input Grayscale Photos and Output Colorizations.
In this story, Colorful Image Colorization, by University of California, Berkeley, is reviewed. It is a paper in 2016 ECCV with over 1900 citations.
Multi-Modality!!!
Colorful Image Colorization: Network Architecture.
Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm layer. The net has no pool layers; only spatial downsampling or upsampling is used between conv blocks when needed.
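As a concrete illustration, here is a minimal PyTorch sketch of one such conv block, assuming 3x3 convolutions and stride-2 downsampling on the last conv of a block (a sketch of the described structure, not the authors' released code):

```python
# Minimal sketch of one "conv layer" block as described above:
# 2-3 repeated conv+ReLU pairs, then a single BatchNorm; strided
# convolution stands in for pooling when downsampling is needed.
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs=2, stride=1, dilation=1):
    layers = []
    for i in range(n_convs):
        s = stride if i == n_convs - 1 else 1  # downsample only on the last conv
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3,
                      stride=s, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.BatchNorm2d(out_ch))
    return nn.Sequential(*layers)

# e.g. a VGG-style trunk over the grayscale L channel:
# block1 = conv_block(1, 64, stride=2)
# block2 = conv_block(64, 128, stride=2)
```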
Using the naïve L2 loss shown below is not robust to the inherent ambiguity and multimodal nature of the colorization problem:

$$L_2(\hat{Y}, Y) = \frac{1}{2}\sum_{h,w}\left\|Y_{h,w} - \hat{Y}_{h,w}\right\|_2^2$$

For example, if an object can take on several plausible colors, the optimal solution under L2 is their mean; multiple possible solutions lead to grayish, desaturated results.
The problem is instead treated as a multinomial classification.
Quantized ab color space with a grid size of 10.
The ab output space is quantized into bins with grid size 10, keeping the $Q=313$ values that are in-gamut, as shown above.
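For intuition, here is a rough sketch of how the in-gamut bins can be recovered (assuming `skimage`; the exact count depends on the gamut test used, so treat this as illustrative rather than the paper's exact procedure):

```python
# Lay a grid of size 10 over ab space and keep the bins that survive a
# Lab -> RGB -> Lab round trip for at least one lightness value L.
import numpy as np
from skimage import color

ab_grid = np.arange(-110, 120, 10)                  # candidate bin centers
bins = np.array([(a, b) for a in ab_grid for b in ab_grid], dtype=float)

in_gamut = np.zeros(len(bins), dtype=bool)
for L in range(0, 101, 10):                         # test several lightness values
    lab = np.concatenate([np.full((len(bins), 1), float(L)), bins], axis=1)[None]
    rgb = color.lab2rgb(lab)                        # clips out-of-gamut colors
    back = color.rgb2lab(rgb)
    in_gamut |= np.abs(back - lab).max(axis=2)[0] < 1.0  # loose round-trip tolerance

quantized_ab = bins[in_gamut]                       # roughly the Q = 313 bins
```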
For a given input $X$, we learn a mapping $\hat{Z}=G(X)$ to a probability distribution over possible colors, $\hat{Z} \in [0,1]^{H \times W \times Q}$, where $Q$ is the number of quantized ab values.
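The output head of $G$ can be sketched in a few lines of PyTorch: a 1x1 convolution maps the final feature map to $Q$ logits per pixel, followed by a per-pixel softmax (the 256 input channels here are an assumption, not a figure from the paper):

```python
import torch.nn as nn

Q = 313
head = nn.Sequential(
    nn.Conv2d(256, Q, kernel_size=1),  # feature channels -> Q logits per pixel
    nn.Softmax(dim=1),                 # per-pixel distribution over the ab bins
)
```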
$Z$ is the vector converted from the ground truth color $Y$, using a soft-encoding scheme (i.e. not exactly one-hot, easier to learn meaningful embeddings!!!):
The 5-nearest neighbors to $Y_{h,w}$ in the output space are selected and weighted proportionally to their distance from the ground truth using a Gaussian kernel with $\sigma=5$.
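A numpy sketch of this soft-encoding step, reusing `quantized_ab` (the $(Q, 2)$ array of bin centers) from the sketch above:

```python
# Encode each ground-truth ab pixel over its 5 nearest bin centers,
# weighted by a Gaussian kernel (sigma = 5) and normalized to sum to 1.
import numpy as np

def soft_encode(ab_pixels, quantized_ab, k=5, sigma=5.0):
    # ab_pixels: (N, 2) ground-truth ab values -> (N, Q) soft labels Z
    d2 = ((ab_pixels[:, None, :] - quantized_ab[None, :, :]) ** 2).sum(-1)
    nn_idx = np.argsort(d2, axis=1)[:, :k]            # k nearest bins per pixel
    w = np.exp(-np.take_along_axis(d2, nn_idx, axis=1) / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)                 # normalize per pixel
    Z = np.zeros((len(ab_pixels), len(quantized_ab)))
    np.put_along_axis(Z, nn_idx, w, axis=1)
    return Z
```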
Thus, the multinomial cross-entropy loss is:

$$L_{cl}(\hat{Z}, Z) = -\sum_{h,w} v(Z_{h,w}) \sum_q Z_{h,w,q} \log \hat{Z}_{h,w,q}$$

where $v(\cdot)$ is a weighting term used to rebalance the loss based on color-class rarity. Each pixel is weighted by a factor $w$ based on its closest ab bin:

$$v(Z_{h,w}) = w_{q^*}, \quad q^* = \arg\max_q Z_{h,w,q}, \quad w \propto \left((1-\lambda)\tilde{p} + \frac{\lambda}{Q}\right)^{-1}, \quad \mathbb{E}_{\tilde{p}}[w] = 1$$

with $\lambda = 1/2$ and $\tilde{p}$ the smoothed empirical distribution of ab bins over the training set.
In brief, the class-imbalance problem is addressed by reweighting the loss of each pixel at train time based on the rarity of its color, as above.
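A numpy sketch of the rebalancing weights and the weighted loss; `p_smoothed` stands for the smoothed empirical bin distribution (illustrative, not the authors' implementation):

```python
import numpy as np

def rebalancing_weights(p_smoothed, lam=0.5):
    # w proportional to ((1 - lambda) * p~ + lambda / Q)^-1, with E_p~[w] = 1
    Q = len(p_smoothed)
    w = 1.0 / ((1 - lam) * p_smoothed + lam / Q)
    return w / (p_smoothed * w).sum()

def rebalanced_cross_entropy(Z, Z_hat, w):
    # Z, Z_hat: (N, Q) soft labels and predicted distributions; w: (Q,)
    v = w[Z.argmax(axis=1)]             # weight of each pixel's closest bin q*
    return -(v * (Z * np.log(Z_hat + 1e-10)).sum(axis=1)).mean()
```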
Finally, we map the probability distribution $\hat{Z}$ to color values $\hat{Y}$ with a function $\hat{Y} = H(\hat{Z})$. This mapping $H$ is described in Section 1.3 below.
Use classification loss to reduce the difficulty of the pretext task???
Effect of temperature parameter T on the annealed-mean output.
$H$ is defined to map the predicted distribution $\hat{Z}$ to a point estimate $\hat{Y}$ in ab space by taking the annealed mean:

$$H(Z_{h,w}) = \mathbb{E}\left[f_T(Z_{h,w})\right], \quad f_T(z) = \frac{\exp(\log z / T)}{\sum_q \exp(\log z_q / T)}$$
The temperature $T=0.38$, shown in the middle column of the above figure, captures the vibrancy of the mode while maintaining the spatial coherence of the mean.
(The use of a temperature $T$ in the Softmax can be traced back to Model Distillation.)
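A small numpy sketch of the annealed-mean decoding (illustrative):

```python
import numpy as np

def annealed_mean(Z_hat, quantized_ab, T=0.38):
    # Z_hat: (N, Q) predicted distributions -> (N, 2) ab point estimates
    logits = np.log(Z_hat + 1e-10) / T           # re-soften at temperature T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    f_T = np.exp(logits)
    f_T /= f_T.sum(axis=1, keepdims=True)        # f_T(z) = softmax(log(z) / T)
    return f_T @ quantized_ab                    # expectation over bin centers

# T -> 0 approaches the per-pixel mode; T = 1 recovers the distribution mean.
```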
Hence, the final system $F$ is the composition of CNN $G$, which produces a predicted distribution over all pixels, and the annealed-mean operation $H$, which produces a final prediction.
NOT end-to-end trainable!!!
Colorization results on 10k images in the ImageNet validation set.
The 1.3M images from the ImageNet training set are used for training, the first 10k images of the ImageNet validation set for validation, and a separate 10k validation images for testing.
A real vs. fake two-alternative forced choice experiment is run on Amazon Mechanical Turk (AMT): 40 participants were asked to click on the photo they believed contained fake colors.
The proposed full algorithm fooled participants on 32% of trials, significantly more often than all compared algorithms. These results validate the effectiveness of using both the classification loss and class rebalancing.
Class rebalancing is just like an adversarial loss with an inductive bias towards colorization!!!
Colorization quality is also tested by feeding the fake colorized images to a VGG classifier: if the classifier performs well, the colorizations are accurate.
Classifier performance drops from 68.3% to 52.7% after ablating colors from the input. After re-colorizing using our full method, the performance is improved to 56.0%.
Use machines, instead of humans, to check the generative performance.
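A hedged sketch of this machine check (torchvision assumed; the batch names are illustrative placeholders):

```python
# Measure top-1 accuracy of an off-the-shelf ImageNet classifier on the
# original, grayscale, and re-colorized versions of the same images.
import torch
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def top1_accuracy(images, labels):
    # images: (N, 3, 224, 224) ImageNet-normalized batch; labels: (N,)
    return (vgg(images).argmax(dim=1) == labels).float().mean().item()

# top1_accuracy(color_batch, labels)        # original colors
# top1_accuracy(gray3_batch, labels)        # grayscale stacked to 3 channels
# top1_accuracy(recolorized_batch, labels)  # output of the colorization model
```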
Applying the proposed method to legacy black and white photos.
The proposed model is still able to produce good colorizations, even though the low-level image statistics of the legacy photographs are quite different from those of the modern-day photos.
Zero-shot!!!
More examples and results are in the appendix of the paper.
The colorization approach serves as a pretext task for representation learning.
The network model is akin to an autoencoder, except that the input and output are different image channels, suggesting the term cross-channel encoder.
Left: ImageNet Linear Classification, Right: PASCAL Tests.
The pre-trained networks are frozen, and linear classifiers are learnt on top of each convolutional layer for ImageNet classification.
Pretraining is performed without semantic label information.
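A minimal PyTorch sketch of the linear-probe protocol; the trunk below is a stand-in for the pretrained colorization network, not the paper's exact setup:

```python
import torch.nn as nn

trunk = nn.Sequential(                  # stand-in for the pretrained features
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(64),
)
for p in trunk.parameters():
    p.requires_grad = False             # freeze the self-supervised weights

probe = nn.Sequential(                  # the only trainable part
    nn.AdaptiveAvgPool2d(2),            # pool conv features to a fixed size
    nn.Flatten(),
    nn.Linear(64 * 2 * 2, 1000),        # 1000 ImageNet classes
)
# Train `probe` alone with cross-entropy; repeat for each conv layer's features.
```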
AlexNet directly trained on ImageNet classification achieves the highest performance, and serves as the ceiling for this test.
The proposed method outperforms the Gaussian-initialization and k-means-initialization baselines, as well as Context Encoders [10].
The network is trained by freezing the representation up to certain points, and fine-tuning the remainder.
The network is effectively only able to interpret grayscale images, since it was pretrained with only the grayscale L channel as input.
The Fast R-CNN framework is used for the PASCAL detection test.
The proposed method outperforms k-means. However, it is inferior to Context Prediction [14].
The FCN architecture is used for the PASCAL segmentation test.
The proposed grayscale fine-tuned network achieves performance of 35.0%, approximately equal to Donahue et al. [16], and adding in color information increases performance to 35.6% (how???).
By learning colorization as a pretext task without ground-truth labels, useful features are learnt, which can be used for downstream tasks such as image classification, detection, and segmentation.
Sik-Ho Tsang. Review — Colorful Image Colorization (Self-Supervised Learning).