AIM-SE / PR4Rec


Generative Pretraining from Pixels #1

xyq7 opened this issue 4 years ago

xyq7 commented 4 years ago

Paper information

Title: Generative Pretraining from Pixels (Image GPT)
Authors: Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, David Luan, Ilya Sutskever
PDF link: https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V1_ICML.pdf

Summary

What do you like?

The experiments in this work are comprehensive, and they offer conclusions beyond simply showing that generative pre-training works for image classification. First, the curve of linear probe accuracy across depth suggests that the generative model learns global information in its intermediate layers, while features in the later layers focus on predicting the next pixel. Second, the relation between linear probe accuracy and validation generative loss shows a positive correlation between generative modeling performance and the quality of the learned representation for classification. These experiments add to the explainability of the effectiveness of generative pre-training.

What you don't like?

I think there could be experiments on fusing 2D information, since the current generative model does not use any spatial cues, which might be important for images.

How to improve?

The 2D information could be fused while preprocessing the images into sequences, or by using an image generation model with sliding anchors as the generative pre-training task.

Any new ideas?

I wonder whether there is similar work on pre-training image generation models for image classification. We previously did a project using image generation with sliding anchors for object defect classification; there might be some connection.

Reproducing results (if any)

Not yet.

hunkim commented 4 years ago

Wonderful!

> fusion of 2-d information

Could you elaborate more on this point?

xyq7 commented 4 years ago

> Wonderful!
>
> fusion of 2-d information
>
> Could you elaborate more on this point?

In this work, the images are preprocessed by resizing to a low resolution and reshaping into a 1D sequence. For example, a 3×224×224 image is first downscaled (with a reduced color space) to something like 32×32 and then unrolled into a 1D sequence. This process does not encode the 2D spatial structure of images. (This might be because the aim of the work is to demonstrate the effect of generative pre-training.)
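A minimal sketch of that preprocessing (simplified: the paper actually quantizes RGB pixels into 512 k-means color-cluster tokens, which I replace with grayscale intensities here):

```python
import numpy as np
from PIL import Image

def image_to_sequence(path, res=32):
    """Downscale an image and unroll it into a 1D pixel sequence.

    Simplified sketch: iGPT maps each RGB pixel to one of 512 k-means
    color-cluster tokens; grayscale intensities stand in for those
    tokens here.
    """
    img = Image.open(path).convert("L")           # collapse to one channel
    img = img.resize((res, res), Image.BILINEAR)  # e.g. 224x224 -> 32x32
    pixels = np.asarray(img, dtype=np.int64)      # (res, res), values in [0, 255]
    return pixels.reshape(-1)                     # raster-scan order, length res*res
```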

On one side, some think the result on ImageNet is not that good because the network simply resizes the image, so a learnable 2D conv layer for downsampling might lead to better results. Also, I'm not sure, but there may be many different ways to represent an image as a sequence; here we need to ensure that earlier elements of the sequence contain no information about the later ones to be predicted, and that obtaining this representation requires no labels for training.
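A rough sketch of that learnable-downsampling idea (my own illustration, not from the paper):

```python
import torch
import torch.nn as nn

class ConvDownsampleStem(nn.Module):
    """Learnable alternative to fixed resizing: strided convs map a
    3x224x224 image to a (seq_len, embed_dim) sequence of embeddings."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # 224 -> 112
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 112 -> 56
            nn.ReLU(),
            nn.Conv2d(128, embed_dim, kernel_size=7, stride=7),      # 56 -> 8
        )

    def forward(self, x):                          # x: (B, 3, 224, 224)
        feats = self.stem(x)                       # (B, embed_dim, 8, 8)
        return feats.flatten(2).transpose(1, 2)    # (B, 64, embed_dim)
```

One caveat: overlapping receptive fields let earlier positions see pixels that later positions are supposed to predict, which conflicts with the causal constraint above, so in practice such a stem would need masking or non-overlapping patches.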

Also, I wonder whether anyone has tried using a 2D generation model as generative pre-training for images. For example, in a previous work, our aim was to classify whether an object is good or defective. We trained a generative model on hollowed-out images: we slide an anchor over the image, replace the overlapped region with Gaussian noise, and train the network to generate the complete image. The generation result is then used to classify whether the object (drawing pin, chestnut, tablet, etc.) in the image is defective. In fact, the generative model might learn something related to the object's class, since it rarely generates parts belonging to other kinds of objects. This might have some similarity with this work, and a 2D generation model would make use of spatial cues.
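A minimal sketch of the sliding-anchor corruption I described (hypothetical code; the details of our original implementation differ):

```python
import numpy as np

def sliding_anchor_corruptions(image, anchor=32, stride=32, rng=None):
    """Yield (corrupted, (top, left)) for each anchor position: the
    window under the anchor is replaced by Gaussian noise, and a
    reconstruction network is trained to regenerate the full image."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    for top in range(0, h - anchor + 1, stride):
        for left in range(0, w - anchor + 1, stride):
            corrupted = image.astype(np.float32).copy()
            corrupted[top:top + anchor, left:left + anchor] = rng.normal(
                loc=127.5, scale=30.0,
                size=(anchor, anchor) + image.shape[2:])
            yield corrupted, (top, left)
```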

russellkim commented 4 years ago

Hi @xyq7! It's quite an interesting idea. As you know, the reason the authors use a 1D sequence as input is so they can apply BERT- or GPT-2-style models, which handle word sequences and the relationships among words well. Do you have any ideas for utilizing 2D information with a BERT or GPT-2 model? And could you explain "linear probing"? I did not get it. Thanks.

xyq7 commented 4 years ago

As far as I know, linear probing here is used to judge how good a representation from the intermediate layers is. The features are frozen, and only the classification head is optimized; it assesses how linearly classifiable a representation is. As for utilizing 2D info with BERT and GPT-2, I would like to first learn more about these models; I will read some related work after finishing the last presentation at PKU's lab on Wednesday :)
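A minimal PyTorch sketch of linear probing as I understand it (`backbone` is a stand-in for any pretrained model that returns intermediate features of size `feat_dim`):

```python
import torch
import torch.nn as nn

def linear_probe(backbone, loader, feat_dim, num_classes, epochs=10):
    """Freeze the pretrained backbone and train only a linear head,
    measuring how linearly classifiable the features are."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)   # (B, feat_dim) frozen features
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```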