facebookresearch / mae

PyTorch implementation of MAE https://arxiv.org/abs/2111.06377

Fig 8 of paper - Have you tried to Pre-train on JFT300M? #96

Open JunzheJosephZhu opened 2 years ago

JunzheJosephZhu commented 2 years ago

In figure 8 of the paper, the authors compared IN1K pre-trained MAE with JFT300M supervised results. Have you tried pre-training an MAE on JFT300M to see if MAE outperforms supervised training on large datasets?

wikiwen commented 2 years ago

The reason may be that JFT300M is a private Google dataset, so the experiment couldn't be done.

wikiwen commented 2 years ago

I have a similar question: why not pre-train on a bigger dataset, like ImageNet-21k or Instagram-1B, to boost performance? (A rough sketch of what that could look like is below the quote.) In the MoCo v1 paper, the authors conjecture that masked auto-encoding may exploit large-scale data better. Does the MAE paper prove this point? @KaimingHe @endernewton

MoCo’s improvement from IN-1M to IG-1B is consistently noticeable but relatively small, suggesting that the larger-scale data may not be fully exploited. We hope an advanced pretext task will improve this. Beyond the simple instance discrimination task [61], it is possible to adopt MoCo for pretext tasks like masked auto-encoding, e.g., in language [12] and in vision.
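
Mechanically, swapping in a bigger dataset looks simple, since MAE pre-training never reads the class labels. The sketch below is only a rough illustration under that assumption, not a recipe from this repo: the IN21K_DIR path is a placeholder, and the augmentation mirrors the published IN1K pre-training transform. Whether the hyper-parameters and compute budget transfer to a larger corpus is exactly the open question here.

```python
# Rough sketch: feed an ImageNet-21k-style folder into an MAE-style
# pre-training data pipeline. The path below is a placeholder.
import torch
from torchvision import datasets, transforms

IN21K_DIR = "/path/to/imagenet21k"  # placeholder, not a real location

# Same augmentation family as the IN1K MAE pre-training recipe.
transform_train = transforms.Compose([
    transforms.RandomResizedCrop(
        224, scale=(0.2, 1.0),
        interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# MAE pre-training ignores labels, so any folder layout that
# ImageFolder accepts (class subfolders containing images) works.
dataset_train = datasets.ImageFolder(IN21K_DIR, transform=transform_train)
loader = torch.utils.data.DataLoader(dataset_train, batch_size=64,
                                     shuffle=True, num_workers=8)
```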

daisukelab commented 2 years ago

FYI, the paper "When Does Contrastive Visual Representation Learning Work?" (https://arxiv.org/abs/2105.05837) shows that current SSL methods might not benefit from very large pre-training sets.

wikiwen commented 2 years ago

Hi daisukelab, thanks for your reply. I have read this paper. Its focus on "how much data is required to learn a good representation using SSL" is very significant. But I have a question about the conclusion that "there is little benefit beyond 500k pretraining images": the reason the performance doesn't increase may be the lack of a bigger model. If a bigger model were used, SSL might be able to capture subtler differences between images.

daisukelab commented 2 years ago

Hi wikiwen, right, it could be that the model is not big enough. One more thing I can think of is that the reconstruction pre-training objective might not be good enough when it comes to learning subtle details. Looking at the blurry reconstructions sometimes makes me wonder whether we could make it recover sharp details.

wikiwen commented 2 years ago

Yeah, I agree with you. An adversarial loss, like the one used in this repo, may be one option.
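
For concreteness, the snippet below is a hypothetical sketch of one way to bolt a GAN-style term onto MAE's per-patch pixel loss; it is not something this repo implements. PatchDiscriminator, mae_adv_losses, and the adv_weight value are made-up names and choices, and pred, target, and mask are assumed to follow the [B, N, patch_dim] and [B, N] layout of MAE's reconstruction loss.

```python
# Hypothetical sketch: MAE pixel reconstruction loss plus a small
# patch-level adversarial term. None of these names come from the MAE codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchDiscriminator(nn.Module):
    """Tiny MLP that scores flattened patches as real vs. reconstructed."""

    def __init__(self, patch_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, patches):           # patches: [B, N, patch_dim]
        return self.net(patches)          # logits:  [B, N, 1]


def mae_adv_losses(pred, target, mask, disc, adv_weight=0.1):
    """pred/target: [B, N, patch_dim]; mask: [B, N], 1 on masked patches.

    Returns (generator_loss, discriminator_loss).
    """
    # Standard MAE pixel loss, averaged over masked patches only.
    rec = ((pred - target) ** 2).mean(dim=-1)          # [B, N]
    rec_loss = (rec * mask).sum() / mask.sum()

    # Non-saturating GAN term: push reconstructions to look "real".
    # For simplicity it is applied to all predicted patches here.
    fake_logits = disc(pred)
    g_adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
    g_loss = rec_loss + adv_weight * g_adv

    # Discriminator sees real patches vs. detached reconstructions.
    real_logits = disc(target)
    fake_logits_d = disc(pred.detach())
    d_loss = 0.5 * (
        F.binary_cross_entropy_with_logits(
            real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(
            fake_logits_d, torch.zeros_like(fake_logits_d)))
    return g_loss, d_loss
```

Training would alternate the two losses like an ordinary GAN; keeping that stable at MAE's pre-training scale is of course the hard part.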

XuZhengzhuo commented 2 years ago

Will there be experiments on IN21K? Looking forward to the pretrained ckpt!