JunzheJosephZhu opened this issue 2 years ago:

In Figure 8 of the paper, the authors compare IN1K-pre-trained MAE with JFT-300M supervised results. Have you tried pre-training an MAE on JFT-300M to see whether MAE also outperforms supervised training on large datasets?
The reason may be that JFT-300M is a private Google dataset, so the experiment couldn't be done.
I have a similar question: why not pre-train on a bigger dataset, like ImageNet-21k or Instagram-1B, to boost performance? In the MoCo v1 paper, the authors speculate that masked auto-encoding may better exploit large-scale data. Does the MAE paper prove this point? @KaimingHe @endernewton Quoting the MoCo paper:
MoCo’s improvement from IN-1M to IG-1B is consistently noticeable but relatively small, suggesting that the larger-scale data may not be fully exploited. We hope an advanced pretext task will improve this. Beyond the simple instance discrimination task [61], it is possible to adopt MoCo for pretext tasks like masked auto-encoding, e.g., in language [12] and in vision.
FYI, the paper "When Does Contrastive Visual Representation Learning Work?" (https://arxiv.org/abs/2105.05837) shows that current SSL methods might not benefit from very large pre-training sets.
Hi daisukelab, thanks for your reply. I have read this paper. Its focus on "how much data is required to learn a good representation using SSL" is very significant. But I have a question about the conclusion "There is little benefit beyond 500k pretraining images": the reason performance doesn't increase may be the lack of a bigger model. If a bigger model were used, SSL might be able to capture more subtle differences in images.
Hi wikiwen, right, it could be that the model is not big enough. One more thing I can think of is that the reconstruction pre-training objective might not be good enough when it comes to learning subtle details. Looking at the blurry reconstructions sometimes makes me wonder if we could make it recover sharp details.
Yeah, I agree with you. An adversarial loss like the one used in this repo may be one option.
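For concreteness, here is a minimal sketch of how such an adversarial term could sit on top of MAE-style masked-patch MSE reconstruction. This is hypothetical and not taken from the MAE repo or the repo mentioned above; `PatchDiscriminator`, `generator_loss`, `discriminator_loss`, and the 0.1 adversarial weight are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchDiscriminator(nn.Module):
    """Scores flattened pixel patches as real (1) or reconstructed (0)."""

    def __init__(self, patch_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, patches):  # (B, N, patch_dim) -> (B, N) logits
        return self.net(patches).squeeze(-1)


def generator_loss(pred, target, mask, disc, adv_weight=0.1):
    """MAE-style MSE on masked patches plus a non-saturating GAN term.

    pred, target: (B, N, patch_dim) predicted / ground-truth patches
    mask:         (B, N), 1 for masked (reconstructed) patches, 0 for visible ones
    """
    mse = ((pred - target) ** 2).mean(dim=-1)      # per-patch MSE, (B, N)
    rec_loss = (mse * mask).sum() / mask.sum()     # average over masked patches only
    # Adversarial term: push the discriminator to call reconstructions "real".
    logits_fake = disc(pred)
    adv_loss = F.binary_cross_entropy_with_logits(
        logits_fake, torch.ones_like(logits_fake))
    return rec_loss + adv_weight * adv_loss


def discriminator_loss(pred, target, disc):
    """Real/fake classification loss on ground-truth vs. reconstructed patches."""
    logits_real = disc(target)
    logits_fake = disc(pred.detach())  # detach: only the discriminator updates here
    real_term = F.binary_cross_entropy_with_logits(
        logits_real, torch.ones_like(logits_real))
    fake_term = F.binary_cross_entropy_with_logits(
        logits_fake, torch.zeros_like(logits_fake))
    return real_term + fake_term
```

In practice the two losses would be optimized alternately with separate optimizers, as in a standard GAN setup, and the adversarial weight would need tuning so it does not overwhelm the reconstruction term.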
Will there be experiments on IN21K? Looking forward to the pretrained ckpt!