facebookresearch / SLIP

Code release for SLIP: Self-supervision meets Language-Image Pre-training
MIT License

CC3M results cannot be reproduced #15

Open kumamonatseu opened 2 years ago

kumamonatseu commented 2 years ago

Thanks for the great paper. However, I cannot reproduce the results with this repo, and it would be greatly appreciated if more details could be provided.

Similar to issue #9, I also cannot reproduce the CC3M results with 64 GPUs. Concretely, I trained the model on CC3M for 40 epochs with your recommended hyper-parameters (weight decay 0.1, learning rate 3e-3, and a 2-epoch warmup). However, on the ImageNet-1K linear probing task I only reach ~50% top-1 accuracy, far from the 65.4% reported in the paper.
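For concreteness, my run uses the usual linear-warmup plus cosine-decay schedule with the numbers above. A minimal sketch of that schedule (illustrative only; the exact per-step schedule lives in the repo's training script, and `min_lr` here is my assumption):

```python
import math

def lr_at_epoch(epoch, total_epochs=40, warmup_epochs=2,
                base_lr=3e-3, min_lr=1e-5):
    """Linear warmup followed by half-period cosine decay (illustrative;
    see the repo's training script for the exact per-step schedule)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```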

Besides, I also noticed that the released checkpoint does not contain the full set of hyper-parameters under the 'args' key; e.g., ssl_scale is missing, which could be essential for reproduction. The CC3M checkpoint also indicates that the weight decay was set to 0.5, whereas both the paper and the README say 0.1. You also mentioned that the best result is achieved well before the last epoch, and the checkpoint records epoch 36; I tested the model from the 36th epoch as well, and the result is still far from 65.4%.
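For reference, this is how I inspected the released checkpoint (the filename is a placeholder for the released CC3M weights):

```python
import torch

# Load the released CC3M checkpoint on CPU and dump its stored hyper-parameters.
ckpt = torch.load("slip_base_cc3m.pt", map_location="cpu")

args = ckpt["args"]
args = args if isinstance(args, dict) else vars(args)  # argparse.Namespace or dict
for name, value in sorted(args.items()):
    print(f"{name} = {value}")  # wd shows up as 0.5; ssl_scale is absent

print("epoch:", ckpt.get("epoch"))  # records 36 for the CC3M checkpoint
```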

I hope you can confirm the hyper-parameters for pre-training on CC3M. As stated in issue #9, it would also be very helpful if the training log could be provided; if wandb was used, the log should already be in the cloud and easy to find.

bram-w commented 2 years ago

I may have a similar problem with CC12M. Our subset is smaller than the one FAIR has (10M instead of 11M pairs, since some of the URLs are dead), but CLIP zero-shot ImageNet performance lags by ~8 percentage points, which is more than I would expect for that level of data difference.
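For context, my zero-shot number comes from standard CLIP-style prompt classification, roughly like the sketch below (the method and variable names are illustrative, not the repo's exact eval code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(model, tokenize, loader, classnames, device="cuda"):
    """CLIP-style zero-shot eval: rank normalized image embeddings against
    normalized text embeddings of one prompt per class (illustrative sketch)."""
    prompts = tokenize([f"a photo of a {c}" for c in classnames]).to(device)
    text = F.normalize(model.encode_text(prompts), dim=-1)

    correct = total = 0
    for images, labels in loader:
        image = F.normalize(model.encode_image(images.to(device)), dim=-1)
        pred = (image @ text.t()).argmax(dim=-1).cpu()
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total
```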

Thanks in advance and I really appreciate the work!

normster commented 2 years ago

Sorry for the late reply. The training parameters for CC3M/CC12M differed from those for YFCC15M only in the number of epochs. In all of our experiments, unless otherwise indicated, we used ssl_scale=1. Per the paper, the SLIP/SimCLR models should be trained with wd=0.1. When preparing the code/model release I also manually stripped extraneous args from the checkpoints, and it's possible that I overwrote the wrong value there; however, even with wd=0.5, performance shouldn't degrade that much.
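To be explicit about what ssl_scale does: per the paper, it simply weights the SimCLR term in the total objective, as in this sketch (function and argument names are illustrative):

```python
def slip_loss(clip_loss, simclr_loss, ssl_scale=1.0):
    """Total SLIP objective: CLIP contrastive loss plus ssl_scale times the
    SimCLR loss on two augmented views of each image (illustrative sketch)."""
    return clip_loss + ssl_scale * simclr_loss
```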

It's possible that our versions of CC3M/CC12M differ in some significant way. I didn't download this data myself, so I will have to ask around to see what kind of post-processing might have been performed on the captions/images.