Hi, @HaniItani,
Yes, we modified the SLIP repo. However, their all_gather operator doesn't back-propagate gradients, so you need to switch to one that does, e.g., torch.distributed.nn.all_gather.
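As a minimal sketch (illustrative, not copied verbatim from our code), a gather that keeps the features in the autograd graph could look like this:

```python
import torch
import torch.distributed as dist
import torch.distributed.nn  # exposes the autograd-aware all_gather


def gather_with_grad(features: torch.Tensor) -> torch.Tensor:
    """Gather features from all ranks while keeping them differentiable."""
    if not (dist.is_available() and dist.is_initialized()):
        return features
    # torch.distributed.nn.all_gather returns one tensor per rank and
    # back-propagates gradients to every rank's input, unlike dist.all_gather.
    gathered = torch.distributed.nn.all_gather(features)
    return torch.cat(gathered, dim=0)
```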
Yes, we used qk_norm. Our initial trials showed it improves training stability a bit, so we kept it. Performance-wise, I don't think it caused noticeable differences.
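To illustrate what qk_norm means here (a rough sketch of the attention block; the exact implementation in the code may differ slightly), the queries and keys are normalized per head before the dot-product attention:

```python
import torch
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, qk_norm: bool = True):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Per-head LayerNorm on q and k; identity when qk_norm is disabled.
        self.q_norm = nn.LayerNorm(self.head_dim) if qk_norm else nn.Identity()
        self.k_norm = nn.LayerNorm(self.head_dim) if qk_norm else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```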
For linear probing, we followed the setup described in, e.g., Section A.6 of this paper.
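A generic sketch of that kind of linear-probe evaluation (illustrative only; the exact solver, regularization sweep, and other settings are in the referenced section):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def linear_probe(train_feats: np.ndarray, train_labels: np.ndarray,
                 test_feats: np.ndarray, test_labels: np.ndarray,
                 C: float = 1.0) -> float:
    """Fit a linear classifier on frozen image features and return test accuracy."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```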
Hi @HaniItani, thanks again for your interest. Please refer to our loss implementation.
Hello,
Thank you for sharing your great work!
I'm having issues reproducing your CLIP baselines for CC3M using the original captions. I suspect you are using the SLIP codebase to train your models for CC3M and CC12M, and I'm doing the same. I also noticed that the difference between your CLIP implementation and theirs is the qk_norm in the attention blocks in the vision transformer.
I've implemented your evaluation pipeline and managed to reproduce the linear probing numbers you report for some of the datasets using your provided checkpoint. As you can see, the gap between my reproduction attempt and yours is substantial. Note that the best ImageNet zero-shot accuracy I got in my reproduction is 15.2, which is comparable to your reported 15.8.
Can you please advise? Thanks!