UCSC-VLAA / CLIPA

[NeurIPS 2023] This repository includes the official implementation of our paper "An Inverse Scaling Law for CLIP Training"
Apache License 2.0

Discrepancy in Zero-Shot Top-1 Accuracy for ViT-H/14 Replication #8

Closed rex-yue-wu closed 1 year ago

rex-yue-wu commented 1 year ago

Hello,

I've been working to replicate the fine-tuned model results presented in your CLIPA repository, specifically for ViT-H/14. However, the zero-shot top-1 accuracy of my model is approximately 1.5% lower than the reported figure. Since I'm training on AWS P4 instances (8×A100), I'm uncertain whether this discrepancy falls within an expected margin.

Could you kindly share the training logs for the fine-tuned models listed? Access to these logs would greatly assist me in conducting a thorough comparison to possibly pinpoint any underlying issues.

Additionally, I have a question about the fine-tuning process:

The current setting uses 512M samples at a resolution of 224×224. Is the choice of 512M samples driven by budgetary constraints, or was there another rationale?

Any insights you could provide would be immensely appreciated.

Thank you for your time and consideration.
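For readers comparing runs: the zero-shot top-1 metric under discussion is just an argmax over cosine similarities between image embeddings and class-prompt text embeddings. A minimal NumPy sketch (function name and toy data are illustrative, not from the CLIPA codebase):

```python
import numpy as np

def zero_shot_top1(image_feats, text_feats, labels):
    """Top-1 accuracy of CLIP-style zero-shot classification.

    image_feats: (N, D) image embeddings
    text_feats:  (C, D) class-prompt text embeddings
    labels:      (N,) ground-truth class indices
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_feats = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = image_feats @ text_feats.T  # (N, C) similarity scores
    preds = logits.argmax(axis=1)        # top-1 predicted class per image
    return float((preds == labels).mean())

# Toy check with two orthogonal class embeddings.
text = np.array([[1.0, 0.0], [0.0, 1.0]])
imgs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([0, 1, 0])
print(zero_shot_top1(imgs, text, labels))  # → 1.0
```

A ~1.5% gap at this scale can plausibly come from data ordering, augmentation seeds, or prompt-template differences, which is why comparing against the authors' logs is useful.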

xhl-video commented 1 year ago

Hi, thanks for your interest in our work. I just uploaded all the logs and configs to google_drive. To clarify, the 512M samples are selected randomly. We have a detailed ablation on the number of seen samples (Table 10) in our paper. Our high-resolution (336) model is fine-tuned after the 224 fine-tuning, with 128M seen samples.

rex-yue-wu commented 1 year ago

For those interested in this thread, I've shared my plots based on Xianhang's training logs.

rex-yue-wu commented 1 year ago

@xhl-video Could you please also share the training logs for the BigG models?