UCSC-VLAA / CLIPA

[NeurIPS 2023] This repository includes the official implementation of our paper "An Inverse Scaling Law for CLIP Training"
Apache License 2.0

Discrepancy in Zero-Shot Top-1 Accuracy for ViT-H/14 Replication #8

Closed rex-yue-wu closed 1 year ago

rex-yue-wu commented 1 year ago

Hello,

I've been working to replicate the fine-tuned model results presented in your CLIPA repository, specifically for ViT-H/14. However, the zero-shot top-1 accuracy of my model is approximately 1.5% lower than the reported figure. Since I'm training on AWS P4 instances (8×A100), I'm uncertain whether this discrepancy falls within an expected margin.

Could you kindly share the training logs for the fine-tuned models listed? Access to these logs would greatly assist me in conducting a thorough comparison to possibly pinpoint any underlying issues.

Additionally, I have a question about the fine-tuning process:

The current setting uses 512M samples at a resolution of 224×224. Is the choice of 512M samples driven by budgetary constraints, or was there another rationale?

Any insights you could provide would be immensely appreciated.

Thank you for your time and consideration.
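For readers comparing runs: the zero-shot top-1 metric under discussion is just an argmax over cosine similarities between image embeddings and class-prompt text embeddings. A minimal NumPy sketch (function name and toy data are illustrative, not from the CLIPA codebase):

```python
import numpy as np

def zero_shot_top1(image_feats, text_feats, labels):
    """Top-1 accuracy of CLIP-style zero-shot classification.

    image_feats: (N, D) image embeddings
    text_feats:  (C, D) class-prompt text embeddings
    labels:      (N,) ground-truth class indices
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_feats = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = image_feats @ text_feats.T  # (N, C) similarity scores
    preds = logits.argmax(axis=1)        # top-1 predicted class per image
    return float((preds == labels).mean())

# Toy check with two orthogonal class embeddings.
text = np.array([[1.0, 0.0], [0.0, 1.0]])
imgs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([0, 1, 0])
print(zero_shot_top1(imgs, text, labels))  # → 1.0
```

A ~1.5% gap at this scale can plausibly come from data ordering, augmentation seeds, or prompt-template differences, which is why comparing against the authors' logs is useful.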

xhl-video commented 1 year ago

Hi, thanks for your interest in our work. I just uploaded all the logs and configs to google_drive. To clarify, the 512M samples are selected randomly. We have a detailed ablation on the number of seen samples (Table 10) in our paper. Our high-resolution (336) model is fine-tuned after the 224 fine-tuning, with 128M seen samples.

rex-yue-wu commented 1 year ago

For those interested in this thread, I've shared my plots based on Xianhang's training logs.

rex-yue-wu commented 1 year ago

@xhl-video Could you please also share the training logs for the BigG models?