alipay / Ant-Multi-Modal-Framework

Research Code for Multimodal-Cognition Team in Ant Group
Creative Commons Attribution 4.0 International

Zero-shot accuracy on ImageNet (in the CLIP setting) is lower than the number reported in the paper #6

Open shyammarjit opened 3 months ago

shyammarjit commented 3 months ago

Zero-shot accuracy on ImageNet (in the CLIP setting)

Top-1 accuracy: 77.15
Top-5 accuracy: 95.51

The Top-1 accuracy on ImageNet reported in the paper is 88.5.
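For reference, here is a minimal sketch of how Top-k zero-shot accuracy is usually computed in the CLIP setting: each image is scored against one text prompt per class, and a prediction counts as correct if the true class is among the k highest-scoring prompts. The function name `topk_accuracy` and the toy similarity scores are hypothetical and only illustrate the metric; this is not the repository's evaluation code.

```python
from typing import Sequence


def topk_accuracy(
    similarity: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Percentage of images whose true class is among the k best-scoring classes."""
    hits = 0
    for scores, label in zip(similarity, labels):
        # Class indices sorted by image-text similarity, highest first.
        topk = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        hits += label in topk
    return 100.0 * hits / len(labels)


# Toy data (hypothetical): 3 images scored against 4 class prompts.
similarity = [
    [0.9, 0.1, 0.0, 0.0],  # best class: 0
    [0.2, 0.3, 0.4, 0.1],  # best class: 2, runner-up: 1
    [0.1, 0.2, 0.3, 0.4],  # best class: 3, runner-up: 2
]
labels = [0, 1, 0]

top1 = topk_accuracy(similarity, labels, k=1)
top2 = topk_accuracy(similarity, labels, k=2)
```

In a real evaluation the similarity rows would come from the model's image and text embeddings, so differences in prompt templates, preprocessing, or checkpoint can all shift the resulting numbers.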

209ye commented 3 months ago

The 88.5 accuracy mentioned in the paper should be for the 10B model, which does not seem to have been released yet. For the published 0.4B model, the paper reports a Top-1 accuracy of 78.5.

shyammarjit commented 1 month ago

How do I load the 10B model? Is it open-sourced yet?