Sense-GVT / DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Different results in "Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision" and "Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm" #22

Open Kaicheng-Yang0828 opened 1 year ago

Kaicheng-Yang0828 commented 1 year ago

In the paper "Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm", CLIP and DeCLIP trained on the YFCC_V2 dataset reach 31.3 and 41.9 zero-shot top-1 accuracy on ImageNet, but the paper "Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision" reports 37.3 and 44.4 for the same setting. What accounts for the difference between them?
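For context on what these numbers measure: CLIP-style zero-shot ImageNet accuracy is typically computed by embedding each class prompt (e.g. "a photo of a {class}") with the text encoder, embedding each test image with the image encoder, and taking the class whose text embedding has the highest cosine similarity. Differences in prompt templates, class-name lists, or embedding ensembling are common sources of the kind of gap asked about here. A minimal sketch of the scoring step, using synthetic feature arrays in place of real encoder outputs (all names here are illustrative, not from either paper's code):

```python
import numpy as np

def zero_shot_top1(image_feats: np.ndarray, text_feats: np.ndarray,
                   labels: np.ndarray) -> float:
    """Top-1 zero-shot accuracy from precomputed embeddings.

    image_feats: (N, D) image embeddings, one per test image
    text_feats:  (C, D) text embeddings, one per class prompt
    labels:      (N,) ground-truth class indices
    """
    # L2-normalize so the dot product equals cosine similarity
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T                  # (N, C) similarity scores
    preds = logits.argmax(axis=1)         # predicted class per image
    return float((preds == labels).mean())

# Toy example: 3 classes, image features aligned with their class prompts
text_feats = np.eye(3)
image_feats = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.8, 0.1],
                        [0.0, 0.2, 0.9]])
acc = zero_shot_top1(image_feats, text_feats, np.array([0, 1, 2]))
print(acc)  # 1.0 on this toy data
```

Even with identical checkpoints, changing the prompt set that produces `text_feats` changes the reported accuracy, which is one plausible reason the two papers differ.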