BAAI-WuDao / BriVL

Bridging Vision and Language Model
MIT License

The top-1 accuracy on ImageNet-1k and recall on AIC-ICC #8

Open shenfalong opened 2 years ago

shenfalong commented 2 years ago

I have tested the model on the ImageNet-1k val set in a zero-shot setting, with the labels translated to Chinese. However, the top-1 accuracy is only around 25%. For comparison, the figure for CLIP is 65%. On AIC-ICC, the text-to-image recall@10 is 13%, which is also far from the figure in the BriVL paper (~40%). Could the authors provide some reference results on these two datasets so that I can verify my numbers?
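For anyone comparing numbers, here is a minimal sketch of how the two metrics mentioned above are commonly computed from a similarity matrix. It assumes the similarities between image and text embeddings have already been computed by the model; the function names `top1_accuracy` and `text2image_recall_at_k` are hypothetical and not taken from the BriVL code, and the authors' evaluation protocol may differ.

```python
from typing import Sequence


def top1_accuracy(sim: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Zero-shot top-1 accuracy.

    sim[i][c] is the similarity between image i and the text embedding
    of class c (assumed precomputed); labels[i] is the true class index.
    """
    correct = 0
    for row, label in zip(sim, labels):
        # Predict the class whose text embedding is most similar to the image.
        pred = max(range(len(row)), key=row.__getitem__)
        correct += int(pred == label)
    return correct / len(labels)


def text2image_recall_at_k(
    sim: Sequence[Sequence[float]], gt_image: Sequence[int], k: int
) -> float:
    """Text-to-image recall@k.

    sim[t][i] is the similarity between text query t and image i;
    gt_image[t] is the index of the image that matches query t.
    """
    hits = 0
    for row, gt in zip(sim, gt_image):
        # Rank all images by similarity to the query, best first.
        ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        hits += int(gt in ranked[:k])
    return hits / len(gt_image)
```

Differences in details such as one ground-truth image per caption versus several, or the size of the retrieval pool, can shift recall noticeably, so they are worth stating alongside any reported number.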

SCZwangxiao commented 2 years ago

I am about to reproduce the results as well. Could you tell me which prompt you used?

shenfalong commented 2 years ago

I used the prompt '这是一张关于{class_name}的图' ("This is a picture about {class_name}"). It improves accuracy by about 2%, but that certainly cannot close the gap between 20% and 60%.
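As a rough illustration of how such a template is applied, the sketch below fills the prompt with each translated class name before the text is encoded. The helper name `build_prompts` and the example class names are hypothetical; only the template string comes from this thread.

```python
# Prompt template quoted in this thread; {class_name} is the Chinese label.
PROMPT_TEMPLATE = "这是一张关于{class_name}的图"


def build_prompts(class_names):
    """Return one prompt string per (already translated) class name."""
    return [PROMPT_TEMPLATE.format(class_name=name) for name in class_names]


# Example with two illustrative class names ("cat", "dog").
prompts = build_prompts(["猫", "狗"])
```

The resulting strings would then be passed to the text encoder, with one text embedding per class used for zero-shot classification.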