I have tested the model on the ImageNet-1k validation set in the zero-shot setting, with the labels translated to Chinese. However, the top-1 accuracy is only around 25%. For comparison, the figure for CLIP is 65%.
On AIC-ICC, the text-to-image recall@10 is 13%, which is also far below the figure reported in the BriVL paper (~40%).
Could the authors provide reference results so that I can verify my numbers on these two datasets?
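To make the comparison concrete, here is a minimal sketch of how these two metrics are commonly computed from precomputed embeddings. It assumes cosine similarity over L2-normalized features; the function names and the embedding arrays are hypothetical and are not taken from the released code, so the official evaluation may differ (for example in prompt templates or in how multiple captions per image are handled).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so a dot product equals cosine similarity.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def zero_shot_top1(image_emb, class_text_emb, labels):
    # image_emb: (N, D) image features; class_text_emb: (C, D), one text
    # feature per (translated) class name; labels: (N,) ground-truth class ids.
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    preds = sims.argmax(axis=1)
    return float((preds == labels).mean())

def text2image_recall_at_k(text_emb, image_emb, gt_image_idx, k=10):
    # text_emb: (Q, D) query captions; image_emb: (M, D) gallery images;
    # gt_image_idx: (Q,) index of the single matching image for each caption.
    sims = l2_normalize(text_emb) @ l2_normalize(image_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == gt_image_idx[:, None]).any(axis=1)
    return float(hits.mean())
```

If the reference numbers were obtained with a different protocol (another similarity measure, a smaller gallery, or a different label translation), knowing that would already explain part of the gap.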