xlm-roberta-large-ViT-H-14 results on the Flickr30K-CN test set:
{"dataset": "flickr30k", "model": "xlm-roberta-large-ViT-H-14",
"pretrained": "frozen_laion5b_s13b_b90k",
"task": "zeroshot_retrieval",
"metrics": {
"image_retrieval_recall@5": 0.738547682762146,
"text_retrieval_recall@5": 0.9359999895095825
},
"language": "zh"
}
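
For context, the model/pretrained names in this JSON are open_clip identifiers, so the checkpoint under test can be loaded directly. A minimal sketch of encoding a Chinese caption with it (the example sentence is my own, not from the benchmark):

```python
import torch
import open_clip

# Load the exact model/checkpoint pair reported in the JSON above.
model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-large-ViT-H-14", pretrained="frozen_laion5b_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-large-ViT-H-14")
model.eval()

# Encode a Chinese caption, as the zh retrieval task does for every
# Flickr30K-CN annotation before computing recall@k.
with torch.no_grad():
    tokens = tokenizer(["一个男人在骑马"])
    text_features = model.encode_text(tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)
```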
Chinese-CLIP (https://github.com/OFA-Sys/Chinese-CLIP/tree/master) reports 0.914 for R@5 image retrieval and 0.975 for R@5 text retrieval.
I think this result for xlm-roberta-large-ViT-H-14 may be wrong, because the Flickr30K-CN annotation file used in the code contains part-of-speech tags for the Chinese words (such as v for verb, n for noun), and these tags are not stripped in the dataloader.
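
To make this concrete: assuming the annotation file stores each caption as space-separated word/POS pairs, e.g. 一个/m 男人/n 在/p 骑马/v (the exact tagset and layout depend on the file), a minimal cleanup sketch would be:

```python
import re

# Matches a "/POS" suffix (e.g. /v, /n, /m) at the end of a token.
# The lowercase-letter tagset is an assumption about the file format.
_POS_TAG = re.compile(r"/[a-z]+(?=\s|$)")

def strip_pos_tags(caption: str) -> str:
    """Drop the POS tags, then the segmentation spaces (Chinese text is
    normally written unsegmented; XLM-R's tokenizer copes either way)."""
    return _POS_TAG.sub("", caption).replace(" ", "")

print(strip_pos_tags("一个/m 男人/n 在/p 骑马/v"))  # -> 一个男人在骑马
```

Left uncleaned, every caption carries the literal /v, /n, ... substrings into the tokenizer, which would plausibly depress recall in the way the numbers above suggest.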
In my experiments, the results are:
I used this annotation file.
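
If the POS tags are indeed the cause, the cleanup could be applied in the dataloader before tokenization. This is a sketch only, not CLIP Benchmark's actual code: the wrapper name and the (image, list-of-captions) sample layout are assumptions.

```python
from typing import Callable
from torch.utils.data import Dataset

class CaptionCleaningDataset(Dataset):
    """Hypothetical wrapper that applies a caption-cleaning function
    (e.g. the strip_pos_tags sketch above) to a retrieval dataset
    assumed to yield (image, list_of_captions) samples."""

    def __init__(self, base: Dataset, clean_fn: Callable[[str], str]):
        self.base = base
        self.clean_fn = clean_fn

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, idx):
        image, captions = self.base[idx]
        return image, [self.clean_fn(c) for c in captions]

# Usage (names hypothetical): CaptionCleaningDataset(flickr30k_cn, strip_pos_tags)
```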