LAION-AI / CLIP_benchmark

CLIP-like model evaluation
MIT License

add chinese flickr30k and flickr8k from https://github.com/li-xirong/cross-lingual-cap @yangapku #101

Closed: mehdidc closed 1 year ago

mehdidc commented 1 year ago

Issue #33

mehdidc commented 1 year ago

xlm-roberta-large-ViT-H-14 results on the test set:

{"dataset": "flickr30k", "model": "xlm-roberta-large-ViT-H-14", 
"pretrained": "frozen_laion5b_s13b_b90k", 
"task": "zeroshot_retrieval",
 "metrics": {
"image_retrieval_recall@5": 0.738547682762146, 
"text_retrieval_recall@5": 0.9359999895095825
},
 "language": "zh"
}
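For context, recall@5 counts a query as correct if a ground-truth match appears among the top-5 retrieved candidates. Below is a minimal numpy sketch of that computation; it is illustrative, not necessarily the exact CLIP_benchmark implementation, and the array names are my own:

```python
import numpy as np

def recall_at_k(sim, positives, k=5):
    # sim: (num_queries, num_candidates) similarity matrix
    # positives: boolean mask of the same shape, True for ground-truth pairs
    topk = np.argsort(-sim, axis=1)[:, :k]       # indices of the top-k candidates
    rows = np.arange(sim.shape[0])[:, None]
    hits = positives[rows, topk].any(axis=1)     # any positive within the top-k?
    return hits.mean()

# image_retrieval_recall@5: queries are captions, candidates are images.
# text_retrieval_recall@5: queries are images, candidates are captions
# (each flickr30k image has several reference captions, so any hit counts).
```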

Chinese CLIP (https://github.com/OFA-Sys/Chinese-CLIP/tree/master) reports 0.914 R@5 for image retrieval and 0.975 R@5 for text retrieval.

czczup commented 1 year ago

I think this result for xlm-roberta-large-ViT-H-14 may be wrong, because the Flickr-CN annotation file used by the code contains part-of-speech tags for the Chinese words (e.g. v for verbs, n for nouns), and these are not handled in the dataloader.
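If that is the problem, the tags could be stripped before the captions are tokenized. A rough Python sketch, assuming the file uses space-separated "word:pos" tokens; the example line and the exact format are guesses based on the description above, not verified against the actual file:

```python
import re

# Hypothetical line from the Flickr-CN annotation file (format assumed).
line = "1000092795.jpg#zh#0 两个:m 年轻:a 人:n 在:p 街上:n 行走:v"

def strip_pos_tags(caption):
    # Drop the ":<pos>" suffix from each token and join into plain text.
    return "".join(re.sub(r":[a-zA-Z]+$", "", tok) for tok in caption.split())

key, caption = line.split(" ", 1)
print(strip_pos_tags(caption))  # -> 两个年轻人在街上行走
```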

In my experiments, the results are:

[image: results table]
czczup commented 1 year ago

I used this annotation file:

flickr30k_cn_test.txt

[image: preview of the annotation file]