LAION-AI / CLIP_benchmark

CLIP-like model evaluation
MIT License

Multilingual benchmark datasets #113

Closed · visheratin closed this 1 year ago

visheratin commented 1 year ago

Hi!

When working on the paper about the NLLB-CLIP models, I benchmarked the models on multiple multilingual datasets. I thought it would be a good idea to add these benchmarks to this repo so that other multilingual models can be tested on them as well.

The changes in this PR:

  1. Added French and German to the multilingual MS COCO dataset to make it the full XTD10 dataset. Also changed the image-extraction logic so that the full train and validation sets are no longer downloaded just to get 1,000 images.
  2. Added the Crossmodal-3600 dataset - 3,600 images with captions in 36 languages, including five low-resource ones.
  3. Added the Flickr30k-200 dataset - 1,000 captions from the Flickr30k dataset translated into 200 languages with the NLLB-3.3B model. Introduced in the NLLB-CLIP paper.
  4. Added the XTD200 dataset - captions from the XTD10 dataset translated into 200 languages with the NLLB-3.3B model. Introduced in the NLLB-CLIP paper.
  5. Added the set_language method to set the target language for the NLLB tokenizer (see the sketch after this list).
  6. Enabled building CSV files from directories.
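
To make point 5 concrete, the flow looks roughly like the sketch below. It is illustrative only: it assumes `set_language` ends up exposed on the tokenizer returned by `open_clip.get_tokenizer` and that the `v1` pretrained tag is available for `nllb-clip-base`; the exact call site in the merged code may differ.

```python
# Illustrative sketch, not the exact code in this PR: switch the NLLB
# tokenizer's target language before encoding captions in a new language.
# Assumes set_language() is available on the tokenizer object and that the
# "v1" pretrained tag exists for nllb-clip-base.
import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    "nllb-clip-base", pretrained="v1"
)
tokenizer = open_clip.get_tokenizer("nllb-clip-base")

tokenizer.set_language("spa_Latn")  # FLORES-200 code for Spanish (assumed signature)
captions = ["una foto de un perro", "una foto de un gato"]

with torch.no_grad():
    text_features = model.encode_text(tokenizer(captions))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```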

@mehdidc @JeniaJitsev

visheratin commented 1 year ago

Also, this PR depends on the PR in the OpenCLIP repo because we need to change how the text is tokenized to properly support the NLLB tokenizer.
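
For context on why the tokenization path has to change: the NLLB tokenizer encodes a language-specific token, so the active language must be set before encoding. A rough illustration with the Hugging Face NLLB tokenizer (the actual integration in OpenCLIP differs):

```python
# Illustration only: the Hugging Face NLLB tokenizer encodes a
# language-specific token controlled by src_lang, so a plain tokenize(text)
# call with no language setting cannot handle multilingual captions correctly.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

tok.src_lang = "eng_Latn"
print(tok("a photo of a dog").input_ids)

tok.src_lang = "fra_Latn"
print(tok("une photo d'un chien").input_ids)
# The two encodings carry different language tokens, which is what the
# set_language hook controls during evaluation.
```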

visheratin commented 1 year ago

The OpenCLIP PR is merged. @mehdidc could you please check the PR?

mehdidc commented 1 year ago

Checking now, thanks a lot @visheratin for the PR!

mehdidc commented 1 year ago

@visheratin Many thanks, all the datasets work fine!

I am trying to reproduce the numbers in your paper, but I see some gaps. For Table 3, NLLB-CLIP Base, I was using the following:

```
clip_benchmark eval --pretrained v1 --model nllb-clip-base --dataset multilingual_mscoco_captions --language en fr zh de es it tr ru ko pl jp --recall_k 1 5 10 --output "benchmark_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```
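
To aggregate the per-language JSON files produced by the output template above into one table, I used something along these lines (the `language` and `metrics` field names are what I expect in the JSON output; adjust them if the format differs):

```python
# Collect the per-language result files written by the command above into a
# single table. The "language" and "metrics" keys are assumed to be present
# in each JSON file; adjust the field names if the output format differs.
import glob
import json

import pandas as pd

rows = []
for path in glob.glob("benchmark_multilingual_mscoco_captions_*.json"):
    with open(path) as f:
        result = json.load(f)
    rows.append(
        {
            "language": result["language"],
            **{k: v for k, v in result["metrics"].items() if "image_retrieval_recall" in k},
        }
    )

print(pd.DataFrame(rows).sort_values("language").to_string(index=False))
```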

I got the following results:

| language | image_retrieval_recall@1 | image_retrieval_recall@5 | image_retrieval_recall@10 |
|----------|--------------------------|--------------------------|---------------------------|
| en | 47.2 | 80.6 | 88.3 |
| fr | 45.0 | 76.9 | 85.8 |
| zh | 41.1 | 71.7 | 83.0 |
| de | 43.3 | 74.8 | 84.7 |
| es | 44.1 | 75.9 | 85.4 |
| it | 44.7 | 75.9 | 85.7 |
| tr | 41.3 | 73.4 | 84.3 |
| ru | 40.6 | 71.0 | 82.8 |
| ko | 39.4 | 70.9 | 81.2 |
| pl | 45.5 | 76.1 | 87.1 |
| jp | 37.9 | 65.2 | 77.8 |

Is there anything I am missing?

visheratin commented 1 year ago

@mehdidc The changes from the PR in OpenCLIP haven't been released yet, so the tokenizer doesn't work correctly. You need to install OpenCLIP from the main branch. If you run the benchmark on the nllb-clip-base-siglip model, the results should match the numbers in this CSV.

Regarding the numbers from the paper, I ran those tests on a slightly different version of the model, so the numbers for nllb-clip-base won't precisely match the numbers in the paper, although they should be quite close.

mehdidc commented 1 year ago

@visheratin I was actually using the main branch; I will rerun with the SigLIP version to compare with the CSV, thank you.

visheratin commented 1 year ago

@mehdidc Looks like the tokenizer works fine. The difference is because the models are not exactly the same. For some languages, the model from the paper is better; for others, the OpenCLIP version is better.

I will run the CLIP benchmark on all models from OpenCLIP this week so we will have a full picture for multilingual retrieval.

mehdidc commented 1 year ago

> I will run the CLIP benchmark on all models from OpenCLIP this week so we will have a full picture for multilingual retrieval.

Cool, sounds good!

I ran the SigLIP version of the base model and it matches the CSV.

| language | image_retrieval_recall@1 | image_retrieval_recall@5 | image_retrieval_recall@10 |
|----------|--------------------------|--------------------------|---------------------------|
| en | 68.5 | 88.8 | 95.2 |
| fr | 64.8 | 86.9 | 93.5 |
| zh | 58.4 | 84.7 | 92.4 |
| de | 62.8 | 87.0 | 93.5 |
| es | 65.2 | 86.7 | 93.9 |
| it | 64.3 | 87.9 | 94.4 |
| tr | 63.7 | 86.8 | 94.3 |
| ru | 60.7 | 82.4 | 88.6 |
| ko | 59.9 | 83.6 | 92.2 |
| pl | 65.4 | 87.6 | 94.4 |
| jp | 54.6 | 80.8 | 87.8 |

mehdidc commented 1 year ago

Merging.