Also, this PR depends on the PR in the OpenCLIP repo because we need to change how the text is tokenized to properly support the NLLB tokenizer.
The OpenCLIP PR is merged. @mehdidc could you please check the PR?
Checking now, thanks a lot @visheratin for the PR!
@visheratin Many thanks, all the datasets work fine!
I am trying to reproduce the numbers in your paper, but I see some gaps. For Table 3, NLLB-CLIP Base, I was using the following:
```
clip_benchmark eval --pretrained v1 --model nllb-clip-base --dataset multilingual_mscoco_captions --language en fr zh de es it tr ru ko pl jp --recall_k 1 5 10 --output "benchmark_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```
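For reference, a minimal sketch of collecting the per-language JSON files written by this command into one table; the layout of the output JSON (top-level `language` and `metrics` keys, scores stored as fractions) is an assumption about the clip_benchmark output format rather than something taken from this PR:

```python
import glob
import json

# Gather the per-language result files produced by the command above.
# Assumption: each file stores a top-level "language" field and a "metrics"
# dict with keys like "image_retrieval_recall@1", with values as fractions.
rows = []
for path in glob.glob("benchmark_multilingual_mscoco_captions_v1_nllb-clip-base_*.json"):
    with open(path) as f:
        result = json.load(f)
    metrics = result["metrics"]
    rows.append((
        result["language"],
        100 * metrics["image_retrieval_recall@1"],
        100 * metrics["image_retrieval_recall@5"],
        100 * metrics["image_retrieval_recall@10"],
    ))

# Print a markdown-style table sorted by language code.
print("| language | R@1 | R@5 | R@10 |")
print("|---|---|---|---|")
for lang, r1, r5, r10 in sorted(rows):
    print(f"| {lang} | {r1:.1f} | {r5:.1f} | {r10:.1f} |")
```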
I got the following results:
| language | image_retrieval_recall@1 | image_retrieval_recall@5 | image_retrieval_recall@10 |
|---|---|---|---|
| en | 47.2 | 80.6 | 88.3 |
| fr | 45.0 | 76.9 | 85.8 |
| zh | 41.1 | 71.7 | 83.0 |
| de | 43.3 | 74.8 | 84.7 |
| es | 44.1 | 75.9 | 85.4 |
| it | 44.7 | 75.9 | 85.7 |
| tr | 41.3 | 73.4 | 84.3 |
| ru | 40.6 | 71.0 | 82.8 |
| ko | 39.4 | 70.9 | 81.2 |
| pl | 45.5 | 76.1 | 87.1 |
| jp | 37.9 | 65.2 | 77.8 |
Is there anything I am missing?
@mehdidc The changes from the PR in OpenCLIP haven't been released yet, so the tokenizer doesn't work correctly. You need to install OpenCLIP from the main branch. If you run the benchmark on the nllb-clip-base-siglip model, the results should match the numbers from this CSV.
Regarding the numbers from the paper, I ran those tests on a slightly different version of the model, so the numbers for nllb-clip-base won't precisely match the numbers in the paper, although they should be quite close.
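For anyone else reproducing this, a minimal sketch of loading the SigLIP variant with OpenCLIP installed from the main branch; the `v1` pretrained tag for the SigLIP model is assumed from the command above rather than checked against the model registry:

```python
# Requires OpenCLIP from the main branch (the NLLB tokenizer changes are not
# in a release yet), e.g.:
#   pip install git+https://github.com/mlfoundations/open_clip.git
import torch
import open_clip

# Model name from the discussion above; the "v1" pretrained tag is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms(
    "nllb-clip-base-siglip", pretrained="v1"
)
tokenizer = open_clip.get_tokenizer("nllb-clip-base-siglip")
model.eval()

with torch.no_grad():
    text = tokenizer(["a photo of a dog", "a photo of a cat"])
    text_features = model.encode_text(text)
    print(text_features.shape)
```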
@visheratin I was actually using the main branch; I will rerun with the SigLIP version to compare against the CSV, thank you.
@mehdidc Looks like the tokenizer works fine. The difference is because the models are not exactly the same. For some languages, the model from the paper is better; for others, the OpenCLIP version is better.
I will run the CLIP benchmark on all models from OpenCLIP this week so we will have a full picture for multilingual retrieval.
Cool, sounds good!
I ran the SigLIP version of the base model, and it matches the CSV.
| language | image_retrieval_recall@1 | image_retrieval_recall@5 | image_retrieval_recall@10 |
|---|---|---|---|
| en | 68.5 | 88.8 | 95.2 |
| fr | 64.8 | 86.9 | 93.5 |
| zh | 58.4 | 84.7 | 92.4 |
| de | 62.8 | 87.0 | 93.5 |
| es | 65.2 | 86.7 | 93.9 |
| it | 64.3 | 87.9 | 94.4 |
| tr | 63.7 | 86.8 | 94.3 |
| ru | 60.7 | 82.4 | 88.6 |
| ko | 59.9 | 83.6 | 92.2 |
| pl | 65.4 | 87.6 | 94.4 |
| jp | 54.6 | 80.8 | 87.8 |
Merging.
Hi!
When working on the paper about NLLB-CLIP models, I benchmarked the models on multiple multilingual datasets. I thought it would be a good idea to add these benchmarks to this repo to be able to test other multilingual models.
The changes in this PR:

- The multilingual datasets I used for benchmarking in the paper.
- A `set_language` method to set the target language for the NLLB tokenizer (see the sketch below).

@mehdidc @JeniaJitsev
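For illustration, a minimal sketch of how `set_language` is meant to be used when switching the evaluation language; that it is exposed on the tokenizer returned by `open_clip.get_tokenizer`, and the use of the FLORES-200 code `fra_Latn` for French, are assumptions for the sketch rather than part of this PR's diff:

```python
import open_clip

tokenizer = open_clip.get_tokenizer("nllb-clip-base-siglip")

# Switch the NLLB tokenizer to the target language before tokenizing captions.
# "fra_Latn" is the FLORES-200 code for French; mapping the benchmark's
# two-letter language flags (en, fr, ...) to these codes is assumed to happen
# in the dataset/benchmark code.
tokenizer.set_language("fra_Latn")

tokens = tokenizer(["une photo d'un chien"])
print(tokens.shape)
```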