Open kimihailv opened 2 years ago
Hey,
Now that you mention it, it looks like XTD uses images from the MSCOCO train set for its translated captions. Which, in my humble opinion, is a rather weird decision, at least when there's still data from val+test that they have not used. So yes, there seems to be data leakage in our evaluation.
We're currently working on creating a better evaluation system at CLIP_BENCHMARK, and we are working towards creating some multilingual evaluations.
The evaluations at this repo should be updated when such evaluations are available.
How did you evaluate Table 1 in the original paper ('Cross-lingual and Multilingual CLIP')? Was the space of retrievable images the 1k images from the XTD-10 dataset? I ask because there is no intersection between the images of that dataset and the MSCOCO 2014 test set.
Hello! According to the XTD-10 repo, the test set contains 800 images from the MSCOCO train set. During training you also use the MSCOCO train set, so it seems you have a data leak. Or maybe I'm misunderstanding something.
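For anyone who wants to check the overlap themselves, here is a minimal sketch of how it could be measured. It assumes both image lists are available as plain file names in the MSCOCO 2014 style (for example `COCO_train2014_000000000025.jpg`). The helper names `coco_split` and `leakage_report` and the toy file names are made up for illustration and do not come from this repo or from XTD-10.

```python
from collections import Counter


def coco_split(name: str) -> str:
    """Return the split token of a COCO 2014-style file name, e.g. 'train2014'.

    Assumes names look like 'COCO_<split>_<id>.jpg'; anything else is 'unknown'.
    """
    parts = name.strip().split("_")
    return parts[1] if len(parts) >= 3 else "unknown"


def leakage_report(train_names, test_names):
    """Compare two collections of image file names and report the overlap."""
    train = {n.strip() for n in train_names if n.strip()}
    test = {n.strip() for n in test_names if n.strip()}
    leaked = sorted(test & train)  # test images that were also seen in training
    return {
        "n_test": len(test),
        "n_leaked": len(leaked),
        "leaked": leaked,
        # How many test images come from each COCO split, judged by file name.
        "test_by_split": Counter(coco_split(n) for n in test),
    }


# Toy data standing in for the real lists (hypothetical file names).
train_names = [
    "COCO_train2014_000000000009.jpg",
    "COCO_train2014_000000000025.jpg",
]
test_names = [
    "COCO_train2014_000000000025.jpg",
    "COCO_val2014_000000000042.jpg",
]

report = leakage_report(train_names, test_names)
print(report["n_leaked"], "of", report["n_test"], "test images also appear in training")
print(dict(report["test_by_split"]))
```

With the real lists, the two inputs would simply be the lines of the training image list and of the XTD-10 test image list. A non-zero `n_leaked` would confirm the leak discussed here.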