dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

About MS-COCO pre-training dataset #69

Open 4fee8fea opened 2 years ago

4fee8fea commented 2 years ago

Hi, @dandelin

Thanks for your great work and for making it public.

I have followed the link in DATA.md to download the MS-COCO 2014 train images, 2014 val images, and the Karpathy split.

The number of images and captions I can access is 123,287 and 646,767, respectively, which differs from the 113K images and 567K captions reported in the paper.

May I ask what's the reason for the difference?
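For what it's worth, one way to check where the difference comes from is to tally images and captions per split in the Karpathy annotation file. The sketch below assumes the common Karpathy JSON layout (a top-level `"images"` list where each entry has a `"split"` field and a `"sentences"` list); the function name and the sample data are mine, not from the repo:

```python
import json  # real use: data = json.load(open("dataset_coco.json"))

def count_karpathy_splits(data):
    """Tally (num_images, num_captions) per Karpathy split."""
    counts = {}
    for img in data["images"]:
        split = img["split"]
        imgs, caps = counts.get(split, (0, 0))
        counts[split] = (imgs + 1, caps + len(img["sentences"]))
    return counts

# Tiny synthetic example (not real COCO data):
sample = {
    "images": [
        {"split": "train", "sentences": [{"raw": "a"}, {"raw": "b"}]},
        {"split": "restval", "sentences": [{"raw": "c"}]},
        {"split": "val", "sentences": [{"raw": "d"}]},
    ]
}
print(count_karpathy_splits(sample))
# {'train': (1, 2), 'restval': (1, 1), 'val': (1, 1)}
```

If the pre-training set follows the common convention of using only the `train` and `restval` splits (excluding the 5K `val` and 5K `test` images), the totals should come out close to the paper's 113K/567K rather than the full 123,287/646,767.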

Thanks in advance

S-Moer commented 2 years ago

It seems nobody has answered your question yet; I ran into the same problem.