dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

Question about GCC dataset download #45

Open yr666666 opened 2 years ago

yr666666 commented 2 years ago

root
├── images_train
│   ├── 0000         # First four letters of the image name
│   │   ├── 0000000  # Image Binary
│   │   ├── 0000001
│   │   └── ...
│   ├── 0001
│   │   ├── 0001000
│   │   ├── 0001001
│   │   └── ...

Hello, please forgive my stupid question. I don't understand what you mean by "0000 # First four letters of the image name" and "0000000 # Image Binary" in your DATA.md. Can you explain what the "Image Binary" and the "First four letters of the image name" are? Thanks

dandelin commented 2 years ago

Hi @yr666666

GCC (CC3M) provides the dataset in the form of image URLs and their related captions. Since the original filenames are unordered and come in various formats, I renamed them to an ordered sequence without the extension (like .jpg, .png, ...) during the download. So these renamed "image files (binaries)" have names such as 0000000, 0000001, ..., 2983222, etc.
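As a rough illustration of the renaming described above, here is a minimal sketch. It assumes the names are simply the zero-padded position of each URL in the list. The function name `assign_sequential_names`, the `width` parameter, and the example URLs are hypothetical; this is not the script actually used for the download.

```python
def assign_sequential_names(urls, width=7):
    """Pair each URL with a zero-padded, extension-free name based on its position.

    Assumption: names are the 0-based index padded to `width` digits,
    matching the examples in this thread (0000000, 0000001, ...).
    """
    return [(f"{i:0{width}d}", url) for i, url in enumerate(urls)]


# Hypothetical URLs with different extensions; the new names drop the extension.
urls = ["http://example.com/cat.jpg", "http://example.com/dog.png"]
named = assign_sequential_names(urls)
```

Each downloaded image would then be saved under its new name rather than its original filename.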

Putting all the files in a single directory would slow down disk-related operations. Thus I partitioned them into several directories, each named after the "first four letters of the image name", so that every directory has at most 1000 files.
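The partitioning can be sketched as follows, assuming 7-digit zero-padded names as in the examples above. The helper `partitioned_path` and its parameters are hypothetical, shown only to make the layout concrete:

```python
from pathlib import Path


def partitioned_path(root, split_dir, index, width=7, prefix_len=4):
    """Return the path of an image inside its partition directory.

    Assumption: the directory name is the first `prefix_len` characters of the
    zero-padded name, so the remaining 3 digits allow at most 1000 files each.
    """
    name = f"{index:0{width}d}"
    return Path(root) / split_dir / name[:prefix_len] / name


first = partitioned_path("root", "images_train", 0)
second_dir = partitioned_path("root", "images_train", 1001)
last = partitioned_path("root", "images_train", 2983222)
```

With 7-digit names and a 4-character prefix, images 0001000 through 0001999 all land in directory 0001, which is where the 1000-file limit comes from.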