Question about GCC dataset download

dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

Apache License 2.0


Hi @yr666666

GCC (CC3M) is distributed as a list of image URLs with their associated captions. Because the original filenames are unordered and come in various formats, I renamed the files during the download to a zero-padded, ordered sequence with no extension (.jpg, .png, ...). The renamed image files (binaries) therefore have names such as 0000000, 0000001, ..., 2983222.

Putting all the files in a single directory slows down disk-related operations, so I partitioned them into subdirectories named after the first four characters of the image name. Since the names are seven digits long, each directory holds at most 1000 files.
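
For anyone reproducing this layout, here is a minimal sketch of the naming scheme described above. It is not the script actually used for the download. The function name `partitioned_path` and the root directory `images` are hypothetical, and the 7-digit width is inferred from the example names.

```python
from pathlib import Path


def partitioned_path(index: int, root: str = "images") -> Path:
    """Map a sequential image index to <root>/<first 4 digits>/<7-digit name>.

    `root` is a hypothetical output directory; the 7-digit zero padding is
    inferred from names such as 0000000 and 2983222.
    """
    if not 0 <= index <= 9_999_999:
        raise ValueError("index must fit in seven digits")
    name = f"{index:07d}"      # e.g. 1 -> "0000001", no file extension
    subdir = name[:4]          # first four characters pick the directory
    return Path(root) / subdir / name


if __name__ == "__main__":
    for i in (0, 1, 999, 1000, 2983222):
        print(partitioned_path(i).as_posix())
```

With this scheme, only the last three digits vary inside a directory, which is where the limit of 1000 files per directory comes from.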


Question about GCC dataset download #45