allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
MIT License

Is there a quick way to download raw images? #7

Closed: PhoebusSi closed this issue 1 year ago

PhoebusSi commented 1 year ago

Downloading images directly from various websites is too slow. Do you have any packaged image files available?

jmhessel commented 1 year ago

Hi @PhoebusSi! For legal reasons, we cannot currently provide bulk access to the raw image files. However, 1) we are looking into options, because we understand this makes the dataset harder to use; and 2) we hope to soon release the multi-threaded downloading script that we used to gather many images in a short time.

[ edit: deferring details to @vegb who knows more than me about this step! ]

jmhessel commented 1 year ago

Added, thanks to @vegb: https://github.com/allenai/mmc4/blob/main/scripts/download_images.py
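
For readers wondering what a multi-threaded image download looks like in general, here is a minimal sketch using only the Python standard library. It is not the repository's `download_images.py`, and it makes no claim about how that script works. The names `fetch_bytes` and `download_images`, the `max_workers` default, and the choice to name files by a hash of the URL are all assumptions made for illustration.

```python
import hashlib
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def fetch_bytes(url, timeout=10):
    """Fetch the raw bytes at `url` (hypothetical helper, standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_images(urls, out_dir, fetch=fetch_bytes, max_workers=16):
    """Download `urls` concurrently; return (saved, failed) dicts keyed by URL."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved, failed = {}, {}

    def worker(url):
        data = fetch(url)
        # Assumption: name each file by a hash of its URL to avoid collisions.
        path = out_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
        path.write_bytes(data)
        return path

    # Threads suit this workload because it is dominated by network waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                saved[url] = future.result()
            except Exception as exc:
                # Dead links are common in web-scale data; record and move on.
                failed[url] = exc
    return saved, failed
```

Calling `download_images(list_of_urls, "images/")` would return the saved paths and the per-URL errors. The `fetch` parameter is injectable so that retries, custom headers, or a rate limiter can be swapped in without touching the threading logic.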