allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
MIT License
887 stars 33 forks

feat: add all shards download & unzip script #2

Closed Luodian closed 1 year ago

Luodian commented 1 year ago
  1. feat: add a script that downloads and unzips all shards for no_facev2 and no_face_corev3
  2. update the corresponding README with the command to run it

Written with the help of GPT-4 and checked for validity by luodian (drluodian@gmail.com).

Luodian commented 1 year ago

I have already checked that the scripts can download the zipped mmc4 jsonl files. After that, download_image.py downloads the corresponding images, and the conversion script creates the shard files required for training OpenFlamingo on MMC4.
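
For readers who want a rough picture of the download-and-unzip step described above, here is a minimal Python sketch. It is not the script from this PR. `URL_TEMPLATE`, `shard_urls`, `unzip`, and `download_and_unzip` are hypothetical names, and the URL is a placeholder that would need to be replaced with the real shard URL pattern from the mmc4 README.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical placeholder; substitute the real pattern from the mmc4 README.
URL_TEMPLATE = "https://example.org/mmc4/docs_shard_{shard}.jsonl.zip"


def shard_urls(template: str, num_shards: int) -> list[str]:
    """Expand the template into one URL per shard index, 0..num_shards-1."""
    return [template.format(shard=i) for i in range(num_shards)]


def unzip(zip_path: Path, out_dir: Path) -> list[str]:
    """Extract one archive into out_dir and return its member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()


def download_and_unzip(urls, zip_dir: Path, out_dir: Path) -> list[str]:
    """Download each archive and unzip it, skipping shards that fail."""
    zip_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for url in urls:
        # Name the local archive after the last path component of the URL.
        zip_path = zip_dir / url.rsplit("/", 1)[-1]
        try:
            urlretrieve(url, zip_path)
            extracted.extend(unzip(zip_path, out_dir))
        except (OSError, zipfile.BadZipFile) as err:
            # A shard may be unavailable or corrupt; report it and continue.
            print(f"skipping {url}: {err}")
    return extracted
```

A call such as `download_and_unzip(shard_urls(URL_TEMPLATE, 10), Path("zips"), Path("jsonl"))` would fetch ten shards under these assumptions. The shard count is also a placeholder, and the image download and conversion steps are left to the repository's own scripts.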

jmhessel commented 1 year ago

Thanks @Luodian and @VegB !