allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
MIT License

Any recommended code for converting mmc4 into WebDataset format instead of jsonl format? #5

Closed · roboswell closed 1 year ago

roboswell commented 1 year ago

I noticed that downloading mmc4-ff gives you jsonl files. However, the OpenFlamingo model requires dataset shards in WebDataset format for training. Could you please recommend code for converting the jsonl files into WebDataset shards?

jmhessel commented 1 year ago

cc @anas-awadalla

anas-awadalla commented 1 year ago

Yes will share a script soon :)

anas-awadalla commented 1 year ago

I have added the script here, thank you!

roboswell commented 1 year ago

@anas-awadalla Presently the script you wrote only allows for 2 inputs as arguments (image_shards and doc_shards). Will you be modifying the script soon to allow for CLIP feature shards rather than image_shards? Thanks!

anas-awadalla commented 1 year ago

The CLIP features are not suitable for training Flamingo models, so for now I will be keeping it as is. My suggested workflow would be to download the raw images using this script and then convert those to WebDataset shards.
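For reference, here is a minimal sketch (not the official conversion script) of that second step: packing an mmc4 doc shard plus its downloaded raw images into WebDataset tar shards with the `webdataset` library. The file paths, the image filename convention, and the per-document key layout are assumptions for illustration.

```python
import json
import os

import webdataset as wds  # pip install webdataset

DOCS_JSONL = "docs_shard_0_v2.jsonl"  # hypothetical mmc4 doc shard
IMAGE_DIR = "raw_images/"             # hypothetical dir of downloaded images
OUT_PATTERN = "mmc4-%06d.tar"         # output shard naming pattern

with wds.ShardWriter(OUT_PATTERN, maxcount=1000) as sink:
    with open(DOCS_JSONL) as f:
        for idx, line in enumerate(f):
            doc = json.loads(line)
            # One WebDataset sample per document: the doc json plus its images.
            sample = {
                "__key__": f"{idx:09d}",
                "json": json.dumps(doc).encode("utf-8"),
            }
            for i, img_info in enumerate(doc.get("image_info", [])):
                img_path = os.path.join(IMAGE_DIR, img_info["image_name"])
                if not os.path.exists(img_path):
                    continue  # skip images that failed to download
                with open(img_path, "rb") as img_f:
                    sample[f"{i}.jpg"] = img_f.read()
            sink.write(sample)
```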

roboswell commented 1 year ago

Hi @anas-awadalla, could you help me understand why the CLIP features for mmc4 (downloadable from https://storage.googleapis.com/ai2-jackh-mmc4-public/images/clip_vitl14_shard_{$SHARD}_features.pkl) cannot be used for training, even though they are (I assume) the same CLIP features you used with the OpenFlamingo 9B vision encoder?

anas-awadalla commented 1 year ago

Yep. First, I apologize for the confusion regarding the CLIP embeddings (I think I mentioned in an OpenFlamingo issue that they could be used to train Flamingo models). This was a misunderstanding on my end. What you need to create the image tokens for Flamingo are the patch embeddings from CLIP's vision encoder. However, the embeddings released with mmc4 are each image's projected embedding in the multimodal space.

One thing I want to point out is that we do not train any vision encoder and instead use this pre-trained CLIP model.
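To make the distinction concrete, here is a hedged sketch using Hugging Face `transformers`; the checkpoint name and image path are illustrative assumptions, not anything prescribed in this thread.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPModel, CLIPVisionModel

ckpt = "openai/clip-vit-large-patch14"  # assumed ViT-L/14 checkpoint for illustration
processor = CLIPImageProcessor.from_pretrained(ckpt)
image = Image.open("example.jpg")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Patch embeddings from the vision encoder: the per-patch tokens a
# Flamingo-style model attends to.
vision_encoder = CLIPVisionModel.from_pretrained(ckpt)
with torch.no_grad():
    patch_embeds = vision_encoder(pixel_values).last_hidden_state  # (1, 257, 1024)

# Projected image embedding in the multimodal space: a single pooled vector,
# like the features shipped in the mmc4 .pkl shards.
clip = CLIPModel.from_pretrained(ckpt)
with torch.no_grad():
    image_embed = clip.get_image_features(pixel_values)  # (1, 768)
```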

jmhessel commented 1 year ago

closing this as addressed, feel free to re-open if I'm misreading