allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
MIT License

add download script for fewer_facev2 and fewer_face_corev3 #3

Status: Closed (closed by Luodian 1 year ago)

Luodian commented 1 year ago

Hi Jack and Wanrong, thank you for providing this valuable dataset!

I'm currently using the data to experiment with the OpenFlamingo model and have written a download script for the currently released splits (fewer_faces_v2 and fewer_faces_core_v3). This should help other users quickly download and prepare the dataset.

I've created PR #2, and I hope it contributes positively to the project.

HenryHZY commented 1 year ago

That's great:)

By the way, I also used your script for downloading.

jmhessel commented 1 year ago

Thanks for the contrib! (love the gpt-4 contrib also :-) ). I'll take a closer look at this when I have a chance soon.

Luodian commented 1 year ago

I made a change to the download script because I saw that flamingo's data preprocessing script uses *.zip files.

To make it more compatible, I commented out the unzip step.
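
For readers who want to see what that behavior looks like, here is a minimal sketch, not the actual script from the PR. It assumes the shards are plain files reachable by URL. The function name `download_shards`, its `unzip` flag, and the URL in the usage note are hypothetical. The archives are kept as *.zip by default, and extraction is opt-in.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_shards(urls, out_dir, unzip=False):
    """Download each shard URL into out_dir and return the saved archive paths.

    Hypothetical helper for illustration only; not the script from the PR.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Name the local file after the last path component of the URL.
        target = out_dir / url.rsplit("/", 1)[-1]
        if not target.exists():  # skip shards that were already downloaded
            # Write to a temporary name first so an interrupted download
            # is never mistaken for a complete archive.
            partial = target.with_name(target.name + ".part")
            with urllib.request.urlopen(url) as resp, open(partial, "wb") as fh:
                shutil.copyfileobj(resp, fh)
            partial.replace(target)
        if unzip:
            # Off by default so downstream code can read the *.zip files directly.
            with zipfile.ZipFile(target) as zf:
                zf.extractall(out_dir)
        saved.append(target)
    return saved
```

A call might look like `download_shards(["https://example.com/shard_0.zip"], "mmc4_data")`, where the URL is a placeholder rather than a real shard location. Passing `unzip=True` restores the extraction step for pipelines that expect unpacked files.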

jmhessel commented 1 year ago

(merged! Thanks, @Luodian)