allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
MIT License

Access to discarded higher aspect ratio images #21

Open ganyeshprasanna opened 6 months ago

ganyeshprasanna commented 6 months ago

Hello, first of all, thank you for all the wonderful work!

I was going through the paper and how images are filtered from the URLs in the c4 corpus. Is there any way to get the images that were discarded as banner-like ads, i.e., images with a high or low aspect ratio (>2 or <0.5)? Thanks in advance!
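
To make the request concrete, here is a minimal sketch of the kind of check I mean. It is not the code used to build mmc4: the function name `is_banner_like` is hypothetical, and I am assuming aspect ratio means width divided by height, with the thresholds taken from the numbers above.

```python
def is_banner_like(width: int, height: int,
                   low: float = 0.5, high: float = 2.0) -> bool:
    """Return True if an image's aspect ratio falls outside [low, high].

    Assumption: aspect ratio is width / height. The defaults mirror the
    thresholds quoted above (>2 or <0.5); this is an illustration, not
    the dataset's actual filtering code.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    # Strict comparisons: a ratio of exactly 2 or 0.5 is not flagged.
    return ratio > high or ratio < low


# Hypothetical example sizes: a wide leaderboard ad, a tall skyscraper ad,
# and a square photo.
for w, h in [(728, 90), (160, 600), (512, 512)]:
    print((w, h), is_banner_like(w, h))
```

The images I am asking about are the ones for which a check like this would return `True`.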