allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
MIT License
887 stars 33 forks

some shards cannot be accessed with 404 error #16

Open TobiasLee opened 1 year ago

TobiasLee commented 1 year ago

Hi, thanks for your great project.

I am downloading the shards with the fewer_faces_core_v3 script, but some of them cannot be accessed. The full list is below:

shard_1277.zip shard_3218.zip shard_3267.zip shard_5064.zip shard_5146.zip shard_7119.zip shard_8991.zip shard_9750.zip shard_11899.zip shard_15127.zip shard_15252.zip shard_16996.zip shard_17369.zip shard_17499.zip shard_17818.zip shard_22953.zip

Is there a solution to this problem?

jmhessel commented 1 year ago

Thanks for letting me know!

https://github.com/allenai/mmc4#the-missing-shards-%EF%B8%8F lists some shards that are known to be missing:

3218,3267,5064,5146,7119,8991,9750,11899,15127,15252,16996,17369,17499,17818

Your list overlaps with that one, but not completely (e.g., 1277 is not on it). I will take a look as soon as I can. Are these shards also missing from the other versions of the dataset?
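
The comparison between the two lists can be sketched as a set difference. This is only a minimal illustration, not part of the mmc4 tooling: the helper name `unexpected_missing` is hypothetical, and the shard IDs are copied by hand from the two lists in this thread.

```python
# Shard IDs that returned 404, as reported at the top of this issue.
reported = [
    1277, 3218, 3267, 5064, 5146, 7119, 8991, 9750,
    11899, 15127, 15252, 16996, 17369, 17499, 17818, 22953,
]

# Shard IDs quoted above as known to be missing.
known_missing = {
    3218, 3267, 5064, 5146, 7119, 8991, 9750,
    11899, 15127, 15252, 16996, 17369, 17499, 17818,
}


def unexpected_missing(reported_ids, known_ids):
    """Return the reported shard IDs that are not in the known-missing set, sorted."""
    return sorted(set(reported_ids) - set(known_ids))


new_missing = unexpected_missing(reported, known_missing)
print(new_missing)  # [1277, 22953]
```

Under these assumptions, every known-missing shard appears in the reported list, and the two remaining IDs are the ones that would need a closer look.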

TobiasLee commented 1 year ago

I downloaded two versions: core v3 and v2. Only v3 has the additional missing shards.