allenai / mmc4

MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
MIT License

Some links are unavailable. They are: #1

Closed PhoebusSi closed 1 year ago

PhoebusSi commented 1 year ago

https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_3218_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_3267_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_5064_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_5146_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_7119_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_8991_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_9750_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_11899_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_15127_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_15252_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_16996_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_17369_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_17499_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_17818_v2.jsonl.zip
https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_23099_v2.jsonl.zip

jmhessel commented 1 year ago

Thank you for this list! We are aware that ~0.1% of the shards are missing, but I will make this more explicit in the README. These shards are not officially part of mmc4, i.e., they are not included in the statistics of the corpus :-)
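
For anyone scripting the download, here is a minimal sketch of one way to skip the shards reported in this issue. The URL pattern and shard IDs are copied from the links above. The names `MISSING_SHARDS`, `shard_urls`, and `try_download` are hypothetical helpers, not part of the mmc4 repo, and the assumption that an unavailable shard answers with HTTP 403 or 404 has not been verified here.

```python
import shutil
import urllib.error
import urllib.request

# Shard IDs reported as unavailable in this issue (not an official list;
# the README is the authoritative source).
MISSING_SHARDS = {
    3218, 3267, 5064, 5146, 7119, 8991, 9750, 11899,
    15127, 15252, 16996, 17369, 17499, 17818, 23099,
}

# URL pattern taken from the links posted in this issue.
URL_TEMPLATE = (
    "https://storage.googleapis.com/ai2-jackh-mmc4-public/data/"
    "docs_no_face_shard_{shard}_v2.jsonl.zip"
)


def shard_urls(shard_ids, missing=MISSING_SHARDS):
    """Yield (shard_id, url) for every requested shard not in `missing`."""
    for shard in shard_ids:
        if shard in missing:
            continue
        yield shard, URL_TEMPLATE.format(shard=shard)


def try_download(url, dest_path, timeout=60):
    """Save `url` to `dest_path`; return False if the server says it is absent.

    Assumption: an unavailable shard responds with HTTP 403 or 404.
    Any other HTTP error is re-raised.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            with open(dest_path, "wb") as out:
                shutil.copyfileobj(resp, out)
        return True
    except urllib.error.HTTPError as err:
        if err.code in (403, 404):
            return False
        raise
```

A caller could loop over `shard_urls(range(start, stop))` with whatever shard range the README documents and pass each URL to `try_download`, so that a shard missing from the list above is logged rather than aborting the whole run.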

PhoebusSi commented 1 year ago

Oh, I see. This work is really a great contribution to the research community.

jmhessel commented 1 year ago

Just following up: I clarified this in the README. Thanks again!

https://github.com/allenai/mmc4#the-missing-shards-%EF%B8%8F