pfischer-nvidia opened this issue 1 year ago
Thanks for raising this @pfischer-nvidia! Staleness of web datasets is indeed an issue for image corpora; thank you for flagging and quantifying the missing percentage. This seems a tad higher than the Conceptual Captions corpus. Let me do a bit of thinking about potential solutions that balance training requirements vs. legal/ethical concerns.
(updated the title to be slightly more specific :-) )
Hi @jmhessel,
Same here. About 20% of our images are missing, and the failures are spread across many examples rather than being confined to a few. The requests fail with 404 errors.
Besides, I find that quite a few URLs lead to redirected pages. For example: url=https://www.printworxuk.com/product/roller-banner/ contains one image with raw_url=https://www.printworxuk.com/wp-content/uploads/2017/06/roller-banner.png.
How can we deal with it?
Best Regards, Runpei
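One common way to deal with this is to fetch each raw_url directly, let the client follow redirects, and record permanently dead links (404/410) separately from transient failures that are worth retrying. A minimal sketch, assuming nothing about the mmc4 tooling (the helper names `fetch_image` and `is_permanently_missing` are hypothetical):

```python
import urllib.error
import urllib.request
from typing import Optional


def is_permanently_missing(status: int) -> bool:
    """Treat 404 (Not Found) and 410 (Gone) as dead for good;
    other errors (5xx, 429, ...) may be transient and worth retrying."""
    return status in (404, 410)


def fetch_image(raw_url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Fetch image bytes, following redirects; return None if the link is dead.

    Hypothetical helper, not part of any mmc4 download script.
    """
    req = urllib.request.Request(raw_url, headers={"User-Agent": "image-fetch/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # urlopen follows redirects automatically; resp.url holds
            # the final URL after any redirect chain.
            return resp.read()
    except urllib.error.HTTPError as e:
        if is_permanently_missing(e.code):
            return None
        raise  # transient error -- let the caller decide whether to retry
```

Note that a redirect is not necessarily a failure: as long as the final response is the image bytes, the sample is usable; it only becomes a problem when the redirect lands on an HTML error or placeholder page.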
Btw, there are websites like the Wayback Machine that do crawl and store websites including images. So I don't immediately see an issue with storing and re-distributing those images.
Hi @pfischer-nvidia,
Thanks for your reply.
How do you download the images to your local machine, though? Do you just retrieve each image directly via its URL?
Btw, I tried the Wayback Machine and I get this:
It seems much of the information and many of the images are lost.
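For what it's worth, the Wayback Machine exposes a JSON availability API at archive.org/wayback/available for checking whether a URL has an archived snapshot. A minimal sketch (the parsing below assumes the API's documented response shape, and `check_wayback` / `closest_snapshot` are hypothetical helper names):

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

WAYBACK_API = "https://archive.org/wayback/available?url={}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the URL of the closest archived snapshot, or None if none exists.

    Assumes the documented response shape:
    {"archived_snapshots": {"closest": {"available": true, "url": ...}}}
    """
    snap = payload.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap.get("url")
    return None


def check_wayback(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Query the availability API for page_url; hypothetical helper."""
    api_url = WAYBACK_API.format(urllib.parse.quote(page_url, safe=""))
    with urllib.request.urlopen(api_url, timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Even when a snapshot of the page exists, the embedded images are not always archived alongside it, which matches the lossy results described above.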
My point was rather that it should not be a problem to store and re-distribute images from websites in general. I didn't suggest downloading them from the archive.
Hi @pfischer-nvidia and @RunpeiDong ---
Just quickly getting back to this:

> So I don't immediately see an issue with storing and re-distributing those images.
As previously mentioned, one issue is copyright: I am not a lawyer, but our legal team would not allow us to release raw images due to copyright concerns (I don't know what the Wayback Machine's situation is). Other popular datasets like Conceptual Captions are subject to the same release strategy. The missing ~20% of images is unfortunate, and I really am thinking about what might be a good balance going forward.
Hi @jmhessel,
Thanks for your kind reply! Sure, I understand the legal issues, especially now that AI is developing so rapidly and so many web datasets are used for training.
I really appreciate your hard work, and I am looking forward to your solutions.
Best, Runpei
Originally posted by @jmhessel in https://github.com/allenai/mmc4/issues/10#issuecomment-1550204615
To answer your question: I checked how many docs/samples there are in the originally published jsonl files vs. how many intact docs we were able to extract. For the full dataset (incl. faces), the percentage of missing samples is 16.8%, so quite high. I counted the unique URLs in the original data and the unique URLs in our dataset, which has been filtered for
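The accounting described above (unique URLs in the original jsonl vs. unique URLs in the extracted set) can be sketched roughly as follows; the one-JSON-object-per-line layout and the `url` key name are assumptions for illustration, not mmc4's exact schema:

```python
import json
from typing import Iterable


def missing_rate(original_lines: Iterable[str],
                 extracted_lines: Iterable[str],
                 key: str = "url") -> float:
    """Fraction of unique URLs in the original jsonl release that are
    absent from the extracted jsonl (key name 'url' is an assumption)."""
    original = {json.loads(line)[key] for line in original_lines}
    extracted = {json.loads(line)[key] for line in extracted_lines}
    return len(original - extracted) / len(original)
```

For example, if 5 unique URLs appear in the original files and only 4 of them survive extraction, this reports a missing rate of 0.2; the 16.8% figure above would correspond to that ratio computed over the full release.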