pfischer-nvidia opened this issue 1 year ago
Thanks for raising this @pfischer-nvidia! Staleness of web datasets is indeed an issue for image corpora; thank you for flagging and quantifying the missing percentage. This seems a tad higher than the Conceptual Captions corpus. Let me do a bit of thinking about potential solutions that balance training requirements vs. legal/ethical concerns.
(updated the title to be slightly more specific :-) )
Hi @jmhessel,
Same here. About 20% of our images are missing, and the failures are spread across many examples rather than being confined to a few. The requests fail with 404 errors.
Besides, I find that quite a few URLs lead to redirected pages. For example: url=https://www.printworxuk.com/product/roller-banner/ contains one image with raw_url=https://www.printworxuk.com/wp-content/uploads/2017/06/roller-banner.png.
How can we deal with it?
Best Regards, Runpei
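One common way to deal with this is to fetch each raw_url directly, let the client follow redirects, and record permanently dead links (404/410) separately from transient failures that are worth retrying. A minimal sketch, assuming nothing about the mmc4 tooling (the helper names `fetch_image` and `is_permanently_missing` are hypothetical):

```python
import urllib.error
import urllib.request
from typing import Optional


def is_permanently_missing(status: int) -> bool:
    """Treat 404 (Not Found) and 410 (Gone) as dead for good;
    other errors (5xx, 429, ...) may be transient and worth retrying."""
    return status in (404, 410)


def fetch_image(raw_url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Fetch image bytes, following redirects; return None if the link is dead.

    Hypothetical helper, not part of any mmc4 download script.
    """
    req = urllib.request.Request(raw_url, headers={"User-Agent": "image-fetch/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # urlopen follows redirects automatically; resp.url holds
            # the final URL after any redirect chain.
            return resp.read()
    except urllib.error.HTTPError as e:
        if is_permanently_missing(e.code):
            return None
        raise  # transient error -- let the caller decide whether to retry
```

Note that a redirect is not necessarily a failure: as long as the final response is the image bytes, the sample is usable; it only becomes a problem when the redirect lands on an HTML error or placeholder page.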
Btw, there are websites like the Wayback Machine that do crawl and store websites including images. So I don't immediately see an issue with storing and re-distributing those images.
Hi @pfischer-nvidia,
Thanks for your reply.
How do you download the images to your local machine, though? Do you just retrieve each image directly via its URL?
Btw, I tried the Wayback Machine and I get this:
It seems much of the information and many of the images are lost.
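For what it's worth, the Wayback Machine exposes a JSON availability API at archive.org/wayback/available for checking whether a URL has an archived snapshot. A minimal sketch (the parsing below assumes the API's documented response shape, and `check_wayback` / `closest_snapshot` are hypothetical helper names):

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

WAYBACK_API = "https://archive.org/wayback/available?url={}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the URL of the closest archived snapshot, or None if none exists.

    Assumes the documented response shape:
    {"archived_snapshots": {"closest": {"available": true, "url": ...}}}
    """
    snap = payload.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap.get("url")
    return None


def check_wayback(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Query the availability API for page_url; hypothetical helper."""
    api_url = WAYBACK_API.format(urllib.parse.quote(page_url, safe=""))
    with urllib.request.urlopen(api_url, timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Even when a snapshot of the page exists, the embedded images are not always archived alongside it, which matches the lossy results described above.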
My point was rather that it should not be a problem to store and re-distribute images from websites in general. I didn't suggest downloading them from the archive.
Hi @pfischer-nvidia and @RunpeiDong ---
Just quickly getting back to this:

> So I don't immediately see an issue with storing and re-distributing those images.
As previously mentioned, one issue is copyright: I am not a lawyer, but our legal team would not allow us to release raw images due to copyright concerns (I don't know what the Wayback Machine's situation is). Other popular datasets like Conceptual Captions are subject to the same release strategy. The missing ~20% of images is unfortunate, and I really am thinking about what might be a good balance going forward.
Hi @jmhessel,
Thanks for your kind reply! Sure, I understand the legal issues, especially now that AI is developing so rapidly and so many web datasets are used for training.
I really appreciate your hard work, and I am looking forward to your solutions.
Best, Runpei
Originally posted by @jmhessel in https://github.com/allenai/mmc4/issues/10#issuecomment-1550204615
To answer your question: I checked how many docs/samples there are in the originally published jsonl files vs. how many intact docs we were able to extract. For the full dataset (incl. faces), the percentage of missing samples is 16.8%, so quite high. I counted the unique URLs in the original data and the unique URLs in our dataset, which has been filtered for
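The accounting described above (unique URLs in the original jsonl vs. unique URLs in the extracted set) can be sketched roughly as follows; the one-JSON-object-per-line layout and the `url` key name are assumptions for illustration, not mmc4's exact schema:

```python
import json
from typing import Iterable


def missing_rate(original_lines: Iterable[str],
                 extracted_lines: Iterable[str],
                 key: str = "url") -> float:
    """Fraction of unique URLs in the original jsonl release that are
    absent from the extracted jsonl (key name 'url' is an assumption)."""
    original = {json.loads(line)[key] for line in original_lines}
    extracted = {json.loads(line)[key] for line in extracted_lines}
    return len(original - extracted) / len(original)
```

For example, if 5 unique URLs appear in the original files and only 4 of them survive extraction, this reports a missing rate of 0.2; the 16.8% figure above would correspond to that ratio computed over the full release.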