Some url could not be found anymore (404 Not Found).

zhongyy commented 5 years ago

I found that about 300 images returned "404 Not Found" when I downloaded the first 2000 images in the .csv file. Besides, some links are duplicated. We are interested in your database and are trying to use it in research. Would you like to provide the original images?

shineway14 commented 5 years ago

@fwang91 The URLs return "404". Would you like to provide the original images?

fwang91 commented 5 years ago

I have also found this problem. The URLs were crawled last year, and some of them may have been blocked since then. We cannot directly send the original images to the face recognition community due to copyright issues.

If there is a better way to release the dataset and avoid the copyright issues, please contact me.

HaoLiuHust commented 5 years ago

Many links cannot be retrieved: sometimes 404 Not Found, sometimes a timeout.
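To tell these failure modes apart when re-checking the list, something like the following could work. This is only a minimal sketch using the Python standard library; the function name `classify_url`, the label strings, and the 10-second default timeout are assumptions for illustration and are not part of the dataset or its tooling.

```python
import socket
import urllib.error
import urllib.request


def classify_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return 'ok', 'http_<code>' (e.g. 'http_404'), 'timeout', or 'error'.

    `opener` is injectable (hypothetical parameter) so the function can be
    exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            response.read(1)  # touch the body so a stalled transfer shows up
        return "ok"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        return f"http_{exc.code}"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except urllib.error.URLError as exc:
        # Connection-level failure; a timeout may be wrapped in exc.reason.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    except (OSError, ValueError):
        return "error"
```

Collecting the returned labels per URL (for example with `collections.Counter`) gives separate counts for 404s and timeouts, and the timeouts can then be retried later.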

hustzeyu commented 5 years ago

@fwang91 So, why don't you upload the whole clean dataset (in .jpg) and the corresponding landmarks to Baidu Netdisk (百度网盘)?

fwang91 commented 5 years ago

@hustzeyu We do not hold the copyright to the original images.

noeagles commented 5 years ago

@fwang91 I think it's ok if we only use it for academic purposes?

hustzeyu commented 5 years ago

Hmmm ... we have already used the MS1M dataset for academic purposes, is the other part a problem?

mikeseese commented 5 years ago

Definitely not okay for academic purposes just because. Remember "academic" is another word for "commercial" since universities use research to get grant money, sell licenses, etc. Even if it wasn't recognized as "commercial", IMDb states anything non-personal needs explicit permission.

IMDb Conditions of Use explicitly states:

All content included on this site in or made available through any IMDb Service, such as text, graphics, logos, button icons, images, audio clips, video clips, digital downloads, data compilations, and software, is the property of IMDb or its content suppliers and protected by United States and international copyright laws.

And further:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

And to get consent:

Licensing IMDb Content; Consent to Use Robots and Crawlers: If you are interested in receiving our express written permission to use IMDb content for your non-personal (including commercial) use, please visit our Content Licensing section or contact our Licensing Department. We do allow the limited use of robots and crawlers, such as those from certain search engines, with our express written consent. If you are interested in receiving our express written permission to use robots or crawlers on our site, please contact our Licensing Department.

So that really means that the use of this dataset (as it requires you to make a script to download the images from the scraped URLs) for anything non-personal (including academic and commercial) requires explicit written permission if you want to remain above the radar. Otherwise you're open to a lawsuit from Amazon (which owns IMDb).

In other words, none of us can rehost the images without invoking copyright infringement, and it likely requires more effort than I think is worth for @fwang91 et al to fix the URLs if the URLs just simply changed.

fwang91 commented 5 years ago

@seesemichaelj thanks for your detailed explanation.

hustzeyu commented 5 years ago

Looking forward to the updated URLs.

fwang91 commented 5 years ago

@zhongyy Hi, could you tell me the number of invalid URLs?

zhongyy commented 5 years ago

@ Thanks for your attention. We have downloaded most of the dataset and we are making a list of the invalid URLs. May I email it to you at "cloud9166@gmail.com"?

fwang91 commented 5 years ago

Yes. Thanks a lot.

beszedes commented 5 years ago

Here you can find a list of the URLs that returned a 404 error for me: https://drive.google.com/file/d/0B9JPNVxgMmu6T3FoMWZZUi00TDQ/view?usp=sharing

jensph commented 4 years ago

Unfortunately one year later out of 1,662,888 URLs I was able to download 1,180,173, so there are some 482k invalid URLs... I could generate a list of the invalid URLs, but given that it's almost a third of the total this may no longer be useful.

As an example, most of the Tom_Paolino images are not available. That's subject ID nm0660057.
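As a quick check of the figures above, using only the two counts quoted in this comment (the variable names are just for illustration):

```python
total_urls = 1_662_888   # URLs in the list, as quoted above
downloaded = 1_180_173   # URLs that could still be downloaded

invalid = total_urls - downloaded
invalid_fraction = invalid / total_urls

print(invalid)                    # 482715
print(f"{invalid_fraction:.1%}")  # 29.0%
```

So roughly 29% of the URLs are gone, which matches the "almost a third" estimate.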

fwang91 / IMDb-Face

Some url could not be found anymore (404 Not Found). #3