HoverHell / RedditImageGrab

Downloads images from sub-reddits of reddit.com.
GNU General Public License v3.0

Skip images removed from imgur #50

Closed roperi closed 7 years ago

roperi commented 8 years ago

First of all, this is a great scraper! Thank you all for the great work!

I have noticed that images which no longer exist on imgur are redirected to http://i.imgur.com/removed.png (a 503-byte PNG saved with a .jpg extension) that says "The image you are requesting does not exist or is no longer available". I am getting these placeholder images even when using sort type topday.

I wonder if it's worth having a switch to skip images by file size. Thoughts?

EDIT: I meant skipping images smaller than a few KB in size. But now that I think about it, the idea seems a bit crazy because images would have to be downloaded first. So what about a filter to exclude urls such as "http://i.imgur.com/removed.png"?

rachmadaniHaryono commented 8 years ago

> I wonder if it's worth having a switch to skip images by file size. Thoughts?

If it is the same image, it is better to check by md5 hash.

We can use this remotemd5.

The pattern I know for this kind of removed image is only imgur.com/{image_id}; I don't know if imgur.com/gallery/{id} can also be redirected.

If given a random id such as imgur.com/{random_id}, it returns a 404 message, so we can exclude that pattern.

Do you have an example of a removed imgur url other than imgur.com/removed.png?

> I wonder if it's worth having a switch to skip images by file size. Thoughts?

It can be a feature; if you need it, you can open a new issue.

Just a note for this issue: the md5 of i.imgur.com/removed.png is d835884373f4d6c8f24742ceabe74946
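That md5 check could be sketched like this (a minimal sketch, not the project's actual code; the hash constant is the one noted above, and `filepath` is a hypothetical local file):

```python
import hashlib

# md5 of imgur's "removed" placeholder image, as noted in this thread
REMOVED_PNG_MD5 = "d835884373f4d6c8f24742ceabe74946"

def is_imgur_removed(filepath):
    """Return True if the local file is imgur's 'removed' placeholder."""
    with open(filepath, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return digest == REMOVED_PNG_MD5
```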

roperi commented 8 years ago

@rachmadaniHaryono

Thanks so much.

roperi commented 8 years ago

But should I close this issue or wait?

rachmadaniHaryono commented 8 years ago

Technically it's solved; you can close it.

But from what I've seen, what often happens is: the maintainer accepts the pull request and asks for confirmation, the issue creator checks whether the issue is solved, and only after everything is checked and confirmed is the issue closed.

IMO just wait until the next update, so this issue can serve as a reminder for @HoverHell of which pull requests to accept for the next update.

Another thing you could do is ask @mfabinski or @jtara1 (py3) to create a branch to be merged with this one.

I can also do that, but it will take time because I'm still working on #53.

jtara1 commented 8 years ago

I created a hook function to check whether a downloaded image is that specific imgur DNE image; however, I don't think my solution works, and it's difficult to debug. Here's what I did to try to solve the problem.

Maybe another solution would be to check & compare the first 503 bytes of each file (recursively) after they are downloaded.

My fork of RedditImageGrab has a few issues which I'll update in the readme in a second.

It's good to see there exists an imgur link to the DNE image specifically; imgur hasn't allowed me to upload the DNE image to their site. @rachmadaniHaryono Thanks a bunch for mentioning me here, this allowed me to debug my program.


Update: My method for dealing with this is to compare the bytes of two local files to check whether the test file is the imgur DNE image.

Sometime later I'll check to see if it's more time efficient to compare md5 hashes or compare the bytes of two images.
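That byte-comparison approach could look like this (a minimal sketch under my own naming; `num_bytes=503` matches the placeholder size mentioned above, and both paths are hypothetical local files):

```python
def files_match(path_a, path_b, num_bytes=503):
    """Compare the first num_bytes of two local files byte-for-byte."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        return a.read(num_bytes) == b.read(num_bytes)
```

In practice `path_b` would be a locally stored copy of the DNE placeholder, and each freshly downloaded file would be checked against it.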

Here's a tool for removing that Imgur DNE image from a collection of files in a path: DeleteImgurDNE.


Update 2: Fixed jtara1/imgur-downloader to open the url as a request, read & compare the bytes against a local file to check whether it's the Imgur DNE image, and prevent downloading & saving the file if it is.

This should be the most efficient way to do this.

Note: My fork of RedditImageGrab depends on my fork of imgur-downloader, so it's a fix for both programs.

HoverHell commented 8 years ago

Note that remotemd5 isn't at all remote: it pretty much downloads the whole file.

Redirect-catching might work, but might be too tricky.

The easiest way is to download the image, then check its md5 and, if it's the DNE image, remove it as useless. (Of course, pretty much the same can be done with a simple shell script after the download.)
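That post-download cleanup could be sketched as a directory sweep (a hedged sketch, not the project's code; the hash is the one @rachmadaniHaryono noted earlier, and `download_dir` is a hypothetical path):

```python
import hashlib
import os

# md5 of imgur's "removed" placeholder, as noted earlier in this thread
REMOVED_MD5 = "d835884373f4d6c8f24742ceabe74946"

def purge_removed(download_dir):
    """Delete every file under download_dir whose md5 matches the
    imgur DNE placeholder; return the list of deleted paths."""
    deleted = []
    for dirpath, _, filenames in os.walk(download_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                if hashlib.md5(f.read()).hexdigest() == REMOVED_MD5:
                    os.remove(path)
                    deleted.append(path)
    return deleted
```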

roperi commented 8 years ago

@HoverHell, @jtara1, @rachmadaniHaryono

Since I further process the downloaded files (and probably some of us do the same), I check their integrity using Pillow:

from PIL import Image

try:
    image = Image.open(filename)
    image.verify()
except Exception as e:
    print(e)

I think that if the script downloads the whole file, checking its integrity should be up to us.

HoverHell commented 7 years ago

Actually, there's an even easier solution for this particular issue:

$ curl -v -o /dev/null http://i.imgur.com/2gUGa.jpg
...
... < Location: http://i.imgur.com/removed.png

i.e. it is sufficient to check response.url == 'http://i.imgur.com/removed.png' at some point (probably in download_from_url).