HoverHell / RedditImageGrab

Downloads images from sub-reddits of reddit.com.
GNU General Public License v3.0
311 stars 78 forks source link

Fixed & parsing problem in URLs #67

Closed frozenpandaman closed 8 years ago

frozenpandaman commented 8 years ago

Now correctly downloads images with URLs containing & (i.e. images uploaded using "choose file" option on link submission pages) by changing this text into & (and appending an arbitrary extension to downloaded files). Previous gave HTTP Error 401: Unauthorized.

rachmadaniHaryono commented 8 years ago

do you have example, reddit thread id?

frozenpandaman commented 8 years ago

@rachmadaniHaryono sure, here you go: https://m.reddit.com/r/mildlyinteresting/comments/58j5d2/this_ad_says_that_osteoporosis_has_no_symptoms/

You should be able to find other examples on the front page of nearly any subreddit where pictures are regularly submitted. The URL begins with i.reddituploads.com.

rachmadaniHaryono commented 8 years ago
frozenpandaman commented 8 years ago

@rachmadaniHaryono No, not only for that domain – that's just an example of a very common one found all over reddit (ever since they recently added the "upload image" button, as opposed to users having to upload their images to, say, Imgur). My fix works for any domain/URL with the substring & in it.

I hadn't seen #59 (and can't try it out at the moment). Is it essentially doing the same thing?

rachmadaniHaryono commented 8 years ago

My fix works for any domain/URL with the substring & in it

afaik the only known issue regarding url and ampersand is only this one. if there is any other domain, which needed this change, user can open new issue.

the disadvantage of applying this to global all domain, is that we don't know if there will be error for other domain which actually require & rather than the actual ampersand character

I hadn't seen #59 (and can't try it out at the moment). Is it essentially doing the same thing?

it is just plugin to download reddit upload image (include these domain i.redditmedia.com, i.reddituploads.com, i.redd.it). the default downloader failed, so it require request library

frozenpandaman commented 8 years ago

@rachmadaniHaryono

if there is any other domain, which needed this change, user can open new issue. I don't see why this would be a separate issue given all instances of & are caught by this fix. Any url can have an "&" in it for some reason instead of an "&", and as long as we catch this and account for it we're good. The problem here is that many URLs are not being grabbed correctly due to the fact Reddit has implemented a feature that uploads images and is using these URLs in a manner where many posts on every subreddit are affected.

Why would this not be the case for other domains? We need to translate the & into ^& for all occurrences in any URL – even if some other domain will still correctly display images if this is not done, it doesn't hurt.

See here (one of many sites about this): http://burnignorance.com/web-development-tips/using-ampersands-s-in-urls/

With HTML, the browser translates “&” to “&” so the Web server would only see “&” and not “&” in the query string of the request.

There's no "browser" here to do this conversion, however – we're grabbing the URL directly from the HTML where & characters must be escaped as & – so we need to be translating & to &, as it states.

rachmadaniHaryono commented 8 years ago

Why would this not be the case for other domains? We need to translate the & into ^& for all occurrences in any URL – even if some other domain will still correctly display images if this is not done, it doesn't hurt

It is more 'if it ain't broke don't fix it'.

if this is not done, it doesn't hurt

I'm just cautious of it. Because if something can go wrong it will.

we're grabbing the URL directly from the HTML

Actually the data are from reddit json response not html , see https://github.com/HoverHell/RedditImageGrab/blob/master/redditdownload/reddit.py

So I can't say if that also apply to the response.

But for the final decision, I will leave it to @HoverHell.

Other question : why 'extneeded' default to '.jpg' ?

HoverHell commented 8 years ago

even if some other domain will still correctly display images if this is not done, it doesn't hurt.

It might hurt, because of the

the data are from reddit json response not html

it might as well be considered a bug in the reddit's code if it gives out an incorrect url (especially for its own image domain).

HoverHell commented 8 years ago

Can someone, as an experiment, post a link with literal &, e.g. https://example.com/nowhere/?param=1%26amp%3B2 , to some subreddit and see what the API returns?

HoverHell commented 8 years ago

... did it myself, here's an example (for now) with literal & that was intended: https://www.reddit.com/r/test.json

Here's an url with unintended &s (usually): https://www.reddit.com/r/mildlyinteresting.json

Edit: we might need a whole HTMLParser.HTMLParser().unescape; testing further.

Edit2: yeah, I suppose so.

In [16]: import requests, HTMLParser; url = "https://www.reddit.com/r/test.json"; data = requests.get(url, headers={'User-Agent': 'myself'}).json(); print(data.keys()); post = [post for post in data['data']['children'] if post['data']['id'] == '58mpwu'][0]; post_url = post['data']['url']; post_url, HTMLParser.HTMLParser().unescape(post_url)
[u'kind', u'data']
Out[16]:
(u'https://example.com/nowhere?amp=&__&__lt=<__<__gt=>__>__lpar=(__(__rpar=)__)__comma=,__,__',
 u'https://example.com/nowhere?amp=&__&amp;__lt=<__&lt;__gt=>__&gt;__lpar=(__&lpar;__rpar=)__&rpar;__comma=,__&comma;__')

In [17]: post['data']['title']
Out[17]: u'test 3 of htmlesc: https://example.com/nowhere?amp=&amp;__&amp;amp;__lt=&lt;__&amp;lt;__gt=&gt;__&amp;gt;__lpar=(__&amp;lpar;__rpar=)__&amp;rpar;__comma=,__&amp;comma;__'

so as of now, the url reddit returns is html-escaped (as if it is meant to be inserted in <a href="{url}"> without additional escaping).

Side note: reddit seems to ban python-requests by user-agent, replying with 429 “Too Many Requests”, and no, I don't think that's an intended pun.

rachmadaniHaryono commented 8 years ago

is there reddit thread example where the escaped ampersand is not on url query? most what i see, the escaped ampersand is only on query part of the url, so we don't have to fully replace the &amp; on the full url.

url: "https://i.reddituploads.com/24b35a14058848cc90b2a3b1eea19244?fit=max&amp;h=1536&amp;w=1536&amp;s=889956bb8784dc2d45b98e9a815bc06a",
HoverHell commented 8 years ago

is there reddit thread example where the escaped ampersand is not on url query?

Try making one. It's not considered a good practice, but it is possible. I expect it'll be quoted just the same.

rachmadaniHaryono commented 8 years ago

url input http://example.com/somewher&anywhere?to=a&b

json url https://www.reddit.com/r/test/comments/58n4k7/test_amperssand/.json

dict from json : url: "http://example.com/somewher&amp;anywhere?to=a&amp;b"

HoverHell commented 8 years ago

I wonder if this is intended from reddit after all. It might as well be a side effect of them implementing the mobile version of the site or something like that.

HoverHell commented 8 years ago

Was there an issue somewhere about file extensions in general?

What I'd suggest is, when the extension needs to be fixed, to add a correct extension on top of the url's one (e.g. imgur/abcd.jpg, identified as png, to be renamed to abcd.jpg.png, and before downloading imgur/abcd.jpg, check for existence of glob abcd.jpg*). Still, requires more thinking. Unless we want to implement required metadata files or something (which I'd rather not)

But first I should finish the plugins system.