Closed frozenpandaman closed 8 years ago
do you have example, reddit thread id?
@rachmadaniHaryono sure, here you go: https://m.reddit.com/r/mildlyinteresting/comments/58j5d2/this_ad_says_that_osteoporosis_has_no_symptoms/
You should be able to find other examples on the front page of nearly any subreddit where pictures are regularly submitted. The URL begins with i.reddituploads.com.
redditupload
domain?if it is for other url does it should work for such url (not quite sure if this is valid, but just as precautius)
abc.com/adajvkajvalk&dafkja;ldskfja;/afdkjaldsfkjalkdsjf.jpg
@rachmadaniHaryono No, not only for that domain – that's just an example of a very common one found all over reddit (ever since they recently added the "upload image" button, as opposed to users having to upload their images to, say, Imgur). My fix works for any domain/URL with the substring &
in it.
I hadn't seen #59 (and can't try it out at the moment). Is it essentially doing the same thing?
My fix works for any domain/URL with the substring & in it
afaik the only known issue regarding url and ampersand is only this one. if there is any other domain, which needed this change, user can open new issue.
the disadvantage of applying this to global all domain, is that we don't know if there will be error for other domain which actually require &
rather than the actual ampersand character
I hadn't seen #59 (and can't try it out at the moment). Is it essentially doing the same thing?
it is just plugin to download reddit upload image (include these domain i.redditmedia.com
, i.reddituploads.com
, i.redd.it
). the default downloader failed, so it require request
library
@rachmadaniHaryono
if there is any other domain, which needed this change, user can open new issue. I don't see why this would be a separate issue given all instances of
&
are caught by this fix. Any url can have an "&" in it for some reason instead of an "&", and as long as we catch this and account for it we're good. The problem here is that many URLs are not being grabbed correctly due to the fact Reddit has implemented a feature that uploads images and is using these URLs in a manner where many posts on every subreddit are affected.
Why would this not be the case for other domains? We need to translate the &
into ^&
for all occurrences in any URL – even if some other domain will still correctly display images if this is not done, it doesn't hurt.
See here (one of many sites about this): http://burnignorance.com/web-development-tips/using-ampersands-s-in-urls/
With HTML, the browser translates “&” to “&” so the Web server would only see “&” and not “&” in the query string of the request.
There's no "browser" here to do this conversion, however – we're grabbing the URL directly from the HTML where &
characters must be escaped as &
– so we need to be translating &
to &
, as it states.
Why would this not be the case for other domains? We need to translate the & into ^& for all occurrences in any URL – even if some other domain will still correctly display images if this is not done, it doesn't hurt
It is more 'if it ain't broke don't fix it'.
if this is not done, it doesn't hurt
I'm just cautious of it. Because if something can go wrong it will.
we're grabbing the URL directly from the HTML
Actually the data are from reddit json response not html , see https://github.com/HoverHell/RedditImageGrab/blob/master/redditdownload/reddit.py
So I can't say if that also apply to the response.
But for the final decision, I will leave it to @HoverHell.
Other question : why 'extneeded' default to '.jpg' ?
even if some other domain will still correctly display images if this is not done, it doesn't hurt.
It might hurt, because of the
the data are from reddit json response not html
it might as well be considered a bug in the reddit's code if it gives out an incorrect url (especially for its own image domain).
Can someone, as an experiment, post a link with literal &
, e.g. https://example.com/nowhere/?param=1%26amp%3B2 , to some subreddit and see what the API returns?
... did it myself, here's an example (for now) with literal &
that was intended: https://www.reddit.com/r/test.json
Here's an url with unintended &
s (usually): https://www.reddit.com/r/mildlyinteresting.json
Edit: we might need a whole HTMLParser.HTMLParser().unescape
; testing further.
Edit2: yeah, I suppose so.
In [16]: import requests, HTMLParser; url = "https://www.reddit.com/r/test.json"; data = requests.get(url, headers={'User-Agent': 'myself'}).json(); print(data.keys()); post = [post for post in data['data']['children'] if post['data']['id'] == '58mpwu'][0]; post_url = post['data']['url']; post_url, HTMLParser.HTMLParser().unescape(post_url)
[u'kind', u'data']
Out[16]:
(u'https://example.com/nowhere?amp=&__&amp;__lt=<__&lt;__gt=>__&gt;__lpar=(__&lpar;__rpar=)__&rpar;__comma=,__&comma;__',
u'https://example.com/nowhere?amp=&__&__lt=<__<__gt=>__>__lpar=(__(__rpar=)__)__comma=,__,__')
In [17]: post['data']['title']
Out[17]: u'test 3 of htmlesc: https://example.com/nowhere?amp=&__&amp;__lt=<__&lt;__gt=>__&gt;__lpar=(__&lpar;__rpar=)__&rpar;__comma=,__&comma;__'
so as of now, the url
reddit returns is html-escaped (as if it is meant to be inserted in <a href="{url}">
without additional escaping).
Side note: reddit seems to ban python-requests by user-agent, replying with 429 “Too Many Requests”, and no, I don't think that's an intended pun.
is there reddit thread example where the escaped ampersand is not on url query? most what i see, the escaped ampersand is only on query part of the url, so we don't have to fully replace the &
on the full url.
url: "https://i.reddituploads.com/24b35a14058848cc90b2a3b1eea19244?fit=max&h=1536&w=1536&s=889956bb8784dc2d45b98e9a815bc06a",
is there reddit thread example where the escaped ampersand is not on url query?
Try making one. It's not considered a good practice, but it is possible. I expect it'll be quoted just the same.
url input http://example.com/somewher&anywhere?to=a&b
json url https://www.reddit.com/r/test/comments/58n4k7/test_amperssand/.json
dict from json : url: "http://example.com/somewher&anywhere?to=a&b"
I wonder if this is intended from reddit after all. It might as well be a side effect of them implementing the mobile version of the site or something like that.
Was there an issue somewhere about file extensions in general?
What I'd suggest is, when the extension needs to be fixed, to add a correct extension on top of the url's one (e.g. imgur/abcd.jpg, identified as png, to be renamed to abcd.jpg.png, and before downloading imgur/abcd.jpg, check for existence of glob abcd.jpg*
). Still, requires more thinking. Unless we want to implement required metadata files or something (which I'd rather not)
But first I should finish the plugins system.
Now correctly downloads images with URLs containing
&
(i.e. images uploaded using "choose file" option on link submission pages) by changing this text into&
(and appending an arbitrary extension to downloaded files). Previous gave HTTP Error 401: Unauthorized.