Closed philpem closed 4 years ago
I've just tried to download the image in a browser and got a 404 error. The thumbnails load but not the image.
It seems Yahoo are starting to delete things already...
I avoided this by adding the --no-reattach and --no-save parameters to yahoo.py.
I got a 500 Server Error on some attachments too. After looking up the message, I could manually open the image attachment but I noticed the URL was different from what the script was using. Based on the difference, I tried this modification in yahoogroupsapi.py and it seems to be working so far:
    def get_file(self, url):
        newurl = url.replace("/or/", "/hr/")
        r = self.s.get(newurl)
        r.raise_for_status()
        return r.content
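A slightly more defensive variant of this workaround, as a minimal sketch: try the original URL first and fall back to the "/hr/" form only if that fails. It assumes the "/or/" path segment is the only difference between the failing and working URLs, which may not hold for every group. The names `candidate_urls` and `get_file_with_fallback` are hypothetical and not part of yahoogroupsapi.py; `session` is assumed to behave like a `requests.Session`.

```python
def candidate_urls(url):
    """Return the original URL followed by its "/hr/" form, without duplicates."""
    urls = [url]
    alt = url.replace("/or/", "/hr/")
    if alt != url:
        urls.append(alt)
    return urls


def get_file_with_fallback(session, url):
    """Try each candidate URL in turn and return the first successful body.

    `session` is assumed to expose get(url) returning an object with
    raise_for_status() and .content, as requests.Session does.
    """
    last_error = None
    for candidate in candidate_urls(url):
        try:
            r = session.get(candidate)
            r.raise_for_status()
            return r.content
        except Exception as exc:  # e.g. requests.exceptions.HTTPError
            last_error = exc
    # Every candidate failed: surface the most recent error to the caller.
    raise last_error
```

Keeping the original URL as the first candidate means groups where "/or/" still works are not affected by the rewrite.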
Maybe inkblot14's method was patched? Still getting the following error when trying to download images with "/or/" replaced with "/hr/" in the URL:
requests.exceptions.SSLError: HTTPSConnectionPool(host='xa.yimg.com', port=443): Max retries exceeded with url: /kq/groups/VXRh1NjuftMqZ99a/hr/jC5PKZ_tedbs7qvzMxNZ/name/IMG_0695.jpg?download=1 (Caused by SSLError(SSLEOFError(8, u'EOF occurred in violation of protocol (_ssl.c:590)'),))
Any ideas? The group we're trying to archive is private.
I don't know if it helps, but the way I figured out the substitution was to open the message on the Yahoo Groups web page, open the attachment from there, and compare its URL with the one in the script's error message to see how they differed. I'd try that with your group and see whether the difference is something other than "/or/" versus "/hr/"; maybe it's different for each group.
@codycooperross:
Maybe inkblot14's method was patched? Still getting the following error when trying to download images with "/or/" replaced with "/hr/" in the URL:
requests.exceptions.SSLError: HTTPSConnectionPool(host='xa.yimg.com', port=443): Max retries exceeded with url: /kq/groups/VXRh1NjuftMqZ99a/hr/jC5PKZ_tedbs7qvzMxNZ/name/IMG_0695.jpg?download=1 (Caused by SSLError(SSLEOFError(8, u'EOF occurred in violation of protocol (_ssl.c:590)'),))
Not sure about this one, given that it's an SSL error. Accessing that URL directly with an additional "Referer: http://groups.yahoo.com/" header works for me in the browser. SSL connection issues were potentially being triggered by lack of certificate pinning from the Yahoo server, but I hadn't seen one with that exact SSL error before. It's unrelated to the 500 errors in the OP.
@philpem:
It appears Yahoo are doing some kind of referrer checking.
If a message contains an image attachment, yahoo.py will try to download it. This will result in a 500 Internal Server Error.
... requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://xa.yimg.com/kq/groups/NgEGp_Hue9axOiNJrw--/or/skiVJcbofNa3MYMOEEQ-/name/DSC07465.JPG
Opening the image URL in a browser shows this error message:
We are sorry, you can not display images hosted by Yahoo! Groups on non Yahoo! Groups pages
I'm not able to reproduce the 500 error for this URL. I get the 404 error in the browser, but it goes away if I manually add a Yahoo Groups referer header.
I'd welcome further reports on errors when downloading images, especially if they're reliably triggered.
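For anyone who wants to test the referer theory outside the browser, here is a minimal sketch of sending the same header from Python. The function name `get_file_with_referer` is hypothetical, and `session` is assumed to be a `requests.Session` (or anything with a compatible `get(url, headers=...)`); the header value is the one mentioned above.

```python
YAHOO_GROUPS_REFERER = "http://groups.yahoo.com/"


def get_file_with_referer(session, url, referer=YAHOO_GROUPS_REFERER):
    """Fetch `url` with a Referer header set and return the response body."""
    r = session.get(url, headers={"Referer": referer})
    r.raise_for_status()
    return r.content
```

Whether this avoids the errors for every attachment is untested here; it only reproduces what was done manually in the browser.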
@IgnoredAmbience My fork actually fixes this. The issue is server-side. Yahoo seems to return 500 errors for some files - but if you make the request then retry after a short while, it'll usually work.
I'm also running up against this problem and can confirm that retrying later often helps. Perhaps you could contribute your fix here?
Additionally, whenever I get a 500, the script seems to loop infinitely retrying the image. I would prefer it to skip the image and move on.
2019-10-27 19:24:10.747 EDT ERROR process_single_attachment ERROR downloading 'https://xa.yimg.com/kq/groups/adopXwvteNB1Qsz.aw--/or/qMLAVD3tctUT1.GDCIQj/name/mb86.jpg' variant or: 500 Server Error: Internal Server Error for url: https://xa.yimg.com/kq/groups/adopXwvteNB1Qsz.aw--/or/qMLAVD3tctUT1.GDCIQj/name/mb86.jpg
I think the problem is this line: https://github.com/IgnoredAmbience/yahoo-group-archiver/blob/55bbae30749bf71f514a3f034cab563aa2b4016b/yahoo.py#L178
It doesn't seem to be limited by TRIES.
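To illustrate the behaviour being asked for, here is a minimal sketch of a bounded retry: wait a short while between attempts, and give up on the attachment after a fixed number of tries instead of looping forever. This is not the code from yahoo.py or from the fork mentioned above. The function name `download_with_retries`, the value of `TRIES`, and the delay are all assumptions, and `fetch` stands in for whatever function actually downloads the URL.

```python
import time

TRIES = 5  # assumed value; the real setting in yahoo.py may differ


def download_with_retries(fetch, url, tries=TRIES, delay=2.0, sleep=time.sleep):
    """Call fetch(url) up to `tries` times, pausing `delay` seconds between attempts.

    Returns the fetched content, or None if every attempt failed, so the
    caller can skip this attachment and move on to the next one.
    """
    for attempt in range(1, tries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # e.g. a 500 raised by raise_for_status()
            print("attempt %d/%d failed for %s: %s" % (attempt, tries, url, exc))
            if attempt < tries:
                sleep(delay)
    return None
```

Returning None rather than raising keeps one bad attachment from stopping the rest of the archive; the caller would need to check for it and log the skipped file.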
I believe this is fixed in current master.