IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo Group's email archives, photo galleries and file contents using the non-public API
MIT License
93 stars, 45 forks

500 error when fetching image attachments #5

Closed philpem closed 4 years ago

philpem commented 5 years ago

It appears Yahoo are doing some kind of referrer checking.

If a message contains an image attachment, yahoo.py will try to download it. This will result in a 500 Internal Server Error.

** Fetching attachment 'DSC07465.JPG'
Traceback (most recent call last):
  File "./yahoo.py", line 192, in <module>
    archive_email(yga, reattach=(not args.no_reattach), save=(not args.no_save))
  File "./yahoo.py", line 52, in archive_email
    atts[attach['filename']] = yga.get_file(photoinfo['displayURL'])
  File "/mnt/zfs2/ygroups/yahoo-group-archiver/yahoogroupsapi.py", line 49, in get_file
    r.raise_for_status()
  File "/home/philpem/.local/lib/python2.7/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://xa.yimg.com/kq/groups/NgEGp_Hue9axOiNJrw--/or/skiVJcbofNa3MYMOEEQ-/name/DSC07465.JPG

Opening the image URL in a browser shows this error message:

We are sorry, you can not display images hosted by Yahoo! Groups on non Yahoo! Groups pages

philpem commented 5 years ago

I've just tried to download the image in a browser and got a 404 error. The thumbnails load but not the image.

It seems Yahoo are starting to delete things already...

jmay commented 5 years ago

I avoided this by adding the --no-reattach and --no-save parameters to yahoo.py.

inkblot14 commented 5 years ago

I got a 500 Server Error on some attachments too. After looking up the message, I could manually open the image attachment but I noticed the URL was different from what the script was using. Based on the difference, I tried this modification in yahoogroupsapi.py and it seems to be working so far:

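    def get_file(self, url):
        # Swap the "/or/" URL segment for "/hr/", as seen in the URL the web page uses
        newurl = url.replace("/or/", "/hr/")
        r = self.s.get(newurl)
        # Raise requests.exceptions.HTTPError on a 4xx/5xx response
        r.raise_for_status()
        return r.content
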
codycooperross commented 5 years ago

Maybe inkblot14's method was patched? Still getting the following error when trying to download images with "/or/" replaced with "/hr/" in the URL:

requests.exceptions.SSLError: HTTPSConnectionPool(host='xa.yimg.com', port=443): Max retries exceeded with url: /kq/groups/VXRh1NjuftMqZ99a/hr/jC5PKZ_tedbs7qvzMxNZ/name/IMG_0695.jpg?download=1 (Caused by SSLError(SSLEOFError(8, u'EOF occurred in violation of protocol (_ssl.c:590)'),))

Any ideas? The group we're trying to archive is private.

inkblot14 commented 5 years ago

I don't know if it helps, but the way I figured out the substitution was to open the message on the Yahoo Groups web page, open the attachment from there, and compare its URL with the one in the script's error message to see how they differed. I'd try that with your group and see whether there's something other than "/or/" and "/hr/"; maybe it's different for each group.
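
The comparison described above could be turned into a small helper that lists candidate URLs by swapping the variant segment. This is only a sketch: `candidate_urls` is a hypothetical name, and "or" and "hr" are the only variants that appear in this thread, so other groups may need a different list.

```python
def candidate_urls(url, variants=("or", "hr")):
    """Return one URL per variant by swapping the variant path segment.

    Hypothetical helper: only the "or" and "hr" segments are known from
    this thread; pass a different tuple if your group uses others.
    """
    candidates = []
    for current in variants:
        marker = "/%s/" % current
        if marker in url:
            # Rewrite the first matching segment once for every known variant.
            for wanted in variants:
                candidates.append(url.replace(marker, "/%s/" % wanted, 1))
            break
    # Fall back to the untouched URL when no known variant segment is present.
    return candidates or [url]
```

Each candidate can then be tried in turn until one downloads successfully.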

IgnoredAmbience commented 5 years ago

@codycooperross:

Maybe inkblot14's method was patched? Still getting the following error when trying to download images with "/or/" replaced with "/hr/" in the URL:

requests.exceptions.SSLError: HTTPSConnectionPool(host='xa.yimg.com', port=443): Max retries exceeded with url: /kq/groups/VXRh1NjuftMqZ99a/hr/jC5PKZ_tedbs7qvzMxNZ/name/IMG_0695.jpg?download=1 (Caused by SSLError(SSLEOFError(8, u'EOF occurred in violation of protocol (_ssl.c:590)'),))

Not sure about this one, given that it's an SSL error. Accessing that URL directly with an additional Referer: http://groups.yahoo.com/ header works for me in the browser. SSL connection issues were potentially being triggered by lack of certificate pinning from the Yahoo server, but I hadn't seen one with that exact SSL error before. It's unrelated to the 500 errors in the OP.
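
For anyone wanting to try the Referer approach from a script, here is a minimal sketch using only the standard library so it runs without extra packages. `build_image_request` is a hypothetical name, and whether this header alone satisfies Yahoo's check is an assumption based on the browser test above.

```python
import urllib.request

# The referer value mentioned in this thread.
YAHOO_GROUPS_REFERER = "http://groups.yahoo.com/"


def build_image_request(url, referer=YAHOO_GROUPS_REFERER):
    """Build (but do not send) a GET request that carries a Referer header."""
    return urllib.request.Request(url, headers={"Referer": referer})


# Sending it would look like this (needs network access and a valid URL):
# with urllib.request.urlopen(build_image_request(url)) as resp:
#     data = resp.read()
```

With a `requests` session such as the `self.s` used in `get_file`, the rough equivalent would be passing `headers={"Referer": "http://groups.yahoo.com/"}` to `self.s.get(...)`.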

@philpem:

It appears Yahoo are doing some kind of referrer checking.

If a message contains an image attachment, yahoo.py will try to download it. This will result in a 500 Internal Server Error.

...
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://xa.yimg.com/kq/groups/NgEGp_Hue9axOiNJrw--/or/skiVJcbofNa3MYMOEEQ-/name/DSC07465.JPG

Opening the image URL in a browser shows this error message:

We are sorry, you can not display images hosted by Yahoo! Groups on non Yahoo! Groups pages

I'm not able to reproduce the 500 error for this URL. I get the 404 error in the browser, but it goes away if I manually add a Yahoo Groups referer header.

I'd welcome further reports on errors when downloading images, especially if they're reliably triggered.

philpem commented 5 years ago

@IgnoredAmbience My fork actually fixes this. The issue is server-side. Yahoo seems to return 500 errors for some files - but if you make the request then retry after a short while, it'll usually work.

logological commented 5 years ago

@IgnoredAmbience My fork actually fixes this. The issue is server-side. Yahoo seems to return 500 errors for some files - but if you make the request then retry after a short while, it'll usually work.

I'm also running up against this problem and can confirm that retrying later often helps. Perhaps you could contribute your fix here?

samuelcole commented 5 years ago

Additionally, whenever I get a 500, the script seems to loop forever retrying the image. I would prefer to skip it and move on.

2019-10-27 19:24:10.747 EDT ERROR process_single_attachment ERROR downloading 'https://xa.yimg.com/kq/groups/adopXwvteNB1Qsz.aw--/or/qMLAVD3tctUT1.GDCIQj/name/mb86.jpg' variant or: 500 Server Error: Internal Server Error for url: https://xa.yimg.com/kq/groups/adopXwvteNB1Qsz.aw--/or/qMLAVD3tctUT1.GDCIQj/name/mb86.jpg

samuelcole commented 5 years ago

i think the problem is this line: https://github.com/IgnoredAmbience/yahoo-group-archiver/blob/55bbae30749bf71f514a3f034cab563aa2b4016b/yahoo.py#L178

it doesn't seem limited by TRIES
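
A bounded retry that waits briefly between attempts and then gives up might look roughly like the sketch below. It is not the code from yahoo.py: `fetch_with_retries`, its parameters, and the default values are all hypothetical, and `fetch` stands in for whatever function actually downloads the URL.

```python
import time


def fetch_with_retries(fetch, url, tries=3, delay=5.0, sleep=time.sleep):
    """Call fetch(url) up to `tries` times; return None if every attempt fails.

    All names and defaults here are hypothetical, not taken from yahoo.py.
    """
    for attempt in range(1, tries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # e.g. requests.exceptions.HTTPError on a 500
            print("attempt %d/%d failed for %s: %s" % (attempt, tries, url, exc))
            if attempt < tries:
                sleep(delay)  # wait a short while before the next attempt
    return None  # the caller can skip this attachment and move on
```

Returning `None` instead of raising lets the caller log the failure and continue with the next attachment, so a persistent 500 cannot stall the whole run.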

IgnoredAmbience commented 4 years ago

I believe this is fixed in current master.