IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries and file contents using the non-public API
MIT License
93 stars 46 forks

Some calendars return 401 which crashes the archiver #43

d235j closed 4 years ago

d235j commented 4 years ago

I'm having this happen with the Alps group (which is unfortunately private).

IgnoredAmbience commented 4 years ago

PR to fix welcomed :)

d235j commented 4 years ago

Found the problem.

get_calendars() relies on a failed request (one that returns 401 or 403) to obtain the correct wssid parameter.

https://github.com/IgnoredAmbience/yahoo-group-archiver/commit/70cc682996e6206869e7192cb8590b557ff47746 changed get_file() to download_file() which doesn't handle the wssid properly — see https://github.com/IgnoredAmbience/yahoo-group-archiver/blob/d0644995977b808969e6e0c7b44a0fe780273bac/yahoo.py#L372 . I can PR soon.
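
For illustration only, here is a rough sketch of pulling a wssid out of the body of such a 401/403 response. The thread does not show the body's layout, so this assumes it is JSON and simply searches for any key named "wssid". The helper names find_wssid and wssid_from_error_body are made up for this example and are not taken from yahoo.py.

```python
import json
from typing import Any, Optional


def find_wssid(node: Any) -> Optional[str]:
    """Depth-first search for the first string stored under a "wssid" key."""
    if isinstance(node, dict):
        value = node.get("wssid")
        if isinstance(value, str):
            return value
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_wssid(child)
        if found is not None:
            return found
    return None


def wssid_from_error_body(body: bytes) -> Optional[str]:
    """Return the wssid from a 401/403 body, or None if it cannot be found.

    Assumption: the error body is JSON. Its exact shape is not given in
    this thread, which is why the search is recursive rather than fixed.
    """
    try:
        return find_wssid(json.loads(body))
    except ValueError:  # not valid JSON (JSONDecodeError is a ValueError)
        return None
```

The point is only that the caller has to be handed the error body instead of an exception, which is what the error-tolerant fetch used to allow.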

IgnoredAmbience commented 4 years ago

Ah, I hadn't realised what was going on there when I merged that change. Just reintroducing the error-friendly version of get_file should do.

d235j commented 4 years ago

@IgnoredAmbience what are your thoughts on making download_file return the content even when there is an error that doesn't go away on retry, so that it can be stored? (Or is that unnecessary as we're storing that in warc?)

IgnoredAmbience commented 4 years ago

The exception raised by requests should have the response and request objects available on it to query. It should be possible to do this from an exception handler in the places where we expect a failure.
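
A minimal sketch of that handler approach, under some assumptions: in requests, the HTTPError raised by Response.raise_for_status() carries the failed response as its .response attribute. To keep the example runnable without a network or the library itself, it catches any exception exposing .response and uses stand-in classes (StubResponse, StubHTTPError). Those stand-ins, the fetch callable and the function name are hypothetical and not part of the archiver.

```python
from typing import Callable, Iterable


def content_despite_expected_error(
    fetch: Callable[[], bytes],
    expected_statuses: Iterable[int] = (401, 403),
) -> bytes:
    """Call fetch(); if it fails with an expected status, return the error body."""
    allowed = tuple(expected_statuses)
    try:
        return fetch()
    except Exception as exc:
        # With requests, this would be `except requests.exceptions.HTTPError`.
        # Errors without a response, or with an unexpected status, propagate.
        response = getattr(exc, "response", None)
        if response is None or response.status_code not in allowed:
            raise
        return response.content


class StubResponse:
    """Stand-in for a response object (status_code and content only)."""

    def __init__(self, status_code: int, content: bytes) -> None:
        self.status_code = status_code
        self.content = content


class StubHTTPError(Exception):
    """Stand-in for an HTTP error that carries the failed response."""

    def __init__(self, response: StubResponse) -> None:
        super().__init__("HTTP %d" % response.status_code)
        self.response = response
```

This way download_file could keep raising on errors, and only the call sites that expect a 401 or 403 (such as the calendar code) would need the handler.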