edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
61 stars 12 forks source link

WaybackSession not properly handling archived 429 (rate limit) responses #158

Closed geosoco closed 7 months ago

geosoco commented 7 months ago

As per the memento spec and as noted in the comments in _client.py (line ~987), 4XX status responses should also be returned by the memento server exactly as they were archived. WaybackClient has code to work around this, but Wayback session currently assumes that all 429s are real and throws an exception instead of returning the response.

Here's an example archived link where reddit returned a 429 to the archive bot: https://web.archive.org/web/20150129034904/http://www.reddit.com/r/PokemonGiveaway

This happens in _client.py around line ~458 because it has the memento-datetime header the executation is gated through the if statements. The second only checks if the response code is 429 without seeing if it has memento-datetime, and then throws an exception instead of returning the response.

untested: I suspect there should be an outer if statement that checks for the memento-datetime, and if present always return the response as these are recorded errors and the client/calling code can handle if needed. In the else, the rest can be handled? I can't currently reproduce a real 429 from archive, so I can't validate whether this would fully solve the issue.

Mr0grog commented 7 months ago

Oh, good catch, and thanks for a real example! I definitely won’t have time to get to this until at least tomorrow (and maybe a couple of days), but if you have time to work on it, I am happy to review a pull request!

I suspect there should be an outer if statement that checks for the memento-datetime, and if present always return the response…

Yeah, I think you are correct here. The first thing WaybackSession.send() does with the response should be to immediately return it if is_memento(response) is True. Then the rest of the existing logic can stay as-is, although we should remove the is_memento() call from WaybackSession.should_retry() since it will now be redundant: https://github.com/edgi-govdata-archiving/wayback/blob/477c44439abdebaf4d9942c7cb2aa55c3099b846/src/wayback/_client.py#L500-L505

If you decide to work on a pull request (many thanks! 🙏), please be sure to include a test! Since you have a real-world example, the easiest thing to do is just use that same URL in the test, and add the @ia_vcr.use_cassette() decorator, like this test: https://github.com/edgi-govdata-archiving/wayback/blob/477c44439abdebaf4d9942c7cb2aa55c3099b846/src/wayback/tests/test_client.py#L353-L361

That will cause the test framework to record the actual server calls the first time you run it, then replay them for future test runs. (The recording will be a file named src/wayback/tests/cassettes/<name_of_test>.yml, so make sure to commit that new file.)