akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License

Raise Error if response is 429 #131

Closed eggplants closed 2 years ago

eggplants commented 2 years ago

When requests are sent too frequently, the Wayback Machine returns 429 as the HTTP status code.

The returned HTML body is:

https://gist.github.com/eggplants/414bab0230f14358642faf364bc1f7ec

<h1>Too Many Requests</h1>

We are limiting the number of URLs you can submit to be Archived to the Wayback Machine, 
using the Save Page Now features, to no more than 15 per minute.
<p>
  If you submit more than that we will block Save Page Now requests from your IP number for 5 minutes.
</p>
<p>
  Please feel free to write to us at info@archive.org if you have questions about this.  
  Please include your IP address and any URLs in the email so we can provide you with better service.
</p>

So I suggest raising TooManyRequestsError when the returned status code is 429.

akamhy commented 2 years ago

So I suggest raising TooManyRequestsError when the returned status code is 429.

    def get_save_request_headers(self) -> None:
        """
        Creates a session and tries 'retries' number of times to
        retrieve the archive.
        If successful in getting the response, sets the headers, status_code
        and response_url attributes.
        The archive is usually in the headers but it can also be the response URL
        as the Wayback Machine redirects to the archive after a successful capture
        of the webpage.
        Wayback Machine's save API is known
        to be very unreliable thus if it fails first check opening
        the response URL yourself in the browser.
        """
        session = requests.Session()
        retries = Retry(
            total=self.total_save_retries,
            backoff_factor=self.backoff_factor,
            status_forcelist=self.status_forcelist,
        )
        session.mount("https://", HTTPAdapter(max_retries=retries))
        self.response = session.get(self.request_url, headers=self.request_headers)
        # requests.response.headers is requests.structures.CaseInsensitiveDict
        self.headers: CaseInsensitiveDict[str] = self.response.headers
        self.status_code = self.response.status_code
        self.response_url = self.response.url
        session.close()
        if self.status_code == 429:
            raise TooManyRequestsError("The error message here")

What should the error message be? Should it be parsed from the response body every time, or should it be a string literal?

eggplants commented 2 years ago

Example: Save Page Now accepts up to 15 URLs per minute. Wait a moment and run again.

eggplants commented 2 years ago

I think just checking the status code should be enough; the message can be a string literal.
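As a minimal sketch of the approach discussed here (status-code check only, fixed message), something like the following could work. The `TooManyRequestsError` class below is a local stand-in defined only so the snippet runs on its own, and `check_rate_limit` and `SAVE_RATE_LIMIT_MESSAGE` are hypothetical names, not part of waybackpy.

```python
class TooManyRequestsError(Exception):
    """Local stand-in for the exception proposed in this issue (assumption)."""


# Fixed message based on the wording suggested above; nothing is parsed
# from the HTML body of the response.
SAVE_RATE_LIMIT_MESSAGE = (
    "Save Page Now accepts up to 15 URLs per minute. "
    "Wait a moment and run again."
)


def check_rate_limit(status_code: int) -> None:
    """Raise TooManyRequestsError if status_code is 429; otherwise do nothing."""
    if status_code == 429:
        raise TooManyRequestsError(SAVE_RATE_LIMIT_MESSAGE)


# Usage: a 200 passes silently, a 429 raises.
check_rate_limit(200)
try:
    check_rate_limit(429)
except TooManyRequestsError as error:
    print(f"Rate limited: {error}")
```

Keeping the message as a literal avoids depending on the exact HTML that the Wayback Machine returns, which could change without notice.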

akamhy commented 2 years ago

@eggplants will you be working on this issue? Just asking so that we don't both end up creating PRs for it.

eggplants commented 2 years ago

I'll do it myself if you'd like.

akamhy commented 2 years ago

I'll do it myself if you'd like.

Go ahead.

akamhy commented 2 years ago

For future reference

See also https://github.com/akamhy/waybackpy/pull/142#issuecomment-1031850965

A 429 doesn't always imply that we have hit 15 archives per minute, at least on my IP. It could also imply that the URL we are trying to archive has reached its maximum limit.
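Given that observation, the error message probably should not assert a single cause. A rough sketch of a message that lists both possibilities is shown below; `build_429_message` is a hypothetical helper and `TooManyRequestsError` is again a local stand-in, neither taken from the waybackpy source.

```python
class TooManyRequestsError(Exception):
    """Local stand-in for the library's exception (assumption)."""


def build_429_message(url: str) -> str:
    """Build a message that names both causes of a 429 mentioned in this thread."""
    return (
        f"Wayback Machine returned status code 429 while saving {url!r}. "
        "Either more than 15 Save Page Now requests per minute were sent "
        "from this IP address, or this URL has reached its maximum limit. "
        "Save Page Now requests may be blocked for 5 minutes."
    )


def check_save_status(status_code: int, url: str) -> None:
    """Raise TooManyRequestsError for a 429 without claiming a single cause."""
    if status_code == 429:
        raise TooManyRequestsError(build_429_message(url))
```

The 5-minute figure comes from the HTML body quoted at the top of this issue; whether it also applies to the per-URL limit is not stated in the thread.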