akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License
453 stars 32 forks

CDX API handling of URLs excluded from the Wayback Machine #157

Closed h6197627 closed 2 years ago

h6197627 commented 2 years ago

URLs that were excluded from the Wayback Machine are not handled properly when using the CDX Server API (the Availability API is fine). A manual request through the web user interface returns:

Sorry.

This URL has been excluded from the Wayback Machine.

API request returns: org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error

waybackpy's cdx_utils.py does not expect such a response and crashes with an exception:

Traceback (most recent call last):
  ...
    for snapshot in cdx.snapshots():
  File "/usr/local/lib/python3.6/dist-packages/waybackpy/cdx_api.py", line 144, in snapshots
    for text in texts:
  File "/usr/local/lib/python3.6/dist-packages/waybackpy/cdx_api.py", line 52, in cdx_api_manager
    total_pages = get_total_pages(self.url, self.user_agent)
  File "/usr/local/lib/python3.6/dist-packages/waybackpy/cdx_utils.py", line 15, in get_total_pages
    return int(response.text.strip())
ValueError: invalid literal for int() with base 10: 'org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error'

To Reproduce

Sample URL: http://gotceleb.com

Version:

akamhy commented 2 years ago

@h6197627 This should probably raise a custom exception instead of ValueError. Maybe BlockedSiteError?
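
A minimal sketch of what the custom-exception approach might look like, assuming the blocked response can be recognized by the "Blocked Site Error" text quoted above. The helper name `parse_total_pages` and the `BLOCKED_MARKER` constant are hypothetical and not part of waybackpy; the function takes the response body as a plain string, so it assumes nothing about how the request itself is made.

```python
class BlockedSiteError(Exception):
    """Hypothetical exception for URLs excluded from the Wayback Machine."""


# Marker text taken from the API response quoted in this issue.
BLOCKED_MARKER = "Blocked Site Error"


def parse_total_pages(response_text: str) -> int:
    """Parse the page count, raising BlockedSiteError for excluded URLs.

    Hypothetical helper: any other non-numeric body still raises ValueError.
    """
    text = response_text.strip()
    if BLOCKED_MARKER in text:
        raise BlockedSiteError(
            f"URL is excluded from the Wayback Machine: {text}"
        )
    return int(text)
```

With something like this, a caller could catch BlockedSiteError specifically rather than a generic ValueError.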

h6197627 commented 2 years ago

In my opinion it is better to simply return no snapshots, without raising an exception, since from the user's perspective, I think, it doesn't matter whether the URL is blocked or simply was not archived. Though in some use cases that I am not aware of, it might be important to know this information.
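
For comparison, a rough sketch of the "no snapshots, no exception" alternative, assuming the blocked response is identified by the "Blocked Site Error" text from the API output above. The function `iter_snapshot_lines` and the `BLOCKED_MARKER` constant are hypothetical illustrations, not waybackpy code; the function works on a response body passed in as a string.

```python
from typing import Iterator

# Marker text taken from the API response quoted in this issue.
BLOCKED_MARKER = "Blocked Site Error"


def iter_snapshot_lines(response_text: str) -> Iterator[str]:
    """Yield non-empty lines of a CDX response body.

    Hypothetical helper: a blocked URL is treated like an unarchived one,
    so the generator simply yields nothing.
    """
    if BLOCKED_MARKER in response_text:
        return  # blocked site: behave as if there were no snapshots
    for line in response_text.splitlines():
        line = line.strip()
        if line:
            yield line
```

The trade-off is that the caller can no longer tell a blocked URL from one that was never archived.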

h6197627 commented 2 years ago

But you are probably right: if the API treats this situation as an exception, then it is better to stay close to the API. I think a custom BlockedSiteError is OK.