akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License

Availability API instability #154

Closed by h6197627 2 years ago

h6197627 commented 2 years ago

This code:

from waybackpy import WaybackMachineAvailabilityAPI
url = 'http://address.com/some_disappeared_but_archived_image.jpg'
availability_api = WaybackMachineAvailabilityAPI(
  url,
  'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) '
  'Gecko/20100101 Firefox/97.0'
)
if availability_api.json()['archived_snapshots']:
    arch_url = str(availability_api.oldest())

can occasionally throw:

Traceback (most recent call last):
  ...
    arch_url = str(availability_api.oldest())
  File "/usr/local/lib/python3.6/dist-packages/waybackpy/availability_api.py", line 53, in __str__
    return self.archive_url
  File "/usr/local/lib/python3.6/dist-packages/waybackpy/availability_api.py", line 130, in archive_url
    + "\nResponse data:\n{response}".format(response=self.response.text)
waybackpy.exceptions.ArchiveNotInAvailabilityAPIResponse: Archive not found in the availability API response, the URL you requested may not have any archives yet. You may retry after some time or archive the webpage now.
Response data:
{"url": "http://address.com/some_disappeared_but_archived_image.jpg", "archived_snapshots": {}, "timestamp": "199402152345"}

Re-running with the same URL afterwards works as expected (the URL is in fact archived by the Wayback Machine).

Version:

akamhy commented 2 years ago

@h6197627 This is a known bug in the Wayback Machine's Availability API, not in this package. We already implemented retrying, but sometimes that doesn't help either: https://github.com/akamhy/waybackpy/blob/f63c6adf7933ccbf5c8c754b86d558a3d01748f8/waybackpy/availability_api.py#L170-L190

Do you have any suggestions?
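
For readers following along, the general shape of such a retry can be sketched as below. This is only an illustration and not the code behind the link above: the helper name fetch_with_retry, its parameters, and the fetch callable (a stand-in for the real HTTP request) are all assumptions made for this example.

```python
import time
from typing import Any, Callable, Dict


def fetch_with_retry(
    fetch: Callable[[], Dict[str, Any]],  # hypothetical stand-in for the HTTP call
    attempts: int = 3,
    delay: float = 1.0,
) -> Dict[str, Any]:
    """Call fetch until 'archived_snapshots' is non-empty or attempts run out.

    Returns the last payload received, which may still be empty.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    payload: Dict[str, Any] = {}
    for attempt in range(attempts):
        payload = fetch()
        if payload.get("archived_snapshots"):
            break  # got a usable answer, stop retrying
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next try
    return payload
```

As noted above, retrying only lowers the odds of an empty answer; the caller still has to cope with an empty archived_snapshots after the last attempt.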

h6197627 commented 2 years ago

I see. After looking into the waybackpy code I realized that my code is a bit redundant; I should use:

response = availability_api.oldest()
if response.json['archived_snapshots']:
    arch_url = str(response)

That way there is only one API call, so the inconsistency cannot occur: there is no second call whose response could disagree with the first.

Of course, this is a workaround rather than a true solution.
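
The same idea can be sketched without the library, assuming the Availability API's JSON layout (an archived_snapshots object with an optional closest entry). The helper name closest_url is hypothetical and not part of waybackpy; it inspects one already-fetched response, so nothing is requested twice.

```python
from typing import Any, Dict, Optional


def closest_url(payload: Dict[str, Any]) -> Optional[str]:
    """Return the snapshot URL from one Availability API response, or None."""
    snapshots = payload.get("archived_snapshots") or {}
    closest = snapshots.get("closest") or {}
    # Only trust the URL when the API marks the snapshot as available.
    return closest.get("url") if closest.get("available") else None


# The empty response from the traceback above yields None instead of raising.
empty = {
    "url": "http://address.com/some_disappeared_but_archived_image.jpg",
    "archived_snapshots": {},
    "timestamp": "199402152345",
}
print(closest_url(empty))
```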

eggplants commented 2 years ago

Tip: in 3.0.3, the method WaybackMachineAvailabilityAPI.json was moved to WaybackMachineAvailabilityAPI.setup_json, while the attribute WaybackMachineAvailabilityAPI.json was left unchanged. As a result, availability_api.json() now raises either TypeError: 'NoneType' object is not callable or TypeError: 'dict' object is not callable.

h6197627 commented 2 years ago

One suggestion would be to handle this situation (inconsistent results from successive calls on the same WaybackMachineAvailabilityAPI object) a bit better. When the user calls a method that sends a new request to the Wayback Machine, we could check whether self.json is already populated and, if so, save the previous availability flag. If the new response reports a different availability than the saved flag, handle it somehow, perhaps by raising a new exception type such as AvailabilityAPIInternalError.
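
A rough sketch of what that check might look like, written independently of waybackpy's real classes. Everything here is hypothetical: the wrapper class ConsistentAvailability, its refresh method, the fetch callable standing in for the HTTP request, and the exception, which does not exist in the package.

```python
from typing import Any, Callable, Dict, Optional


class AvailabilityAPIInternalError(Exception):
    """Hypothetical exception: two calls disagreed about availability."""


class ConsistentAvailability:
    """Remember the previous availability flag and compare it on each fetch."""

    def __init__(self, fetch: Callable[[], Dict[str, Any]]) -> None:
        self._fetch = fetch  # stand-in for the real request to the Wayback Machine
        self._was_available: Optional[bool] = None  # None until the first call

    def refresh(self) -> Dict[str, Any]:
        payload = self._fetch()
        available = bool(payload.get("archived_snapshots"))
        previous = self._was_available
        self._was_available = available  # save the flag for the next call
        if previous is not None and previous != available:
            raise AvailabilityAPIInternalError(
                f"availability changed from {previous} to {available}"
            )
        return payload
```

Whether a mismatch should raise, trigger a retry, or just be logged is a design choice; raising at least makes the inconsistency visible to the caller.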

akamhy commented 2 years ago

@h6197627 Why not use the CDX Server API instead? It is more reliable: because it is somewhat more complex than the Availability API, far fewer people know how to use it, so the server can handle requests gracefully.

>>> 
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://www.google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=1994, limit=1)
>>> snapshots = cdx.snapshots()
>>> oldest_archive = None # For cases when no archive found in the CDX server
>>> for snapshot in snapshots:
...     oldest_archive = snapshot.archive_url
...     break
... 
>>> oldest_archive
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> type(oldest_archive)
<class 'str'>
>>> 

In the above code, limit=1 makes sure that each CDX server request returns only one URL, and since we only care about the oldest archive we break out of the loop on the first iteration. If no archive is found, oldest_archive stays None.
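
The take-the-first-item-or-None loop can also be written with the built-in next() and a default value. The sketch below uses stand-in objects instead of a live CDX request; the helper name first_archive_url is hypothetical, and the only assumption about snapshots is that each one has an archive_url attribute, as in the session above.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def first_archive_url(snapshots: Iterable[Any]) -> Optional[str]:
    """Return archive_url of the first snapshot, or None if there are none."""
    return next((snapshot.archive_url for snapshot in snapshots), None)


# Stand-in for cdx.snapshots(); no network request is made here.
fake_snapshots = [
    SimpleNamespace(
        archive_url="https://web.archive.org/web/19981111184551/http://google.com:80/"
    )
]
print(first_archive_url(fake_snapshots))
print(first_archive_url([]))
```

With the real library, the call would presumably be first_archive_url(cdx.snapshots()).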

If you have questions about the CDX server API, feel free to ask here.

h6197627 commented 2 years ago

Thanks @akamhy, I will take a look!