Closed h6197627 closed 2 years ago
@h6197627 This is a known bug in the Wayback Machine's Availability API, not in this package. We already implemented retrying, but sometimes that doesn't help either: https://github.com/akamhy/waybackpy/blob/f63c6adf7933ccbf5c8c754b86d558a3d01748f8/waybackpy/availability_api.py#L170-L190
Do you have any suggestions?
I see. After looking into the waybackpy code I understood that my code is a bit redundant; I should use
response = availability_api.oldest()
if response.json['archived_snapshots']:
    arch_url = str(response)
That way I make only one API call, so the inconsistency is avoided because there is no second call.
Of course, it is not a true solution but a workaround.
TIPS: In 3.0.3, the class method WaybackMachineAvailabilityAPI.json was moved to WaybackMachineAvailabilityAPI.setup_json, while the class variable WaybackMachineAvailabilityAPI.json was left unchanged. So availability_api.json() raises TypeError: 'NoneType' object is not callable or TypeError: 'dict' object is not callable.
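Both messages are ordinary Python behaviour when an attribute that holds None or a dict is called like a method. A minimal sketch with a hypothetical stand-in class (FakeAvailabilityAPI and call_error are made up for illustration and are not waybackpy code):

```python
class FakeAvailabilityAPI:
    """Hypothetical stand-in: `json` is a plain attribute, not a method."""

    def __init__(self):
        self.json = None  # no response has been fetched yet

    def setup_json(self):
        # Stand-in for the real request; fills the attribute with a dict.
        self.json = {"archived_snapshots": {}}
        return self.json


def call_error(obj):
    """Call obj.json() and return the TypeError message, or None if it worked."""
    try:
        obj.json()
    except TypeError as exc:
        return str(exc)
    return None


api = FakeAvailabilityAPI()
before = call_error(api)  # json is still None here
api.setup_json()
after = call_error(api)   # json is now a dict
print(before)
print(after)
```

Reading the attribute without parentheses (api.json) avoids the error in both states.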
As a suggestion, such situations (inconsistency between subsequent calls on the same WaybackMachineAvailabilityAPI object) could perhaps be handled a bit better:
if the user calls a method that initiates a new request to the Wayback Machine, we can check whether the self.json structure is already filled and, if so, save the previous availability flag. If after the second call the resource availability differs from the saved flag, handle it somehow (maybe with a new type of exception like AvailabilityAPIInternalError).
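A rough sketch of that idea, kept independent of waybackpy so it runs without network access. AvailabilityTracker and has_snapshot are hypothetical names, the exception is only the one proposed above, and the payload shape assumes nothing beyond the 'archived_snapshots' key used earlier in this thread:

```python
class AvailabilityAPIInternalError(Exception):
    """Hypothetical exception for contradictory Availability API answers."""


def has_snapshot(payload):
    """True if the payload reports at least one archived snapshot."""
    return bool(payload and payload.get("archived_snapshots"))


class AvailabilityTracker:
    """Remembers the previous availability flag and checks each new response."""

    def __init__(self):
        self._last_flag = None  # None means no response has been seen yet

    def update(self, payload):
        flag = has_snapshot(payload)
        if self._last_flag is not None and flag != self._last_flag:
            raise AvailabilityAPIInternalError(
                f"availability changed from {self._last_flag} to {flag} between calls"
            )
        self._last_flag = flag
        return flag
```

Something like tracker.update(response_json) after every request would then raise as soon as two consecutive answers for the same URL disagree, instead of failing later with a less obvious error.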
@h6197627 Why not use the CDX Server API instead? It is more reliable than the Availability API: it is somewhat more complex, so far fewer people know how to use it, and the server can handle requests gracefully.
>>>
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://www.google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=1994, limit=1)
>>> snapshots = cdx.snapshots()
>>> oldest_archive = None # For cases when no archive found in the CDX server
>>> for snapshot in snapshots:
...     oldest_archive = snapshot.archive_url
...     break
...
>>> oldest_archive
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> type(oldest_archive)
<class 'str'>
>>>
In the above code, limit=1 makes sure that each CDX server request returns only one URL, and as we only care about the oldest archive we break out of the loop on the first iteration. Also, if no archive is found, oldest_archive stays None.
If you have doubts about the CDX server API feel free to ask more here.
Thanks @akamhy, I will take a look!
This code:
can occasionally throw:
Re-running with the same URL the next time works as expected (the URL is actually archived by the Wayback Machine).
Version: