akamhy / waybackpy

Wayback Machine API interface & a command-line tool
https://pypi.org/project/waybackpy/
MIT License

WaybackMachineCDXServerAPI.newest does not return latest snapshot #176

Open sissbruecker opened 2 years ago

sissbruecker commented 2 years ago

Describe the bug

WaybackMachineCDXServerAPI.newest does not return the latest snapshot, only a fairly recent one. For example, for https://openlayers.org/ it returns a snapshot from 2022-06-16 17:20:36, while the latest snapshot (as of today, September 10th 2022) is from 2022-09-10 08:05:37. There are around 380 snapshots between these two.

I've debugged this a bit, and it seems there is an issue either with how sort or limit are configured, or with how the CDX server interprets them. The method sets sort = 'closest' and limit = 1. If I configure the WaybackMachineCDXServerAPI instance manually and set limit = -1 instead, I actually get the latest snapshot. https://github.com/akamhy/waybackpy/issues/155#issuecomment-1041882795 hints that limit = -1 should be used for the latest snapshot.

To Reproduce

import waybackpy

url = 'https://openlayers.org/'
cdx_api = waybackpy.WaybackMachineCDXServerAPI(url)
newest_snapshot = cdx_api.newest()
print(newest_snapshot.datetime_timestamp)
# prints 2022-06-16 17:20:36, should be 2022-09-10 08:05:37

Workaround

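import time

import waybackpy

url = 'https://openlayers.org/'
# Use the current time as the reference point for the 'closest' sort
unix_timestamp = int(time.time())
timestamp = waybackpy.utils.unix_timestamp_to_wayback_timestamp(unix_timestamp)
cdx_api = waybackpy.WaybackMachineCDXServerAPI(url)
cdx_api.closest = timestamp
cdx_api.sort = 'closest'
# limit = -1 instead of the limit = 1 that newest() sets
cdx_api.limit = -1

# Only the first yielded snapshot is needed
for item in cdx_api.snapshots():
    print(item.datetime_timestamp)
    break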

Expected behavior

The newest API should return the newest snapshot.

Version:

sissbruecker commented 2 years ago

Hmm, with limit = -1 sometimes you don't get any result at all from the CDX API. For example:

http://web.archive.org/cdx/search/cdx?url=https://github.com/awslabs/aws-serverless-express&gzip=false&showResumeKey=true&limit=-1

returns an empty response.

However:

http://web.archive.org/cdx/search/cdx?url=https://github.com/awslabs/aws-serverless-express&gzip=false&showResumeKey=true&limit=-5

returns 5 entries.

The CDX API docs are not super clear, but that looks like a bug. A workaround could be to use a higher limit for newest, and then only take the first result.
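A rough sketch of that workaround, independent of waybackpy: request the last few captures with a larger negative limit (here -5, as in the URL above) and pick the entry with the greatest timestamp, rather than relying on result order. The helper names `build_cdx_url` and `newest_capture` are hypothetical. The sketch also assumes the default space-separated CDX output, where the second field is the 14-digit capture timestamp, and it leaves out showResumeKey so the response contains only capture lines.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, limit=-5):
    """Build a CDX query URL asking for the last few captures (hypothetical helper)."""
    params = {"url": url, "gzip": "false", "limit": limit}
    return CDX_ENDPOINT + "?" + urlencode(params)


def newest_capture(cdx_text):
    """Return the fields of the capture with the greatest timestamp, or None.

    Assumes space-separated CDX lines whose second field is the
    14-digit capture timestamp; other lines are ignored.
    """
    rows = [line.split(" ") for line in cdx_text.splitlines() if line.strip()]
    rows = [row for row in rows if len(row) >= 3 and row[1].isdigit()]
    if not rows:
        return None
    return max(rows, key=lambda row: int(row[1]))
```

The response body can be fetched with any HTTP client and passed to `newest_capture` as text. An empty response, such as the limit=-1 case above, yields None, so the caller can retry with a larger limit.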