edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

`Memento.url` property can be wrong if it is SURT-equivalent to the actual URL #99

Closed Mr0grog closed 1 year ago

Mr0grog commented 1 year ago

If you request a memento using a URL that is SURT-equivalent to, but not the same as, the memento’s actual URL, the `url` property of the resulting memento object is incorrect: it reflects the URL you requested rather than the URL that was actually captured.

For example:

```python
from wayback import WaybackClient
c = WaybackClient()

memento = c.get_memento('http://robbrackett.com/', datetime='20220315020402')
memento.url
# 'http://robbrackett.com/'
# But the actual capture was from:
# 'https://robbrackett.com/'

# The `link` header has the right info:
memento._raw_headers['link']
# '<https://robbrackett.com/>; rel="original", ...'
```

The correct URL is in the `link` header, and we should be parsing that. We’ve had a feature request to do so for a while (#57), but I hadn’t realized that parsing it is also required to properly fix a bug like this one.
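As a rough illustration of what parsing that header could look like, here is a minimal sketch that pulls the `rel="original"` target out of a `link` header string. The helper name `find_original_url` and the sample header are hypothetical and are not part of `wayback`. The regular expressions cover only the common `<url>; name="value"` shape, so this is not a complete RFC 8288 parser.

```python
import re

# One link-value: <target> followed by zero or more `; name=value` parameters.
# Quoted values may contain commas (e.g. datetime="Tue, 15 Mar 2022 ..."),
# so the header cannot simply be split on commas.
_LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,"]+))*)')
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')


def find_original_url(link_header):
    """Return the target of the first link whose rel includes "original".

    Hypothetical helper for illustration only. Returns None if no such
    link is present.
    """
    for target, params in _LINK_VALUE.findall(link_header):
        for name, quoted, bare in _PARAM.findall(params):
            # A rel value can list several space-separated relation types.
            rels = (quoted or bare).lower().split()
            if name.lower() == 'rel' and 'original' in rels:
                return target
    return None


# Illustrative header modeled on the one above; entries after the first
# are assumptions about what else such a header might contain.
sample = (
    '<https://robbrackett.com/>; rel="original", '
    '<https://web.archive.org/web/20220315020402/https://robbrackett.com/>; '
    'rel="first memento"; datetime="Tue, 15 Mar 2022 02:04:02 GMT"'
)
original_url = find_original_url(sample)
```

With something like this in place, `Memento.url` could be set from the parsed value instead of the requested URL, falling back to the requested URL when the header has no `original` link.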