edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Some original headers are getting lost #98

Closed Mr0grog closed 2 years ago

Mr0grog commented 2 years ago

It looks like something has changed about either Requests or the Wayback Machine, and we are no longer including all the original archived headers in a Memento object’s headers property. For example:

from wayback import WaybackClient
c = WaybackClient()

memento = c.get_memento('https://robbrackett.com/', datetime='20220315020402')
memento.headers
# {'Content-Type': 'text/html'}

But the value of memento.headers should really be something like:

{'date': 'Tue, 15 Mar 2022 02:04:02 GMT', 'server': 'Apache', 'upgrade': 'h2,h2c', 'connection': 'Upgrade, Keep-Alive', 'last-modified': 'Mon, 30 Nov 2020 22:51:03 GMT', 'accept-ranges': 'bytes', 'content-length': '13182', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=15, max=768', 'Content-Type': 'text/html'}

(Based on https://web.archive.org/web/20220315020402id_/http://robbrackett.com/)

Mr0grog commented 2 years ago

At a quick glance, it looks like archive.org has started returning x-archive-orig-* headers with lower-case header names, and we are looking for capitalized ones (which they used to be):

https://github.com/edgi-govdata-archiving/wayback/blob/bce65fda43138dd355034e29e2c2154cf5de1b64/wayback/_models.py#L273-L277

I’m guessing this started happening when they added HTTP/2 support (in HTTP/2, all header names are lower-case). That said, we can’t just switch to looking for lower-case names here, since archive.org’s HTTP/1.1 responses still use capitalized names for standard headers like Date and Location.
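
One way to handle both cases might be to match the prefix case-insensitively and keep whatever casing the server sent for the rest of the name. The following is only a minimal sketch of that idea, not the library's actual code: `ORIGINAL_HEADER_PREFIX` and `extract_original_headers` are hypothetical names made up for illustration, and it only handles the prefixed headers (not unprefixed ones like Content-Type).

```python
# Hypothetical constant: the prefix archive.org puts on original headers,
# written in lower-case so it can be compared against a lower-cased name.
ORIGINAL_HEADER_PREFIX = 'x-archive-orig-'


def extract_original_headers(raw_headers):
    """
    Return a dict of the originally archived headers found in
    ``raw_headers``, with the ``x-archive-orig-`` prefix removed.

    The prefix is matched case-insensitively, so this works whether the
    response used ``X-Archive-Orig-Date`` (HTTP/1.1 style) or
    ``x-archive-orig-date`` (HTTP/2 style). The rest of the header name
    keeps the casing it arrived with.
    """
    original = {}
    for name, value in raw_headers.items():
        if name.lower().startswith(ORIGINAL_HEADER_PREFIX):
            # Slice by length rather than str.replace() so that only the
            # leading prefix is removed.
            original[name[len(ORIGINAL_HEADER_PREFIX):]] = value
    return original


# Example with mixed casing, as might be seen across HTTP versions.
raw = {
    'X-Archive-Orig-Date': 'Tue, 15 Mar 2022 02:04:02 GMT',
    'x-archive-orig-server': 'Apache',
    'Content-Type': 'text/html',
}
extracted = extract_original_headers(raw)
# extracted == {'Date': 'Tue, 15 Mar 2022 02:04:02 GMT', 'server': 'Apache'}
```

Because the resulting names can differ in case between HTTP/1.1 and HTTP/2 responses, callers would probably still want to store them in a case-insensitive mapping.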