edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Memento links should be in same mode as Memento #111

Closed by Mr0grog 1 year ago

Mr0grog commented 1 year ago

In #108, I added a links property to Memento objects with parsed data from the Link HTTP headers of mementos. However, the links to other mementos in that data turn out to always be in view mode, regardless of the mode of the memento you requested!

For example:

from wayback import WaybackClient, Mode
client = WaybackClient()

memento = client.get_memento('https://epa.gov/', '20230210003633')

# Memento is in original mode:
memento.mode == Mode.original.value
# But the links are not:
memento.links == {
    'original': {
        'url': 'https://www.epa.gov/',
        'rel': 'original'
    },
    'timemap': {
        'url': 'https://web.archive.org/web/timemap/link/https://www.epa.gov/',
        'rel': 'timemap',
        'type': 'application/link-format'
    },
    'first memento': {
        # This URL is in `view` mode, not `original`!
        'url': 'https://web.archive.org/web/19970418120600/http://www.epa.gov:80/',
        'rel': 'first memento',
        'datetime': 'Fri, 18 Apr 1997 12:06:00 GMT'
    },
    # ...more links cut for brevity...
}

The suggested use for these links is to pass their URLs directly to get_memento(), but that might get you a memento in a different mode than the one you started with! It’s a footgun.
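To make the mismatch concrete, here is a minimal sketch of checking which mode a link URL is in. It assumes the usual Wayback URL layout, where an optional two-letter suffix after the timestamp selects the mode (an empty suffix for view mode, `id_` for original). The helper name `memento_url_mode` is hypothetical and not part of wayback.

```python
import re

# Assumed layout: https://web.archive.org/web/<timestamp><mode>/<captured URL>
# where <mode> is empty for view mode or a suffix such as "id_" for original.
MEMENTO_URL = re.compile(r'^https?://web\.archive\.org/web/(\d+)([a-z]{2}_)?/')


def memento_url_mode(url):
    """Return the mode suffix of a memento URL, or None if `url` does not
    look like a memento URL at all (hypothetical helper)."""
    match = MEMENTO_URL.match(url)
    if match is None:
        return None
    # An absent suffix means view mode, represented here as ''.
    return match.group(2) or ''
```

With this, the `first memento` URL from the example reports an empty suffix (view mode), while the timemap URL is not recognized as a memento at all.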

Some options here:

  1. Drop the links attribute on Memento for now. Users can parse the Link header(s) themselves if they want that data, and are responsible for using it appropriately. (In this case, we also need to reopen #57.)

  2. Update the url field on any link that references a memento to match the mode of the Memento object the link is attached to.

    Side note: how do we identify which things are mementos? Look for "memento" as a substring in the rel field? Look for url fields that match known memento URL patterns?

  3. Instead of the values in links being dictionaries, make them a more useful kind of data object. References to other mementos might be more like our CdxRecord objects, where the url is the captured URL (e.g. http://www.epa.gov/ instead of the memento URL), the timestamp is a datetime object, etc.

    • This one’s pretty complicated! It’s how I envisioned this feature might evolve, but isn’t obviously worthwhile in the short term.
    • I don’t know the complete universe of possible object types (it’s not just mementos, see the first two entries in the example above) and technically what goes here is pretty arbitrary. How do we future-proof things we haven’t modeled yet?
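As a rough illustration of option (2), the sketch below rewrites link URLs to a given mode. It takes the "match known memento URL patterns" answer to the side note rather than inspecting the rel field, and it assumes the Wayback URL layout `https://web.archive.org/web/<timestamp><mode>/<url>`. The names `with_mode` and `links_in_mode` are hypothetical, not existing wayback functions.

```python
import re

# Assumed memento URL layout; anything else (the original URL, the timemap)
# does not match and is left untouched.
MEMENTO_URL = re.compile(
    r'^(https?://web\.archive\.org/web/)(\d+)([a-z]{2}_)?/(.+)$'
)


def with_mode(url, mode):
    """Return `url` with its mode suffix replaced by `mode` (e.g. 'id_' for
    original, '' for view). Non-memento URLs are returned unchanged."""
    match = MEMENTO_URL.match(url)
    if match is None:
        return url
    prefix, timestamp, _old_mode, target = match.groups()
    return f'{prefix}{timestamp}{mode}/{target}'


def links_in_mode(links, mode):
    """Return a copy of a parsed links dict with every memento URL rewritten
    to `mode`. The input dict is not modified."""
    return {
        rel: {**link, 'url': with_mode(link['url'], mode)}
        for rel, link in links.items()
    }
```

Applied to the example above with a mode of `'id_'`, only the `first memento` entry changes. One trade-off of matching on URL shape is that it is tied to web.archive.org and would need revisiting for other memento hosts.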

I think (3) has too many open questions, but we should do (1) or (2) before cutting a 0.4.1 release.
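For comparison, a very rough sketch of what option (3) could look like. The `MementoLink` class and `parse_link` function are hypothetical, the 14-digit timestamp format in the URL is an assumption, and links that are not recognized as mementos are simply passed through as dictionaries, which sidesteps rather than answers the future-proofing question.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed memento URL layout with a 14-digit YYYYMMDDhhmmss timestamp.
MEMENTO_URL = re.compile(
    r'^https?://web\.archive\.org/web/(\d{14})([a-z]{2}_)?/(.+)$'
)


@dataclass
class MementoLink:
    """Hypothetical CdxRecord-like reference to another memento."""
    rel: str
    url: str             # the captured URL, not the memento URL
    timestamp: datetime  # capture time, in UTC


def parse_link(link):
    """Turn a parsed link dict into a MementoLink when its URL looks like a
    memento URL; otherwise return the dict unchanged."""
    match = MEMENTO_URL.match(link['url'])
    if match is None:
        return link
    timestamp = datetime.strptime(match.group(1), '%Y%m%d%H%M%S')
    return MementoLink(
        rel=link['rel'],
        url=match.group(3),
        timestamp=timestamp.replace(tzinfo=timezone.utc),
    )
```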