edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Should get_memento() ignore the mode in archive.org URLs? #115

Open Mr0grog opened 1 year ago

Mr0grog commented 1 year ago

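Currently, get_memento() can be called in a few different ways: with a CDX record, with a plain URL plus a timestamp, or with a full archive.org URL (which already contains a timestamp and a mode).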

Folks using this library will usually want mode=Mode.original, which is the default. But because an archive URL has the mode baked in, we currently obey whatever mode is in the URL.
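
To make "baked in" concrete: in a Wayback URL, the mode is the short suffix right after the timestamp (nothing for view mode, id_ for original, and so on). The sketch below pulls that suffix out with a regular expression. split_archive_url is a hypothetical helper written for illustration, not part of the wayback package, and the pattern assumes full 14-digit timestamps like the ones in this issue.

```python
import re

# Assumed URL shape: /web/<14-digit timestamp><optional mode suffix>/<original URL>
ARCHIVE_URL = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<timestamp>\d{14})(?P<mode>[a-z]{2}_)?/(?P<url>.+)$"
)


def split_archive_url(archive_url):
    """Return (timestamp, mode_suffix, original_url). Hypothetical helper."""
    match = ARCHIVE_URL.match(archive_url)
    if match is None:
        raise ValueError(f"Not a recognized archive URL: {archive_url}")
    # A missing suffix means view mode, represented here as an empty string.
    return match["timestamp"], match["mode"] or "", match["url"]
```

A URL copied straight from the browser address bar typically has no suffix, so it would come back with an empty mode string, i.e. view mode.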

The problem is that mode is a somewhat advanced concept and requires extra thinking about what you’re asking for. Folks are prone to copying a URL from their browser and dropping it in here to try things out, or to accidentally using cdx_record.view_url instead of just passing the CDX record directly, without realizing that they are changing modes (or what that even means!). For example, #109 uncovered a legitimate issue with view mode, but the user didn’t actually want to be using view mode at all! (Once I explained that, it turned out the actual issue wasn’t even a blocker for him; he switched to original mode and was good to go.)

So: should calling get_memento(archived_url) ignore the mode that’s in the URL and use whatever one is explicitly set as a parameter instead (as in all other cases, defaulting to original)? For example:

client.get_memento("https://web.archive.org/web/20230101000000/https://www.epa.gov/")

This currently gets you a memento in view mode. With the change I’m considering, you’d get original mode here instead. If you wanted view mode, you’d have to ask for it explicitly:

client.get_memento("https://web.archive.org/web/20230101000000/https://www.epa.gov/", mode=Mode.view)

It would also mean all these calls get you the same result, instead of different ones:

client.get_memento("https://web.archive.org/web/20230101000000/https://www.epa.gov/")
client.get_memento("https://web.archive.org/web/20230101000000id_/https://www.epa.gov/")
client.get_memento("https://web.archive.org/web/20230101000000js_/https://www.epa.gov/")
client.get_memento("https://web.archive.org/web/20230101000000cs_/https://www.epa.gov/")
client.get_memento("https://web.archive.org/web/20230101000000im_/https://www.epa.gov/")
# Note different mode values ---------------------------------^^^
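
A rough sketch of the proposed behavior, assuming the URL shape shown above (a 14-digit timestamp followed by an optional mode suffix): the suffix in the incoming URL is discarded, and the URL to request is rebuilt from the explicit mode argument, which defaults to original (id_). resolve_memento_url and its constants are hypothetical names for illustration, not the library's implementation.

```python
import re

ORIGINAL = "id_"  # suffix for original mode
VIEW = ""         # view mode has no suffix

# Assumed shape: /web/<14-digit timestamp><optional mode suffix>/<original URL>
_ARCHIVE_URL = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$"
)


def resolve_memento_url(archive_url, mode=ORIGINAL):
    """Rebuild an archive URL using only the explicit mode (hypothetical)."""
    match = _ARCHIVE_URL.match(archive_url)
    if match is None:
        raise ValueError(f"Not a recognized archive URL: {archive_url}")
    # The mode suffix from the input is deliberately not captured.
    timestamp, url = match.groups()
    return f"https://web.archive.org/web/{timestamp}{mode}/{url}"


base = "https://web.archive.org/web/20230101000000{}/https://www.epa.gov/"
resolved = {
    resolve_memento_url(base.format(suffix))
    for suffix in ["", "id_", "js_", "cs_", "im_"]
}
print(len(resolved))  # the five inputs collapse to a single URL
```

Under this approach, view mode is only reachable by passing mode=VIEW explicitly, which matches the trade-off described above.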