edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Mementos of redirects in view mode raise "could not be played" error #109

Closed by rhaksw 1 year ago

rhaksw commented 1 year ago

Hi, I'm getting an error from this code,

from wayback import WaybackClient, WaybackSession

wc = WaybackClient(session=WaybackSession(
    user_agent='agent-218947',
    timeout=10,
))
u = 'https://web.archive.org/web/20230212225711/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
memento = wc.get_memento(u, exact=False)

    raise MementoPlaybackError(f'Memento at {url} could not be played')
wayback.exceptions.MementoPlaybackError: Memento at https://web.archive.org/web/20230212225711/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/ could not be played

The comment in that section of the WaybackClient code states that this error should only occur if exact is True or if the target URL is outside the target_window. I don't think either of those applies, because I'm setting exact to False and the target URL has the same timestamp:

original url / target url (both are 20230212225711)

Anyone know what might cause this?

Mr0grog commented 1 year ago

Ah! It turns out the URL you are using effectively sets the mode parameter to Mode.view, which gets you a response designed for viewing in a web browser (it has lots of tweaks and extras and is not the original, archived HTTP response).

The URL you requested was a redirect, but in view mode, the Wayback Machine gives us a normal webpage (not a redirect) with info about where the redirect is going and pauses for a few seconds before redirecting with JavaScript. I obviously haven’t done rigorous enough testing with that playback mode (we almost always use the default, which is mode=Mode.original); it looks like it’s going to be a bit tricky to detect this scenario in a way that works even if the design of the Wayback Machine’s redirect page changes.
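To make the "URL sets the mode" point concrete, here is a minimal sketch of how the mode can be read off a memento URL. It assumes the usual Wayback Machine layout of /web/<timestamp><mode>/<original URL>, where a missing suffix means view mode and id_ means the original response. The helper name playback_mode and the regular expression are illustrative only and are not part of the wayback package.

```python
import re

# Assumed layout: https://web.archive.org/web/<timestamp><mode>/<original URL>
# The mode is an optional two-letter suffix followed by "_" (e.g. "id_").
MEMENTO_URL = re.compile(
    r'^https?://web\.archive\.org/web/(\d{1,14})([a-z]{2}_)?/(.+)$'
)


def playback_mode(memento_url):
    """Return the mode suffix of a memento URL ('' means view mode).

    Hypothetical helper for illustration; not a wayback API.
    """
    match = MEMENTO_URL.match(memento_url)
    if match is None:
        raise ValueError(f'Not a Wayback Machine memento URL: {memento_url}')
    return match.group(2) or ''


view = 'https://web.archive.org/web/20230212225711/https://www.reddit.com/'
raw = 'https://web.archive.org/web/20230212225711id_/https://www.reddit.com/'

print(repr(playback_mode(view)))  # empty suffix: view mode
print(repr(playback_mode(raw)))   # 'id_' suffix: original mode
```

A URL copied from the browser address bar normally has no suffix, which is why it ends up in view mode.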

That said, did you intend to use mode=Mode.view? If not, you should either:

  1. (Recommended) Don’t use the full Internet Archive URL when requesting a memento. Instead, use the URL of the page you want and the timestamp parameter:

    url = 'https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
    client.get_memento(url, timestamp='20230212225711', exact=False)

  2. Or make sure to append id_ to the end of the timestamp portion of the URL to set the mode:

    url = 'https://web.archive.org/web/20230212225711id_/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
    client.get_memento(url, exact=False)
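
If you already have a full view-mode URL in hand, both options above can be derived from it mechanically. The following is a rough sketch, assuming the /web/<timestamp><mode>/<original URL> layout; split_memento_url and to_original_mode are hypothetical helper names, not functions from the wayback package.

```python
import re

# Assumed layout: https://web.archive.org/web/<timestamp><mode>/<original URL>
MEMENTO_URL = re.compile(
    r'^https?://web\.archive\.org/web/(\d{1,14})([a-z]{2}_)?/(.+)$'
)


def _parse(memento_url):
    """Return (timestamp, mode, original_url) or raise ValueError."""
    match = MEMENTO_URL.match(memento_url)
    if match is None:
        raise ValueError(f'Not a Wayback Machine memento URL: {memento_url}')
    timestamp, mode, original_url = match.groups()
    return timestamp, mode or '', original_url


def split_memento_url(memento_url):
    """Option 1: recover the original URL and timestamp as separate values."""
    timestamp, _mode, original_url = _parse(memento_url)
    return original_url, timestamp


def to_original_mode(memento_url):
    """Option 2: rebuild the memento URL with the id_ suffix on the timestamp."""
    timestamp, _mode, original_url = _parse(memento_url)
    return f'https://web.archive.org/web/{timestamp}id_/{original_url}'


view = 'https://web.archive.org/web/20230212225711/https://www.reddit.com/r/Suomi/'

url, timestamp = split_memento_url(view)
print(url, timestamp)
print(to_original_mode(view))

# Either result can then be handed to the client, e.g.:
#   client.get_memento(url, timestamp=timestamp, exact=False)
#   client.get_memento(to_original_mode(view), exact=False)
```

The first form keeps the archive host out of your own data, which is one reason to prefer it.
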
rhaksw commented 1 year ago

Thank you for this detailed explanation. You're correct that I intended to use mode=Mode.original. Now I've done that via the second solution you gave. I didn't mention it before, but the URL I was using came from the view_url of results of search(). I switched to use the raw_url. Maybe I'll go back later and use the original url and timestamp instead as you recommend.

Mr0grog commented 1 year ago

> the URL I was using came from the view_url of results of search(). I switched to use the raw_url.

If you are using CdxRecord objects from the search() method, you can just pass them directly to get_memento() and it’ll pull out the right values for you! It’s a little easier that way:

for record in client.search('https://somewhere.com/', ...):
    client.get_memento(record, exact=False)  # gets `original` mode by default
    # or: client.get_memento(record, mode=wayback.Mode.view, exact=False)

rhaksw commented 1 year ago

Thank you, that is indeed easier. I must've missed it when first reading the docs and coding this up.