Closed rhaksw closed 1 year ago
Ah! It turns out the URL you are using effectively sets the mode parameter to Mode.view, which gets you a response designed for viewing in a web browser (it has lots of tweaks and extras and is not the original, archived HTTP response).
The URL you requested was a redirect, but in view mode, the Wayback Machine gives us a normal webpage (not a redirect) with info about where the redirect is going and pauses for a few seconds before redirecting with JavaScript. I obviously haven’t done rigorous-enough testing with that playback mode (we almost always use the default, which is mode=Mode.original); it looks like it’s going to be a bit tricky to detect this scenario in a way that works even if the design of the Wayback Machine’s redirect page changes.
That said, did you intend to use mode=Mode.view? If not, you should either:
(Recommended) Don’t use the full Internet Archive URL when requesting a memento. Instead, use the URL of the page you want and the timestamp parameter:
url = 'https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
client.get_memento(url, timestamp='20230212225711', exact=False)
Or make sure to append id_ to the end of the timestamp portion of the URL to set the mode:
url = 'https://web.archive.org/web/20230212225711id_/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
client.get_memento(url, exact=False)
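If you already have a full Wayback URL and want to apply either option without editing it by hand, something along these lines might help. This is only a sketch: split_wayback_url and to_original_mode_url are hypothetical helper names (not part of the wayback package), and the regular expression assumes the usual https://web.archive.org/web/<timestamp><mode>/<url> layout with a two-letter mode suffix such as id_.

```python
import re

# Assumed layout: https://web.archive.org/web/<timestamp><optional mode>/<original URL>
WAYBACK_URL = re.compile(
    r'^(?P<prefix>https?://web\.archive\.org/web/)'
    r'(?P<timestamp>\d{1,14})'
    r'(?P<mode>[a-z]{2}_)?'
    r'/(?P<url>.+)$'
)


def split_wayback_url(memento_url):
    """Return (original_url, timestamp), i.e. the inputs for the recommended option."""
    match = WAYBACK_URL.match(memento_url)
    if match is None:
        raise ValueError(f'Not a recognized Wayback URL: {memento_url!r}')
    return match.group('url'), match.group('timestamp')


def to_original_mode_url(memento_url):
    """Return the same Wayback URL with id_ placed right after the timestamp."""
    match = WAYBACK_URL.match(memento_url)
    if match is None:
        raise ValueError(f'Not a recognized Wayback URL: {memento_url!r}')
    # Any existing mode suffix is dropped and replaced with id_.
    return f"{match.group('prefix')}{match.group('timestamp')}id_/{match.group('url')}"
```

The pair returned by split_wayback_url can be passed on as client.get_memento(url, timestamp=timestamp, exact=False), while the result of to_original_mode_url can be passed as the single URL argument, matching the two options above.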
Thank you for this detailed explanation. You're correct that I intended to use mode=Mode.original. Now I've done that via the second solution you gave. I didn't mention it before, but the URL I was using came from the view_url of results of search(). I switched to use the raw_url. Maybe I'll go back later and use the original URL and timestamp instead, as you recommend.
the URL I was using came from the view_url of results of search(). I switched to use the raw_url.
If you are using CdxRecord objects from the search() method, you can just pass them directly to get_memento() and it’ll pull out the right values for you! It’s a little easier that way:
for record in client.search('https://somewhere.com/', ...):
    client.get_memento(record, exact=False)  # gets `original` mode by default
    # or: client.get_memento(record, mode=wayback.Mode.view, exact=False)
Thank you, that is indeed easier. I must've missed it when first reading the docs and coding this up.
Hi, I'm getting an error from this code,
The comment in that section of the WaybackClient code states that this error should only occur if exact is True or if the target URL is outside the target_window. I don't think either of those apply because I'm setting exact to False and the target URL has the same timestamp:
original url / target url (both are 20230212225711)
Anyone know what might cause this?