agude / wayback-machine-archiver

A Python script to submit web pages to the Wayback Machine for archiving.
https://pypi.org/project/wayback-machine-archiver/
MIT License

flag for setting no to "save error pages" #10

Open test2a opened 4 years ago

test2a commented 4 years ago

Hi. Backing up Twitter is throwing error pages, but if we manually add the URL on http://web.archive.org/save and uncheck "save error pages", the page is saved. It must be a problem with Twitter or something. Anyway, is there a flag with which we can unset this error-page setting? I am hoping this would help.

agude commented 4 years ago

To answer the question directly: no, that flag is not currently supported. But I would be interested in getting it to work!

Some Exploration

It looks like when the box is unchecked, the request body contains just a url parameter; when it is checked, it also sends capture_all=on.
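
If you want to poke at it yourself, here is a rough sketch of that form submission with requests. The endpoint and the url/capture_all field names are just what I observed from the save form, not a documented API, so treat them as assumptions:

import requests

# Mimic the web.archive.org/save form (observed behavior, not a documented API).
# Leaving capture_all out is the same as leaving "save error pages" unchecked.
data = {"url": "https://twitter.com/example"}  # placeholder for the page to archive
# data["capture_all"] = "on"  # add this to match the checked box

response = requests.post("https://web.archive.org/save", data=data)
response.raise_for_status()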

I'll see what I can do. I noticed they have an API now, so I've sent an email to Archive.org for more information.

test2a commented 4 years ago

That's great. I have a further question, but I didn't want to start another issue so I'll ask here. If I have to send a bunch of URLs to back up at once, can I use a text file? All I can find is about using an XML file. So, if I created an XML file with the URLs, would that work?

Edit: oh, my bad. Found what I was looking for. Thanks anyway.

agude commented 4 years ago

Yes, there is, and I don't yet have it documented in the README, oops!

Here is how:

Create a text file with one url per line, like this:

https://google.com
https://amazon.com

Let's say that's named urls.txt. Then call the script like this:

archiver --file urls.txt

If you are saving a large number of pages, you might want to set --rate-limit-wait to a large number, because Archive.org will rate limit and then block you if you hammer them too hard, too fast. I've had it happen to me, which is why the default rate limit in the script is 5 seconds.
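
For example, to work through a big list gently, you might run something like:

archiver --file urls.txt --rate-limit-wait=60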

test2a commented 4 years ago

Oh, I am using a script to get a list of URLs and yeah, I used exactly that. Found it in the --help, actually. Anyways, when I tried to do

from wayback_machine_archiver import archiver

and then

archiver (variable)

it says

archiver (variable) TypeError: 'module' object is not callable

Now, I further tried

wayback_machine_archiver.archiver (url)

but it resulted in

wayback_machine_archiver.archiver (url) NameError: name 'wayback_machine_archiver' is not defined

Now, I have managed to bypass this error by writing my "url" variable, which prints on screen, to a text file that I then feed into archiver using

archiver --file textfilename

Is it possible to make archiver accept the URLs via a variable that outputs them one line at a time?

Thanks a bunch. It's really appreciated.

url is my variable that contains the list of URLs.

test2a commented 4 years ago

Oh, also, do we need to do some sort of delta check with the text file to see if the URLs have already been archived, or does the Wayback Machine accept the whole thing just like that? Second, isn't there some sort of feedback? Saved this URL, didn't save this URL, something else?

I'm sorry for bugging you with these trivial things.

agude commented 4 years ago

Oh, good questions. Let me try to summarize them and answer:

Can I use archiver as a library so my own script can easily back up URLs?

Not in its current state. You could import the individual functions and write a little glue code, though. Something like:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from wayback_machine_archiver.archiver import format_archive_url, call_archiver

# Set up the requests session
session = requests.Session()

retries = Retry(
    total=5,
    backoff_factor=5,
    status_forcelist=[500, 502, 503, 504],
)

session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# Backup a URL that is stored in a string called url
formatted_url = format_archive_url(url)
call_archiver(formatted_url, rate_limit_wait=5, session=session)

Do I have to check for duplicate URLs?

~Yes. I don't check for duplicates and neither does Internet Archive (except they might return an error, because they don't allow more than one backup per unique URL per 10 minutes).~

~I think it would be reasonable for me to de-duplicate the URLs before archiving them though. I'll open a bug for that and fix it tonight.~

If you're using the script on the command line, no you do not. I now check for duplicates starting in version 1.6.0.

That, of course, won't help if you use the code suggestion above to call my code as a library.
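
If that matters for your script, deduplicating before calling is easy; here is a minimal sketch that builds on the glue code above (urls stands in for whatever list of URLs your script produces):

import requests
from wayback_machine_archiver.archiver import format_archive_url, call_archiver

# Reuse the retry-configured session from the snippet above if you have it.
session = requests.Session()

# "urls" is whatever list of URLs your script builds; a set drops duplicates.
for page in set(urls):
    call_archiver(format_archive_url(page), rate_limit_wait=5, session=session)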

One Last Thought...

If you're on Linux, and your program is outputting URLs on stdout, you could do something like:

test2a_program | xargs archiver

This assumes you output all of the URLs at once; if you output them one after the other, you could use sponge (from moreutils):

test2a_program | sponge | xargs archiver

test2a commented 4 years ago

Twitter is not working. The page saves, but it says "not found" in the snapshot. For the past few days, I am also not seeing a snapshot even when I click the button on the website.

agude commented 4 years ago

Does it work if you go to the Wayback Machine website and archive the Twitter page?

I haven't changed anything in the script, so it's possible they changed something on the backend.

test2a commented 4 years ago

Nah. Even the website web.archive.org/save doesn't work. A week ago I was able to uncheck "show error" and "snapshot", but both seem unresponsive today.

test2a commented 4 years ago

I am still testing. I was able to save a Twitter link, but photos aren't coming up. I will continue testing more URLs and report my findings.

lauhaide commented 4 years ago

Hi all, I see that above you talk about duplicate URLs.

When the archiver hits web.archive.org/save/, what happens if the URL was already archived? Does it replace the previous version, or is a new timestamp added?

Another question: is it possible to recover the timestamp when saving?

Thanks!

test2a commented 4 years ago

@lauhaide I think archiver saves a new copy of the URL with the current date and time in the Wayback Machine, so you can see pages over time, and no, it does not overwrite anything.

On the second one, I am not sure what you mean. When the page is saved, it saves the metadata along with it, so you can see that.

agude commented 4 years ago

Hi @lauhaide!

This script doesn't overwrite, as it were, because it asks The Wayback Machine to save a snapshot of the current page. As you can see here with the Yahoo.com archive, there are multiple snapshots stored each day.

As for recovering the timestamp, you could get that from the Wayback Machine itself (you'll see each snapshot is timestamped on the Yahoo page for example), but that's not something this tool supports.
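
If you want it programmatically, the Wayback Machine also has a public availability endpoint (separate from this script) that reports the snapshot it has for a URL along with its timestamp; a rough sketch:

import requests

# Ask the public Wayback availability API which snapshot it has for a page.
response = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "https://yahoo.com"},
)
closest = response.json().get("archived_snapshots", {}).get("closest", {})
print(closest.get("timestamp"))  # a 14-digit YYYYMMDDhhmmss string, or None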

If you are on Linux, you could do something like this:

archiver https://yahoo.com --log DEBUG 2>&1 | ts '[%Y-%m-%d %H:%M:%S]'

That would timestamp every line of the debug output like this:

[2020-08-11 13:17:12] DEBUG:root:Arguments: Namespace(archive_sitemap=False, file=None, jobs=1, log_file=None, log_level='DEBUG', rate_limit_in_sec=5, sitemaps=[], urls=['https://yahoo.com'])
[2020-08-11 13:17:12] INFO:root:Adding page URLs to archive
[2020-08-11 13:17:12] DEBUG:root:Page URLs to archive: ['https://yahoo.com']
[2020-08-11 13:17:12] DEBUG:root:Creating archive URL for https://yahoo.com
[2020-08-11 13:17:12] INFO:root:Parsing sitemaps
[2020-08-11 13:17:12] DEBUG:root:Archive URLs: {'https://web.archive.org/save/https://yahoo.com'}
[2020-08-11 13:17:13] DEBUG:root:Sleeping for 5
[2020-08-11 13:17:18] INFO:root:Calling archive url https://web.archive.org/save/https://yahoo.com
[2020-08-11 13:17:18] DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): web.archive.org:443
[2020-08-11 13:17:18] DEBUG:urllib3.connectionpool:https://web.archive.org:443 "HEAD /save/https://yahoo.com HTTP/1.1" 301 0
[2020-08-11 13:17:30] DEBUG:urllib3.connectionpool:https://web.archive.org:443 "HEAD /save/https://www.yahoo.com/ HTTP/1.1" 200 0

You could use that to get a rough timestamp.

I will add: I don't consider DEBUG messages as part of the public API, so I might break your script with a minor update, but probably won't. :-)

lauhaide commented 4 years ago

Thanks @agude, @test2a for your prompt replies. It's clear to me now: save will add a new backup of the URL.

As for the timestamp for further retrieval, it could be something like this (the time doesn't seem to be necessary):

http://web.archive.org/web/20200811*/URL

One last question about --rate-limit-wait (as mentioned in the posts above): for a large number of pages to archive, what minimum value would you recommend?

PS. no problem with the code update :-)

agude commented 4 years ago

I run this script to back up my personal site every evening. It's about 100 pages, and I run with --rate-limit-wait=60. It completes most of the time, but every few weeks it'll error out due to rate limiting from the Internet Archive.

So I don't have an exact number for you, but I would say closer to 30-60 seconds than 1-2. :-)

lauhaide commented 4 years ago

Thanks @agude, I had started running with --rate-limit-wait=5 and it is running. Will it log if the request gets an error?

agude commented 4 years ago

@lauhaide: The program will throw an error and terminate when it fails. Right here:

https://github.com/agude/wayback-machine-archiver/blob/master/wayback_machine_archiver/archiver.py#L36-L41
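
If you end up calling the functions from your own script (like the glue code earlier in this thread) and want to log the failure instead of crashing, you could wrap the call yourself; a minimal sketch, assuming the failure surfaces as a requests exception:

import logging
import requests
from wayback_machine_archiver.archiver import format_archive_url, call_archiver

session = requests.Session()  # or the retry-configured session from the earlier snippet
formatted_url = format_archive_url("https://example.com")

try:
    call_archiver(formatted_url, rate_limit_wait=5, session=session)
except requests.exceptions.RequestException:
    logging.exception("Failed to archive %s", formatted_url)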

lauhaide commented 4 years ago

Thanks a lot @agude !!!