test2a opened 4 years ago
To answer the question directly: no, that flag is not currently supported. But I would be interested in getting it to work!
Looks like when the box is unchecked, the request body is just a url parameter; when it's checked, it also sends capture_all=on. Compare:
url=twitter.com%2Fnoahpinion
url=twitter.com%2Fnoahpinion&capture_all=on
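For reference, you could reproduce that request with requests. This is only a rough sketch based on the form bodies above; the endpoint and field names are just what the browser sends, not a documented API:
import requests

# Mimic the save form: without the checkbox only "url" is sent;
# checking it adds "capture_all=on" as well.
requests.post(
    "https://web.archive.org/save/",
    data={"url": "twitter.com/noahpinion", "capture_all": "on"},
)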
I'll see what I can do. I noticed they have an API now, so I've sent an email to Archive.org for more information.
That's great. I have a further question, but I didn't want to start another issue, so I'll ask here. If I have to send a bunch of URLs to back up at once, can I use a text file? All I can find is about using an XML sitemap. So, if I created an XML file with the URLs, would that work?
Edit: Oh, my bad, found what I was looking for. Thanks anyway.
Yes, there is, and I don't yet have it documented in the README, oops!
Here is how:
Create a text file with one url per line, like this:
https://google.com
https://amazon.com
Let's say that's named urls.txt. Then call the script like this:
archiver --file urls.txt
If you are saving a large number of pages, you might want to set --rate-limit-wait to a large number, because Archive.org will rate limit and then block you if you hammer them too hard, too fast. I've had it happen to me, which is why the default rate limit in the script is 5 seconds.
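For example, something like this would wait 30 seconds between requests (the value is just an illustration):
archiver --file urls.txt --rate-limit-wait 30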
Oh, I am using a script to get a list of URLs and yeah, I used exactly that; found it in the --help actually. Anyway, when I tried to do
from wayback_machine_archiver import archiver
and then
archiver(variable)
it says
TypeError: 'module' object is not callable
I then further used
wayback_machine_archiver.archiver(url)
but it resulted in
NameError: name 'wayback_machine_archiver' is not defined
Now, I have managed to bypass this error by writing my "url" variable (which prints onscreen) to a text file, which I then feed into archiver using
archiver --file textfilename
Is it possible to make archiver accept the URLs via a variable that outputs them one line at a time?
Thanks a bunch, it's really appreciated.
url is my variable that contains the list of URLs.
Oh, also, do we need to do some sort of delta check with the text file to see if the URLs have already been archived, or does the Wayback Machine accept the whole thing just like that? Second, isn't there some sort of feedback, like "saved this URL" or "didn't save this URL" or something else?
I'm sorry for bugging you with these trivial things.
Oh, good questions. Let me try to summarize them and answer:
Can I use archiver as a library so my own script can easily back up URLs?
Not in its current state. You could import the individual functions and write a little glue code, though. Something like:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from wayback_machine_archiver import format_archive_url, call_archiver
# Set up the requests session
session = requests.Session()
retries = Retry(
total=5,
backoff_factor=5,
status_forcelist=[500, 502, 503, 504],
)
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))
# Backup a URL that is stored in a string called url
formatted_url = format_archive_url(url)
call_archiver(formatted_url, rate_limit_wait=5, session=session)
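If your script has several URLs in a list, you could loop over them with the same session. A rough, untested sketch, assuming your list is named urls:
urls = ["https://google.com", "https://amazon.com"]
for url in urls:
    formatted_url = format_archive_url(url)
    call_archiver(formatted_url, rate_limit_wait=5, session=session)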
Do I have to check for duplicate URLs?
~Yes. I don't check for duplicates and neither does Internet Archive (except they might return an error, because they don't allow more than one backup per unique URL per 10 minutes).~
~I think it would be reasonable for me to de-duplicate the URLs before archiving them though. I'll open a bug for that and fix it tonight.~
If you're using the script on the command line, no you do not. I now check for duplicates starting in version 1.6.0.
That, of course, won't help if you use my code suggestion above to call it as a library.
If you're on Linux, and your program is outputting URLs on stdout, you could do something like:
test2a_program | xargs archiver
This assumes you output all of the URLs at once; if you output them one after the other, you could use sponge (from moreutils):
test2a_program | sponge | xargs archiver
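Alternatively, if you would rather stay in Python, you could write your variable out to a temporary file and shell out to the command-line tool, which is essentially what you are already doing by hand. A rough sketch (untested, and it assumes your URLs are in a list named urls):
import subprocess
import tempfile

urls = ["https://google.com", "https://amazon.com"]  # your list of URLs

# Write one URL per line to a temporary file, just like urls.txt above
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(urls))
    path = f.name

# Same as running: archiver --file <path> --rate-limit-wait 5
subprocess.run(["archiver", "--file", path, "--rate-limit-wait", "5"], check=True)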
Twitter is not working. The page saves but it says not found in the snapshot. For the past few days, I am also not seeing "snapshot" even when I click on the button on the website.
Does it work if you go to the Wayback Machine website and archive the Twitter page?
I haven't changed anything in the script, so it's possible they changed something on the backend.
Nah, even the website web.archive.org/save doesn't work. I was able to uncheck "show error" and "snapshot" like a week ago, but both seem unresponsive today.
I am still testing. I was able to save a Twitter link, but photos aren't coming up. I will continue testing more URLs and report my findings.
Hi all, I see that above you talk about duplicate URLs.
When the archiver calls web.archive.org/save/, what happens if the URL was already archived? Does it replace the previous version, or is a new time-stamp added?
Another question: is it possible to recover the time-stamp when saving?
Thanks!
@lauhaide I think archiver saves a new copy of the URL with the current date and time in the Wayback Machine, so you can see pages over time, and no, it does not overwrite anything.
For the second one, I am not sure what you mean. When the page is saved, it saves the metadata along with it, so you can see that.
Hi @lauhaide!
This script doesn't overwrite, as it were, because it asks The Wayback Machine to save a snapshot of the current page. As you can see here with the Yahoo.com archive there are multiple snapshots stored each day.
As for recovering the timestamp, you could get that from the Wayback Machine itself (you'll see each snapshot is timestamped on the Yahoo page for example), but that's not something this tool supports.
If you are on Linux, you could do something like this:
archiver https://yahoo.com --log DEBUG 2>&1 | ts '[%Y-%m-%d %H:%M:%S]'
That would timestamp every line of the debug output like this:
[2020-08-11 13:17:12] DEBUG:root:Arguments: Namespace(archive_sitemap=False, file=None, jobs=1, log_file=None, log_level='DEBUG', rate_limit_in_sec=5, sitemaps=[], urls=['https://yahoo.com'])
[2020-08-11 13:17:12] INFO:root:Adding page URLs to archive
[2020-08-11 13:17:12] DEBUG:root:Page URLs to archive: ['https://yahoo.com']
[2020-08-11 13:17:12] DEBUG:root:Creating archive URL for https://yahoo.com
[2020-08-11 13:17:12] INFO:root:Parsing sitemaps
[2020-08-11 13:17:12] DEBUG:root:Archive URLs: {'https://web.archive.org/save/https://yahoo.com'}
[2020-08-11 13:17:13] DEBUG:root:Sleeping for 5
[2020-08-11 13:17:18] INFO:root:Calling archive url https://web.archive.org/save/https://yahoo.com
[2020-08-11 13:17:18] DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): web.archive.org:443
[2020-08-11 13:17:18] DEBUG:urllib3.connectionpool:https://web.archive.org:443 "HEAD /save/https://yahoo.com HTTP/1.1" 301 0
[2020-08-11 13:17:30] DEBUG:urllib3.connectionpool:https://web.archive.org:443 "HEAD /save/https://www.yahoo.com/ HTTP/1.1" 200 0
Which you could use to get a rough timestamp from.
I will add: I don't consider DEBUG messages as part of the public API, so I might break your script with a minor update, but probably won't. :-)
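If you want the timestamp programmatically rather than from the logs, the Wayback Machine also has a public availability endpoint you could query after archiving. This is just a sketch, not part of this script, and the endpoint and JSON fields come from the Wayback Machine's availability API, so double-check them:
import requests

# Ask the Wayback Machine for the closest snapshot of a URL
resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "https://yahoo.com"},
)
closest = resp.json().get("archived_snapshots", {}).get("closest", {})
print(closest.get("timestamp"))  # e.g. "20200811131730"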
Thanks @agude, @test2a for your prompt replies. It's clear to me now: save will add a new backup of the URL.
As for the timestamp for further retrieval, it could be something like this (the time part seems not necessary):
http://web.archive.org/web/20200811*/URL
A last question about --rate-limit-wait (as mentioned in the posts above): for a large number of pages to archive, what minimum value would you recommend?
PS. no problem with the code update :-)
I run this script to back up my personal site every evening. It's about 100 pages, and I run with --rate-limit-wait=60. It completes most of the time, but every few weeks it'll error out due to rate limiting from the Internet Archive.
So I don't have an exact number for you, but I would say closer to 30-60 seconds than 1-2. :-)
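For reference, my nightly run is just a cron entry along these lines (a sketch; the file path and schedule are placeholders):
# Hypothetical crontab entry: every evening at 22:00, waiting 60 seconds between requests
0 22 * * * archiver --file /home/me/urls.txt --rate-limit-wait=60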
Thanks @agude, I had started running with --rate-limit-wait=5 and it is running. Will it log if a request gets an error?
@lauhaide: The program will throw an error and terminate when it fails. Right here:
Thanks a lot @agude !!!
Hi. Backing up Twitter is throwing error pages, but if we manually add the URL on http://web.archive.org/save and uncheck "save error pages", the page is saved. It must be a problem with Twitter or something. Anyway, is there a flag with which we can unset this error-page setting? I am hoping this would help.