I’ve been meaning to do this for a while, but haven’t actually gotten to it. We have a box that pokes the Wayback Machine’s Save Page Now feature to save a list of URLs every other day. It uses the code in https://github.com/Mr0grog/wayback-spn-client
This was initially an experiment to see if we could streamline the process of ensuring Wayback is archiving pages we care about, which used to be:
Analyst [Team Lead]
→ Slacks @Mr0grog
→ Slacks/E-mails Wayback Folks (and maybe gets a little bit of a
runaround when this task has been handed off to someone else
over there)
→ Someday a Wayback Engineer adds it to the config for the
crawler doing our work.
It was also meant to make us less vulnerable to breakage on the machine that does our archiving over at the Internet Archive (which has fallen over a few times, and is why we have the health check script).
It’s been running for a while now, though, and I’m not sure I would still call it an experiment :)
Information about the machine’s setup and the list of URLs we run the script with should live here in the manually-managed directory. This is especially true since I update the list of URLs it saves every few days with whatever the healthcheck script has found that Wayback is not actively monitoring. That list lives only on the server and my hard drive, which is bad bad bad.
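For whoever picks this up: the manual update step described above could be as small as the sketch below. This is only an illustration, not what currently runs on the server. The function name `merge_url_lists` and the one-URL-per-line list format are assumptions on my part, not something taken from wayback-spn-client or the health check script.

```python
def merge_url_lists(current, unmonitored):
    """Return the current URLs plus any newly reported ones.

    Hypothetical helper: `current` is the list of URLs we already save,
    `unmonitored` is whatever the health check reported as not being
    monitored by Wayback. Order is preserved and duplicates are dropped.
    """
    seen = set()
    merged = []
    for url in list(current) + list(unmonitored):
        url = url.strip()  # tolerate stray whitespace and blank lines
        if url and url not in seen:
            seen.add(url)
            merged.append(url)
    return merged
```

Checking the merged list into the manually-managed directory after each update would mean it no longer lives only on the server and one hard drive.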