edgi-govdata-archiving / web-monitoring-ops

Documentation and configuration files for EDGI’s deployment of Web Monitoring tools.
GNU General Public License v3.0

Describe Wayback SPN Tools in “Manually Managed” Section #14

Closed Mr0grog closed 5 years ago

Mr0grog commented 5 years ago

I’ve been meaning to do this for a while, but haven’t actually gotten to it. We have a box that pokes the Wayback Machine’s Save Page Now feature to save a list of URLs every other day. It uses the code in https://github.com/Mr0grog/wayback-spn-client
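For anyone unfamiliar with what "poking Save Page Now" means in practice, here is a minimal sketch. It is not the code from wayback-spn-client. It assumes the simple public `https://web.archive.org/save/<url>` endpoint and a plain-text list file with one URL per line; the function names, the `#` comment convention, and the User-Agent string are all hypothetical.

```python
from urllib.request import Request, urlopen

# Assumption: the simple public Save Page Now endpoint, which captures
# whatever URL is appended to it.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def parse_url_list(text):
    """Return the non-empty, non-comment lines of a URL list file.

    The one-URL-per-line format with '#' comments is an assumption.
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def save_request_url(url):
    """Build the Save Page Now request URL for a page we want captured."""
    return SAVE_ENDPOINT + url


def _fetch_status(request_url):
    """Make the real HTTP request and return the response status code."""
    request = Request(request_url, headers={"User-Agent": "spn-sketch (hypothetical)"})
    with urlopen(request, timeout=120) as response:
        return response.status


def save_all(urls, fetch=None):
    """Request a capture for each URL; return {url: status or error text}.

    `fetch` is injectable so the loop can be exercised without the network.
    """
    fetch = fetch or _fetch_status
    results = {}
    for url in urls:
        try:
            results[url] = fetch(save_request_url(url))
        except OSError as error:  # URLError and HTTPError are OSError subclasses
            results[url] = str(error)
    return results
```

Running this every other day would be handled outside the script (a cron entry, for example); how the actual box schedules it is exactly the kind of detail that should be written down in this repo.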

This was initially an experiment to see if we could streamline the process of ensuring Wayback is archiving pages we care about, which used to be:

Analyst [Team Lead]
  → Slacks @Mr0grog
    → Slacks/E-mails Wayback Folks (and maybe gets a little bit of a
      runaround when this task has been handed off to someone else
      over there)
      → Someday a Wayback Engineer adds it to the config for the
        crawler doing our work.

It was also meant to make us less vulnerable to breakage on the machine that does our archiving over at the Internet Archive (which has fallen over a few times, and is why we have this health check script).

It’s been running for a while now, though, and I’m not sure I would still call it an experiment :)

Information about the machine’s setup and the list of URLs we run the script with should live here in the manually-managed directory. This is especially true since I update the list of URLs it saves every few days with whatever the health check script has found that Wayback is not actively monitoring. That list lives only on the server and my hard drive, which is bad bad bad.
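Once the list is checked in, the every-few-days update could be a small merge step instead of a hand edit. A rough sketch, assuming both the current list and the health check's findings are available as plain lists of URL strings (the function name and that input format are hypothetical, not what the health check script actually outputs):

```python
def merge_url_lists(current, unmonitored):
    """Append newly found URLs to the current list, skipping duplicates.

    Order is preserved so diffs of the checked-in list stay readable.
    """
    seen = set()
    merged = []
    for url in list(current) + list(unmonitored):
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            merged.append(url)
    return merged
```

Writing the merged result back to the file in this repo would then give the list a history, instead of it living only on the server.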