Open cmungall opened 9 years ago
@cmungall, @alanruttenberg, and I now have access to the PURL server via SSH and to DNS via the new account. Since the server is on Amazon Web Services, I've added a CloudWatch alert that will email the three of us and also text me in case the server fails a High Status Check for a few minutes.
That alert only tells us that the server is down, not that it's misbehaving in some other way. We should have a heartbeat monitor. A few hand-picked PURLs might be good enough. If we want to do better, the test.py script generates about 1000 test cases that we could draw from.
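To make the "few hand-picked PURLs" idea concrete, here is a minimal sketch of what such a heartbeat check could look like. It is not taken from test.py; the URL list and the function names `fetch_status` and `check_purls` are made up for illustration, and treating any 2xx status after redirects as "healthy" is an assumption that may be too loose for some PURLs.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple

# Illustrative picks only (assumption); the real list would be chosen by hand
# or sampled from the cases that test.py generates.
SAMPLE_PURLS = [
    "http://purl.obolibrary.org/obo/go.owl",
    "http://purl.obolibrary.org/obo/uberon.owl",
]


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Open the URL, following redirects, and return the final HTTP status.

    The response body is not read. Returns 0 when no HTTP response was
    received at all (DNS failure, refused connection, timeout).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return 0


def check_purls(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> List[Tuple[str, int]]:
    """Return (url, status) for every URL that did not end in a 2xx status."""
    failures = []
    for url in urls:
        status = fetch(url)
        if not 200 <= status < 300:
            failures.append((url, status))
    return failures
```

A Jenkins job (or a cron job) could call `check_purls(SAMPLE_PURLS)` and fail the build whenever the returned list is non-empty. The `fetch` parameter is there only so the check can be exercised without touching the network.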
For a heartbeat monitor, the most basic thing might be another Jenkins job on http://build.berkeleybop.org. That would be publicly visible, but admin requires authentication. Chris, his team, and I have accounts there. I've never used the other tools Chris mentions.
In case of emergency, if the server is still up, then one of the three of us can log in, restart Apache, fiddle with Git and run Make again. If the server goes down, only I have permission to bring it back up, since it's my AWS account (under my credit card). Someone else can create a new server quickly using Ansible and the README, and switch DNS. That should keep downtime to a few hours.
Noting that this is at least partially covered at https://status.obofoundry.org/. Current email recipients are myself and @cmungall. Recipients can be swapped or added as others are trained. This account is under my name; I would recommend moving it to a new, separate account at some point, but this is a step forward.
Companion to: https://github.com/OBOFoundry/OBOFoundry.github.io/issues/145
nagios/munin/uptime-robot...?
How would this be done? Do we have a test URL? Random ontology?