
GOCDB Failover Automation Scripts/Dirs

Author: David Meredith + JK

[This file is written in Markdown and is best viewed in a Markdown-enabled viewer; see https://en.wikipedia.org/wiki/Markdown for more details.]

This repo contains the service and cron scripts used to run a failover GOCDB instance. It includes the following dirs:


/root/autoEngageFailover/

Start in this dir. It contains the 'gocdb-autofailover.sh' service script, which should be installed as a service in '/etc/init.d/gocdb-autofailover'. This service invokes 'engageFailover.sh', which monitors the production instance with a ping-check. If a continued outage is detected, the script starts the failover procedure, which (broadly) re-points the 'goc.egi.eu' DNS alias at the failover instance, disables the hourly DB import cron, and drops an 'engage.lock' file; the Restore Walkthrough below reverts these steps. A sketch of the ping-check loop is shown below.
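The monitoring loop is conceptually similar to the following sketch; the target URL, check interval and failure threshold here are illustrative assumptions, and the real logic lives in 'engageFailover.sh':

    #!/bin/bash
    # Illustrative sketch only - not the real engageFailover.sh logic.
    TARGET="https://goc.egi.eu/portal"    # assumed production URL
    FAILS=0
    while true; do
        # Any successful check resets the consecutive-failure count
        if curl -ksf -o /dev/null --max-time 30 "$TARGET"; then
            FAILS=0
        else
            FAILS=$((FAILS + 1))
        fi
        # A threshold of 6 misses (~1 hr at a ~10 min interval) is an assumption
        if [ "$FAILS" -ge 6 ]; then
            /root/autoEngageFailover/engageFailover.sh now   # engage the failover
            break
        fi
        sleep 600    # ping-check roughly every 10 mins (cf. pingCheckLog.txt)
    done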

/root/importDBdmpFile/

Contains scripts that fetch the .dmp file and install it into the local Oracle XE instance. The master script is '1_runDbUpdate.sh', which needs to be invoked from an hourly cron:

    # more /etc/cron.hourly/cronRunDbUpdate.sh
    #!/bin/bash

    /root/importDBdmpFile/1_runDbUpdate.sh
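Note that run-parts only executes files in /etc/cron.hourly that are executable, so after placing the wrapper script there, check the permissions:

    chmod +x /etc/cron.hourly/cronRunDbUpdate.sh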

You will also need to:

/root/fetchMariaDBdmpFile/

Contains the failover_fetch.py script and configuration file for executing a remote database dump (using the mysqldump utility) and saving the resulting dump file locally as a timestamped archive file. Run as 'failover_fetch.py --config <config file path>'. If no --config option is specified, the default is ./fetchMariaDBdmpFile/config.ini
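For example, a fetch could be driven from an hourly cron in the same way as the Oracle import above (the wrapper script name here is a hypothetical; the script and config paths are as described):

    # more /etc/cron.hourly/cronFetchMariaDbDump.sh  (hypothetical wrapper)
    #!/bin/bash

    /root/fetchMariaDBdmpFile/failover_fetch.py --config /root/fetchMariaDBdmpFile/config.ini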

/root/importMariaDBdmpFile/

Contains the failover_import.py script and configuration file for fetching a remote database dump (as generated by the mysqldump utility) and loading the dump into the failover DB. Optionally, the dump file can be wrapped as a .zip file archive. Run as 'failover_import.py --config <config file path>'. If no --config option is specified, the default is ./importMariaDBdmpFile/config.ini
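A fetch followed by an import could be chained so the import only runs if the fetch succeeded (paths as above; the chaining itself is an illustrative assumption, not a documented requirement):

    /root/fetchMariaDBdmpFile/failover_fetch.py --config /root/fetchMariaDBdmpFile/config.ini \
        && /root/importMariaDBdmpFile/failover_import.py --config /root/importMariaDBdmpFile/config.ini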

/root/nsupdate_goc/

Contains the nsupdate keys and nsupdate scripts for switching the 'goc.egi.eu' top-level DNS alias to point to either the production instance or the failover.
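A switch of this kind is typically a short nsupdate transaction. A minimal sketch, assuming a hypothetical key file, name server and TTL (the real steps live in scripts such as 'goc_production.sh', used in the walkthrough below):

    #!/bin/bash
    # Sketch: re-point the goc.egi.eu alias at the production host.
    # The key file, name server and TTL below are assumptions, not the real values.
    nsupdate -k /root/nsupdate_goc/update.key <<EOF
    server ns1.egi.eu
    zone egi.eu
    update delete goc.egi.eu. CNAME
    update add goc.egi.eu. 300 CNAME goc.stfc.ac.uk.
    send
    EOF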

/root/archiveDmpDownload/

Contains a script that downloads the dmp file and stores it in the archive/ sub-dir. The script also deletes archived files that are older than 'x' days. This script can be called in a separate process, e.g. from cron.daily, to build a set of backups. The age-based cleanup is sketched below.
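The cleanup step can be done with a find one-liner; a sketch assuming a 30-day retention and the archive path above (the actual 'x' is whatever the script configures):

    # Delete archived dump files older than 30 days (retention value is an assumption)
    find /root/archiveDmpDownload/archive/ -type f -mtime +30 -delete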

Failover Instructions

To start/stop the auto failover service

This will continuously monitor the production instance and engage the failover automatically during prolonged outages.

Run as a service:

    chkconfig --list | grep gocdb-auto
    /sbin/service gocdb-autofailover stop
    /sbin/service gocdb-autofailover start
    /sbin/service gocdb-autofailover status

Directly (not as a service):

    cd /root/autoEngageFailover
    ./gocdb-autofailover.sh {start|stop|restart}

To manually engage the failover immediately

E.g. for known/scheduled outages, run the following, passing 'now' as the first command-line argument:

Stop the service:

    service gocdb-autofailover stop

Or to stop if running manually:

    cd /root/autoEngageFailover
    ./gocdb-autofailover.sh stop

Engage the failover now:

    ./engageFailover.sh now

Restore failover service after failover was engaged

You will need to manually revert the steps executed by the failover so that the DNS points back to the production instance, and then restore/restart the failover process. The walkthrough below covers these steps.

Restore Walkthrough

At the end of the downtime (production instance ready to be restored), first re-point DNS:

    echo We first switch dns to point to production instance
    cd /root/nsupdate_goc
    ./goc_production.sh

Now wait for DNS to settle; this takes approx. 2 hrs, and during this time the goc.egi.eu domain will switch between the failover instance and the production instance. You should monitor this using nslookup:

    nslookup goc.egi.eu
    # check this returns the following output referring to
    # goc.stfc.ac.uk
    Non-authoritative answer:
    goc.egi.eu canonical name = goc.stfc.ac.uk.
    Name: goc.stfc.ac.uk
    Address: 130.246.143.160
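During the settling period the lookup can be repeated automatically, e.g. every 5 minutes (the interval is an arbitrary choice):

    watch -n 300 nslookup goc.egi.eu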

After DNS has become stable, the production instance will be serving requests. Only after this ~2 hr period should the failover service be re-started:

    echo First go check production instance and confirm it is up
    echo running ok and that dns is stable
    rm /root/autoEngageFailover/engage.lock
    mv cronRunDbUpdate.sh /etc/cron.hourly

    # Below server cert change no longer needed as cert contains dual SAN
    # This means a server restart is no longer needed.
    #echo Change server certificate and key back for gocdb.hartree.stfc.ac.uk
    #ln -sf /etc/pki/tls/private/gocdb.hartree.stfc.ac.uk.key.pem /etc/pki/tls/private/hostkey.pem
    #ln -sf /etc/grid-security/gocdb.hartree.stfc.ac.uk.cert.pem /etc/grid-security/hostcert.pem
    #service httpd restart
    service gocdb-autofailover start
    service gocdb-autofailover status
    #  gocdb-autofailover is running...

Now check the '/root/autoEngageFailover/pingCheckLog.txt' and '/root/autoEngageFailover/errorEngageFailover.txt' files to see that the service is running OK and pinging every ~10 mins.
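The most recent activity can be tailed directly, e.g.:

    tail -n 20 /root/autoEngageFailover/pingCheckLog.txt
    tail -n 20 /root/autoEngageFailover/errorEngageFailover.txt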