alces-software / nagios-base

Installation, Sync scripts and Plugins

Suppress multiple alerts when cluster offline #3

Closed sierra-tango-echo closed 6 years ago

sierra-tango-echo commented 6 years ago

Investigate implementing a VPN connectivity check for each cluster. If that check fails, disable or silence all other checks on that cluster, since they will all fail without connectivity.

rossrodwell commented 6 years ago

There are a few ways I can think of to solve this.

1) Modify my stale_data.sh script (which is run when data becomes stale) to check the status of the VPN.

2) Add a VPN 'service' for each VPN connection on the flightcenter gw, and then make every service in the cluster depend on it. (Mark actually pointed out a problem with this, but I'm not sure whether tweaking the freshness values would have resolved it.) In fact, adding these services would duplicate checks that Mark has already defined.

3) Define parent-child relationships in Nagios so that it can use its reachability logic, and then suppress notifications for the UNREACHABLE state. (We should only get an alert when the presently defined VPN objects are in the DOWN state.) As an aside, we could use parent-child relationships within the clusters as well, so now might be an appropriate time to implement this.
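
For illustration, option 3 might look roughly like the fragment below. This is only a sketch and not the configuration actually deployed: the host names `cluster1-vpn` and `cluster1-node01` are hypothetical, and the other directives a complete host definition needs (address, check command, contacts, and so on) are left out. It relies on the standard `parents` and `notification_options` host directives, where leaving `u` out of the options means no notification is sent for UNREACHABLE.

```
# Hypothetical names throughout; a fragment, not a complete host definition.
define host {
    host_name             cluster1-vpn       ; hypothetical VPN object for one cluster
    notification_options  d,r                ; notify on DOWN and on recovery
}

define host {
    host_name             cluster1-node01    ; hypothetical host inside the cluster
    parents               cluster1-vpn       ; only reachable through the VPN object
    notification_options  d,r                ; no "u", so UNREACHABLE is not notified
}
```

With definitions along these lines, when the VPN object goes DOWN the hosts behind it should be classed as UNREACHABLE rather than DOWN, so only the VPN alert is sent.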

I will begin testing. I shall start with 3) in my sandbox area.

rossrodwell commented 6 years ago

I am satisfied that 3) is a workable solution when I extrapolate my test results to a larger scale.

rossrodwell commented 6 years ago

This has now been activated.