Suppress multiple alerts when node offline

sierra-tango-echo commented 6 years ago

Investigate disabling / silencing other checks on a node if the primary UP/DOWN check fails

rossrodwell commented 6 years ago

A solution for this is almost in place; I just have a race condition to deal with.

rossrodwell commented 6 years ago

Looks like we have something that works, although there are two things to note:

1) Modifications to freshness thresholds for service checks are required

The solution depends on modifying the freshness_threshold for service checks. That is:

freshness_threshold for services = freshness_threshold for hosts + 2 * freshness_check_interval

I have this value set under the test conditions.

That is to say, freshness_threshold for (test) services is 540s.

freshness_threshold for hosts is 420s and freshness_check_interval is 60s (for hosts and services).

2) Notifications for Services are still sent when the node comes back online.

Although lots of DOWN notifications for services are no longer being sent out when a host is DOWN, lots of UP notifications are still being sent out once the host becomes available again.
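For reference, the threshold arithmetic from point 1 can be checked with a few lines of shell. This is only a sketch: the variable names are made up for illustration and are not taken from the scripts mentioned in this thread. It assumes the relationship is "host threshold plus two freshness check intervals", which is the reading that matches the quoted figures (420s, 60s, 540s).

```shell
#!/bin/sh
# Values quoted in this thread, in seconds.
# Variable names are illustrative only, not from any rolled-out script.
host_freshness_threshold=420
freshness_check_interval=60   # same interval for hosts and services

# Assumed relationship: service threshold = host threshold + 2 check intervals.
service_freshness_threshold=$(( host_freshness_threshold + 2 * freshness_check_interval ))

echo "service freshness_threshold: ${service_freshness_threshold}s"
```

With this margin, a host's result should go stale roughly two freshness check cycles before its services' results do, which would give the host a chance to be marked DOWN first.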

rossrodwell commented 6 years ago

I am going to activate the solution to issue #3 before my solution to this issue. I think service notifications for hosts may be suppressed as part of the solution to #3; if I confirm this is the case, there will be no need for me to roll out the send_nrdp.sh script that has been written.

rossrodwell commented 6 years ago

I had to roll out my stale_data.sh script for this to work, even with the solution to #3 in place. The script is now in place and is active.

alces-software / nagios-base

Suppress multiple alerts when node offline #2