Nagios Bug: When flapping detection is enabled, not receiving second recovery notification

ddanielr commented 8 years ago

When using the yum-installed Nagios 3.5.1 on CentOS 6.7, I am not receiving the second recovery notification email when flap detection is turned on.

I have tested this manually by starting and stopping a service twice on the target host. I receive the first and second critical notifications and the first recovery notification, but never the second one.

Nagios reports that the host is up with a hard state change, but the logs show it never sends that final notification email. However, if I disable flap detection on that host, the second recovery email is sent without an issue.

Test case: Enable a service check for check_http on a target host using check_by_ssh and check_procs. Ensure that Nagios is currently reporting an OK status for that service check. Stop httpd, wait for the hard state change and the email to be sent, start httpd, wait for the hard state change and the email. Repeat.

Disable flap detection on the host and in nagios.cfg, reload Nagios, and repeat the check. The second recovery email is sent without issue.

My testing configurations are below.

nagios.cfg:

enable_flap_detection=1

target host config:

define host{
    host_name                       target-host     ; The name of this host
    alias
    address
    notifications_enabled           1               ; Host notifications are enabled
    event_handler_enabled           1               ; Host event handler is enabled
    flap_detection_enabled          0               ; Flap detection is enabled
    failure_prediction_enabled      0               ; Failure prediction is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    notification_period             24x7            ; Send host notifications at any time
    check_period                    24x7            ; By default, Linux hosts are checked round the clock
    check_interval                  2               ; Actively check the host every 2 minutes
    retry_interval                  1               ; Schedule host check retries at 1 minute intervals
    max_check_attempts              4               ; Check each Linux host 4 times (max)
    check_command                   check-host-alive ; Default command to check Linux hosts
    notification_interval           99999           ; Never resend notifications
    notification_options            d,u,r           ; Only send notifications for specific host states
    contact_groups                  admins          ; Notifications get sent to the admins by default
    }

check_by_ssh config:

define command{
    command_name    check_http
    command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "check_procs -c 1: -C httpd" -t 20 -q
    }

robcarGIT commented 8 years ago

Did you find a solution? I got the same issue, I have an Ubuntu box, Nagios 3.2.3.

Recovery is working fine unless the service is flapping, in which case no notification is sent. I've waited until the low flap threshold (default 5%) was reached but still got no notifications.

ddanielr commented 8 years ago

nope, we ended up turning off flap detection so we get all the error emails.

It's a huge pain when something, such as a disk space check, is bouncing between states.

I would love a hysteresis option for Nagios checks, so we could alert when a disk space check drops below 20% free and have it stay in warning until it climbs back above 25%.
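For illustration, the hysteresis behavior described here could be sketched roughly as below. This is a minimal Python sketch, not an existing Nagios option or plugin: the function name `next_state`, its parameters, and the sample readings are all hypothetical, and only the 20% and 25% figures come from the comment above.

```python
def next_state(previous_state, free_pct, trip_below=20.0, clear_above=25.0):
    """Return 'WARNING' or 'OK' for a disk-free reading, with hysteresis.

    Hypothetical helper: trip when free space drops below `trip_below`,
    and only clear once it rises above `clear_above`.
    """
    if free_pct < trip_below:
        return "WARNING"
    if free_pct > clear_above:
        return "OK"
    # Inside the 20-25% band: hold whatever state we were already in,
    # so a reading that hovers around one threshold does not bounce.
    return previous_state


# Made-up readings of percent free disk space over successive checks.
state = "OK"
history = []
for reading in [30, 22, 19, 21, 24, 26, 23]:
    state = next_state(state, reading)
    history.append(state)

print(history)
```

With a single 20% threshold the readings 19, 21, 24 would flip the state back to OK right away; with the band, the state stays in warning until a reading goes above 25%.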

ericloyd commented 8 years ago

I'm going to suggest you look at "bischeck" which is an add-on for Nagios that lets you do heuristic warnings based on trends, not just thresholds.

I'm trying to convince the author to do a talk at this year's Nagios World Conference, but his schedule is tight.

http://www.bischeck.org/

In the meantime, you could always use two checks, one that is dependent upon the other with different warning thresholds, and the second one only runs if the first one is in a failure state. Then set notifications on the second and no notifications on the first.

tmcnag commented 8 years ago

Oddly enough, I wrote some documentation for just that project:

https://assets.nagios.com/downloads/nagiosxi/docs/Integrating-the-Bischeck-Plugin-Extension-With-Nagios-XI.pdf

It's written with XI in mind, but most of the steps are not specific to XI/Core and as long as you have NRDP or NSCA on your Nagios server it will work the same.

ericloyd commented 8 years ago

Oddly enough, I've got a source code credit in there somewhere, too (http://www.bischeck.org/?p=173) :-)

I keep trying to get Anders to submit a paper to talk about updated bischeck at this year's conference, but I don't think he has.

ghomem commented 8 years ago

I'm seeing exactly the same issue on Nagios 3.5 on CentOS 6: the final post-flapping recovery notification is not generated. I've checked the Nagios log just to be sure it wasn't the email getting lost. We see "SERVICE ALERT" for the recovery but not "SERVICE NOTIFICATION".

jfrickson commented 8 years ago

Notifications are disabled when a service or host is flapping. When it stops flapping, a Flapping Stop notification is sent out. If you want to see notifications for that, enable it in notification_options. By default, flapping notifications are not turned on for hosts or services.

Given that, I'm closing this issue. If any of you object, post another message here with your rationale.

rahulghanate commented 7 years ago

But after flapping has stopped, it should also send the current status of the host/service. In my case, tickets are created and closed by email. Flapping suppresses notifications after the ticket is created, so the recovery email is never sent. Once flapping stops, it still doesn't send a recovery email, and the ticket stays open until I close it manually. I can enable flapping notifications, but once I get a flapping-stopped notification, I still have to check manually whether the service is still in a critical state or back to normal, and that's overhead. After flapping has stopped, it should send a notification with the current state.

marcvangend commented 5 years ago

I'm a little late to the party here, but I totally agree with @rahulghanate. Maintainers (@jfrickson ?), can you confirm that a flapping-stop notification does not necessarily mean that the service has recovered?

If this is going to be implemented, I guess the logic should be like this: At the moment flapping starts, the previous state (which should be the state for which the last notification was sent) is stored. After flapping has ended, the stored state is compared to the new state. If they are not identical, a notification is sent for the new state.
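A rough sketch of that proposed logic, for discussion only. It is written in Python rather than Core's C, and the class `FlapAwareNotifier`, its method names, and the event sequence are hypothetical; none of this is taken from the Nagios Core source.

```python
class FlapAwareNotifier:
    """Hypothetical model of the proposed behavior; not Nagios Core code."""

    def __init__(self, initial_state="OK"):
        self.last_notified_state = initial_state
        self.stored_state = None   # state remembered when flapping starts
        self.flapping = False
        self.sent = []             # notifications "sent", in order

    def _notify(self, state):
        self.sent.append(state)
        self.last_notified_state = state

    def on_hard_state_change(self, state):
        if self.flapping:
            return  # notifications are suppressed while flapping
        if state != self.last_notified_state:
            self._notify(state)

    def on_flapping_start(self):
        self.flapping = True
        # Store the state for which the last notification was sent.
        self.stored_state = self.last_notified_state

    def on_flapping_stop(self, current_state):
        self.flapping = False
        # Compare the stored state to the new state; notify only if they differ.
        if current_state != self.stored_state:
            self._notify(current_state)
        self.stored_state = None


# The scenario from the original report: two outages, then flapping kicks in.
notifier = FlapAwareNotifier()
notifier.on_hard_state_change("CRITICAL")  # first critical
notifier.on_hard_state_change("OK")        # first recovery
notifier.on_hard_state_change("CRITICAL")  # second critical
notifier.on_flapping_start()
notifier.on_hard_state_change("OK")        # second recovery, suppressed
notifier.on_flapping_stop("OK")            # catch-up notification
print(notifier.sent)
```

In this model the second recovery is delivered when flapping stops, and nothing extra is sent if the state at flapping stop matches the last notified state.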

rahulghanate commented 5 years ago

This sounds like a much better solution. Compare the states before and after and notify accordingly.

fuzzbawl commented 5 years ago

I also agree with @rahulghanate. I would love to see this implemented; we have a huge problem today with this happening. We operate and monitor a very large (over 3600 device) ISP network, and some of our customers have wireless links for backup. During harsh weather, those bounce and enter the flapping state, and it drives us crazy when our techs think the equipment is still down because no final "up" notification was sent to close out the ticket.

agougo commented 4 years ago

I am experiencing the exact same issue. When flapping is detected, the state of the host is not sent as a notification. I rely on critical/OK messages to clear alerts automatically, and this is a pain: if the OK message is not sent, my alert stays there forever and is never closed automatically. Is this being looked at?

sawolf commented 4 years ago

Thanks for commenting here, I hadn't seen this thread before.

jfrickson has a point in that Core is currently working according to documentation, but I agree that a change in behavior makes sense. I'm re-opening this as an enhancement request for now, and I'll see what I can do.

antofthy commented 6 months ago

Going back to the OP...

The default 'high' threshold (the one that turns flapping on) is 20% over the last 21 checks (20 possible transitions), which equals 4 state changes. Per the docs, flapping is enabled when the number of transitions is >= 4. So for a flapping service, assuming it starts at 'ok' (which is reasonable), the notifications will go:

critical, ok, critical, ok (flapping)

That is in a general situation the final 'flapping' reported state is 'ok'!

Which I do not regard as being a 'good' state (in general) to report on.

After that, no more notifications until 21 steady checks have been received! That is, the 20 possible transitions have to fall below the low threshold (default 5%), which means < 1, i.e. 0 transitions in the last 21 checks.
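The arithmetic above can be checked with a small Python sketch. It assumes a plain, unweighted count of transitions over the last 21 checks, as in this comment; Nagios Core's actual calculation may weight state changes differently, and the function name and sample histories are hypothetical.

```python
HIGH_THRESHOLD = 20.0  # percent, default 'high' threshold cited above
LOW_THRESHOLD = 5.0    # percent, default 'low' threshold cited above


def percent_state_change(states):
    """Unweighted percent of possible transitions that actually changed state.

    Assumption: a simple count, not Core's real algorithm.
    """
    transitions = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return 100.0 * transitions / (len(states) - 1)


# 21 checks: steady 'OK', then critical, ok, critical, ok (4 transitions).
flappy = ["OK"] * 17 + ["CRITICAL", "OK", "CRITICAL", "OK"]
starts_flapping = percent_state_change(flappy) >= HIGH_THRESHOLD

# A single transition in 21 checks is exactly 5%, which is not below 5%.
one_change = ["CRITICAL"] + ["OK"] * 20
still_flapping = not (percent_state_change(one_change) < LOW_THRESHOLD)

# Only 21 steady checks (0 transitions) drop below the low threshold.
steady = ["OK"] * 21
stops_flapping = percent_state_change(steady) < LOW_THRESHOLD

print(percent_state_change(flappy), starts_flapping, still_flapping, stops_flapping)
```

Under these assumptions, four transitions reach the 20% threshold, and the service leaves the flapping state only after 21 checks with no transitions at all.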

japc commented 4 months ago

Have this same problem. Would be nice to have it check the last notified state against the state when flapping stops and notify.

NagiosEnterprises / nagioscore

Nagios Bug: When flapping detection is enabled, not receiving second recovery notification #91