ITRS-Group / monitor-merlin

Module for Effortless Redundancy and Loadbalancing In Naemon
https://itrs-group.github.io/monitor-merlin/
GNU General Public License v2.0

Host Notification issue with Loadbalancing #126

Open eschoeller opened 2 years ago

eschoeller commented 2 years ago

I'm opening a new issue for some odd behavior I'm seeing with host notifications. I've seen this behavior occur about four times now, but I haven't been able to replicate it intentionally yet.

I was using escalations for host notifications but removed them in favor of first_notification_delay. Sometimes Naemon/Merlin appears to send out a host notification earlier than it should. Take the following example:

Sep 17 06:31:05 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:05 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:05 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:24 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:24 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:24 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:42 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:42 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:31:42 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:03 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:03 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:03 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:05 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:32:05 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:32:05 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:32:24 node_b naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin1+admin2+admin3_email;pdu-g24;DOWN;dcops_host-notify-by-email;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin1-pager;pdu-g24;DOWN;dcops_host-notify-by-email-pager;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin1-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin2-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_b naemon: HOST NOTIFICATION: admin3-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_a naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:32:24 node_c naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:32:24 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification blocked because first_notification_delay is configured and not enough time has elapsed since the object changed to a non-UP/non-OK state (or since program start).
Sep 17 06:32:27 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Re-notification blocked for this problem because not enough time has passed since last notification.
Sep 17 06:32:27 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Re-notification blocked for this problem because not enough time has passed since last notification.
Sep 17 06:32:27 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Re-notification blocked for this problem because not enough time has passed since last notification.
Sep 17 06:32:51 node_b naemon: HOST ALERT: pdu-g24;UP;HARD;1;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin1+admin2+admin3_email;pdu-g24;UP;dcops_host-notify-by-email;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin1-pager;pdu-g24;UP;dcops_host-notify-by-email-pager;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin1-pushover;pdu-g24;UP;dcops_host-notify-by-pushover;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin2-pushover;pdu-g24;UP;dcops_host-notify-by-pushover;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_b naemon: HOST NOTIFICATION: admin3-pushover;pdu-g24;UP;dcops_host-notify-by-pushover;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_a naemon: HOST ALERT: pdu-g24;UP;HARD;1;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:32:51 node_c naemon: HOST ALERT: pdu-g24;UP;HARD;1;OK - 172.20.9.140: rta 1.302ms, lost 0%
Sep 17 06:32:51 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:33:13 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:33:13 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:33:13 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;1;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:33:27 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:33:27 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:33:27 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notifications blocked for object in a soft state.
Sep 17 06:34:16 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:34:16 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:34:16 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;2;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:35:13 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:35:13 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:35:13 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;3;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:09 node_b naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:09 node_a naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:09 node_c naemon: HOST ALERT: pdu-g24;DOWN;SOFT;4;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:48 node_b naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:48 node_a naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:48 node_b naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification blocked because first_notification_delay is configured and not enough time has elapsed since the object changed to a non-UP/non-OK state (or since program start).
Sep 17 06:36:48 node_a naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:36:48 node_c naemon: HOST ALERT: pdu-g24;DOWN;HARD;5;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:36:48 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification was blocked by a NEB module. Module '/usr/local/merlin/lib/merlin/merlin.so' cancelled notification: 'Notification will be handled by a peer (node_b.colorado.edu)'
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin1+admin2+admin3_email;pdu-g24;DOWN;dcops_host-notify-by-email;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin1-pager;pdu-g24;DOWN;dcops_host-notify-by-email-pager;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin1-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin2-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%
Sep 17 06:39:54 node_b naemon: HOST NOTIFICATION: admin3-pushover;pdu-g24;DOWN;dcops_host-notify-by-pushover;CRITICAL - 172.20.9.140: rta nan, lost 100%

Look particularly at 06:32:24. 'node_b' logs a HARD state and immediately sends a notification. At the same time node_a reports that the notification is blocked by a NEB module and will be handled by node_b (that makes sense). But then node_c reports something entirely different:

Sep 17 06:32:24 node_c naemon: HOST NOTIFICATION SUPPRESSED: pdu-g24;Notification blocked because first_notification_delay is configured and not enough time has elapsed since the object changed to a non-UP/non-OK state (or since program start).

So for some reason node_c realizes first_notification_delay is in play ... but somehow node_a doesn't. I can confirm that all three nodes have the exact same configuration.

Here is the relevant configuration for host checks:

define host {
        active_checks_enabled           1
        check_command                   team_check-host-alive
        check_freshness                 0
        check_period                    team_24x7
        contact_groups                  admin1+admin2+admin3_email
        event_handler_enabled           0
        flap_detection_enabled          1
        max_check_attempts              5
        name                            team_generic_host_template
        notification_interval           2
        notification_options            d,r
        notification_period             team_24x7
        notifications_enabled           1
        first_notification_delay        3
        obsess_over_host                0
        passive_checks_enabled          0
        process_perf_data               1
        register                        0
        retain_nonstatus_information    1
        retain_status_information       1
}
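
If I understand first_notification_delay correctly, the notification at 06:32:24 shouldn't have gone out at all. Here's a rough sanity check of the timing (a minimal sketch; the default interval_length of 60 seconds is an assumption, as is counting the delay from the first non-UP state change, which is what the suppression message suggests):

# Minimal sketch: is the 06:32:24 notification earlier than first_notification_delay allows?
# Assumptions: interval_length = 60s (Naemon default) and the delay is measured
# from the first non-UP state change. The year below is arbitrary for illustration.
from datetime import datetime, timedelta

INTERVAL_LENGTH = 60           # assumed default seconds per "time unit"
FIRST_NOTIFICATION_DELAY = 3   # from the host template above

first_down = datetime(2021, 9, 17, 6, 31, 5)   # first SOFT DOWN
notified   = datetime(2021, 9, 17, 6, 32, 24)  # HARD DOWN + notification from node_b

earliest = first_down + timedelta(seconds=FIRST_NOTIFICATION_DELAY * INTERVAL_LENGTH)
print(earliest)              # 06:34:05 -- earliest time a notification should go out
print(notified >= earliest)  # False -- so the 06:32:24 notification looks premature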
eschoeller commented 2 years ago

My only guess is that the irregularity in behavior (sometimes it works, sometimes it doesn't) might have to do with which node actually executes the host check versus which node generates the notification. I can't say whether it behaves better or worse when one node does both. But I'm having trouble reproducing it reliably.
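
For what it's worth, my mental model of the load balancing is roughly the sketch below (purely illustrative; the hashing scheme is my own assumption, not Merlin's actual algorithm). If ownership really is deterministic like this, all three peers should agree on who owns the notification, which makes the mixed suppression reasons above even stranger:

# Purely illustrative -- not Merlin's real assignment code. The point is only
# that every peer evaluating the same object against the same sorted peer list
# should reach the same "owner".
import hashlib

def owner(object_name: str, peers: list[str]) -> str:
    """Deterministically pick one peer from a sorted peer list."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    peers = sorted(peers)
    return peers[int(digest, 16) % len(peers)]

print(owner("pdu-g24", ["node_a", "node_b", "node_c"]))  # same answer on every node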

jacobbaungard commented 2 years ago

Hi @eschoeller, thanks for sending this in.

I was able to replicate this issue and saw the same behaviour once. However, I also saw a number of other behaviours:

node a: blocked by neb module
node b: blocked due to first_notification_delay
node c: blocked by neb module

and

node a: blocked due to first_notification_delay
node b: blocked due to first_notification_delay
node c: blocked due to first_notification_delay

It's not good that the behaviour is so inconsistent; perhaps there's a race condition somewhere.

If you find a consistent way to replicate this, that would be very helpful. I am not sure how many resources I'll be able to dedicate to fixing this, but I will keep it in mind if I have some spare time.

eschoeller commented 2 years ago

Well, I am both pleased and disappointed that you have been able to see the same behavior. My current workaround was simply to raise max_check_attempts to something much higher (25) so we don't get paged immediately when hosts have a 'hiccup' of sorts, but ultimately it would be nice to have this resolved. At some point I can try reproducing this with just one of the three nodes active, with and without the Merlin module loaded, to see if there's any difference there, and then perhaps with two of the three nodes active. I could also try to parse through some Naemon debug output. Unfortunately, I am transitioning roles at my current employer and my available time to work on this infrastructure may be very limited (depending on which direction we go with 'observability' in general), hence my response on a Sunday!
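
If I do get time to dig into the debug output, I'd probably start with something like the following naemon.cfg settings on all three peers (the path is just an example; if I remember the debug bits right, 32 covers the notification logic and -1 logs everything):

# Example naemon.cfg debug settings for the next reproduction attempt
debug_file=/var/log/naemon/naemon.debug
debug_level=32
debug_verbosity=2
max_debug_file_size=100000000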