Open eschoeller opened 3 years ago
My only guess is that the irregularity in behavior (sometimes it works, sometimes it doesn't) might have to do with which node actually executes the host check and which node is generating the notification. I wouldn't be able to say if it's better when one node does both, or worse. But I'm having trouble reliably reproducing it.
Hi @eschoeller, thanks for sending this in.
I was able to replicate this issue and saw the same behaviour once. However, I also saw several other outcomes:
node a: blocked by neb module
node b: blocked due to first_notification_delay
node c: blocked by neb module
and
node a: blocked due to first_notification_delay
node b: blocked due to first_notification_delay
node c: blocked due to first_notification_delay
It's not good that the behaviour is so inconsistent; perhaps there's a race condition somewhere.
If you find a consistent way to replicate this, that would be very helpful. I am not sure how many resources I'll be able to dedicate to fixing this, but I will keep it in mind if I have some time to spare.
Well, I am both pleased and disappointed that you have been able to see the same behavior. My current workaround is simply to raise max_check_attempts to something much higher (25) so we don't get paged immediately when hosts have a 'hiccup' of sorts. But ultimately it would be nice to have this resolved. I can, at some point, try reproducing this with just 1 of the 3 nodes active, with and without the Merlin module loaded, to see if there's any difference there. Then perhaps with 2 of the 3 nodes active. I could also try parsing through some Naemon debug output. Unfortunately, I am transitioning roles at my current employer, and my available time to work on this infrastructure may be very limited (depending on which direction we go with 'observability' in general), hence my response on a Sunday!
I'm opening a new issue for some odd behavior I'm seeing with host notifications. I've seen this behavior occur about 4 times now, but I haven't yet been able to reproduce it intentionally.
I was using escalations for host notifications but removed them in favor of first_notification_delay. Sometimes it appears that Naemon/Merlin sends out a host notification earlier than it should. Take the following example:
Look particularly at 06:32:24. 'node_b' logs a HARD state and immediately sends a notification. At the same time node_a reports that the notification is blocked by a NEB module and will be handled by node_b (that makes sense). But then node_c reports something entirely different:
So for some reason node_c realizes first_notification_delay is in play ... but somehow node_a doesn't. I can confirm that all three nodes have the exact same configuration.
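For reference, here is a rough Python sketch of the rule I'd expect each node to apply. It is only an illustration, not Naemon's or Merlin's actual code: `HostView`, `problem_start` and `blocked_by_delay` are made-up names, and I'm assuming the delay is counted in time units of `interval_length` (60 seconds by default) from some per-host reference time. Exactly which timestamp Naemon uses as that reference is an assumption here.

```python
from dataclasses import dataclass

# Seconds per "time unit"; 60 is the documented default for interval_length.
INTERVAL_LENGTH = 60


@dataclass
class HostView:
    """One node's view of a host (hypothetical structure, not a Merlin type)."""
    node: str
    problem_start: int               # epoch seconds; assumed reference time for the delay
    first_notification_delay: float  # in time units, as in the host definition
    notifications_sent: int = 0      # problem notifications already sent


def blocked_by_delay(view: HostView, now: int) -> bool:
    """Return True if the first problem notification should still be held back."""
    # The delay only applies to the first notification, and only when it is set.
    if view.notifications_sent > 0 or view.first_notification_delay <= 0:
        return False
    earliest = view.problem_start + view.first_notification_delay * INTERVAL_LENGTH
    return now < earliest


if __name__ == "__main__":
    # Three nodes holding identical state for the same host.
    views = [HostView(n, problem_start=1000, first_notification_delay=5)
             for n in ("node_a", "node_b", "node_c")]
    for v in views:
        print(v.node, "blocked" if blocked_by_delay(v, now=1100) else "would notify")
```

With identical inputs, every node reaches the same answer under this rule, which is why the mixed results above surprise me.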
Here is the relevant configuration for host checks: