Closed bobbysmith007 closed 6 years ago
I found the debug log and relevant messages:
notice/ApiListener: Relaying 'event::SetForceNextNotification' message
[2017-08-08 12:30:42 -0400] notice/Notification: Attempting to send notifications for notification object 'icinga2.acceleration.net!NotificationTest!service-test-recovery'.
[2017-08-08 12:30:42 -0400] debug/Notification: Type 'Recovery', TypeFilter: Acknowledgement and Recovery (FType=64, TypeFilter=80)
[2017-08-08 12:30:42 -0400] debug/Notification: User notification, Type 'Recovery', TypeFilter: Acknowledgement, Custom, DowntimeEnd, DowntimeRemoved, DowntimeStart, FlappingEnd, FlappingStart, Problem and Recovery (FType=64, TypeFilter=80)
[2017-08-08 12:30:42 -0400] notice/Notification: We did not notify user 'russ' for a problem before. Not sending recovery notification.
This seems to have been introduced by the changes in https://github.com/Icinga/icinga2/issues/2197
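For reading the debug lines above: the FType and TypeFilter numbers behave like bit flags, and a notification passes the filter when its type bit is set in the filter. A minimal sketch of that check follows. The value 64 for Recovery comes straight from the log; 16 for Acknowledgement is an assumption inferred from 80 - 64, and the enum and function names are illustrative only, not Icinga's own.

```python
from enum import IntFlag


class NotificationType(IntFlag):
    # Assumed value: 80 - 64, matching "Acknowledgement and Recovery" in the log.
    ACKNOWLEDGEMENT = 16
    # From the log: FType=64 is printed for Type 'Recovery'.
    RECOVERY = 64


def passes_type_filter(ftype: int, type_filter: int) -> bool:
    """Return True when the notification's type bit is set in the filter."""
    return (ftype & type_filter) != 0


# The filter printed in the log, rebuilt from the two assumed flags.
LOGGED_FILTER = NotificationType.ACKNOWLEDGEMENT | NotificationType.RECOVERY
```

With these values, a Recovery (64) passes the logged filter (80), so the type filter is not what blocked the notification; the "did not notify user before" rule is.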
Recoveries will always be sent to users (of a notification) which have been notified about a problem before. That behaviour is intentional.
That's how I would want it, but only one, please. My problem is that I either get two copies of the exact same notification about the service recovery, one from each escalation level (service-test-down-1 and service-test-down-2), or I get none, because service-test-recovery never notified my user about a problem (the other two did).
Perhaps the bug is that it notifies outside of its window (which may be desired for recoveries in some cases, but in an escalation type situation where another notification object is taking over, it seems like we are just going to start getting many extra notifications).
Those are two notification objects (not just one), and each on their own generates a NOT-OK notification event for all assigned users, similar to the OK-Recovery event. This isn't a bug per se.
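The two outcomes described in this thread (two recoveries, or none) can be reproduced with a small model. This is a simplified sketch, not Icinga's implementation: it assumes each notification object keeps its own record of users it has notified about a problem, and sends a recovery only to users in that record. The class and function names are hypothetical, and escalation windows are left out by assuming the service stayed down long enough for every level to fire.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationObject:
    name: str
    types: frozenset
    notified_users: set = field(default_factory=set)

    def send(self, event_type: str, user: str) -> bool:
        """Return True if this object would notify `user` for `event_type`."""
        if event_type not in self.types:
            return False
        if event_type == "Problem":
            # Remember the user so a later recovery is allowed.
            self.notified_users.add(user)
            return True
        if event_type == "Recovery":
            # Recoveries go only to users this object notified before.
            return user in self.notified_users
        return True


def recoveries_after_escalation(objects, user):
    """Fire a problem on every object, then list which ones send a recovery."""
    for obj in objects:
        obj.send("Problem", user)
    return [obj.name for obj in objects if obj.send("Recovery", user)]


ALL = frozenset({"Problem", "Recovery"})

# Both escalation levels handle all types: two recoveries.
both_levels = [
    NotificationObject("service-test-down-1", ALL),
    NotificationObject("service-test-down-2", ALL),
]

# Problems and recoveries split across objects: no recovery at all.
split = [
    NotificationObject("service-test-down-1", frozenset({"Problem"})),
    NotificationObject("service-test-down-2", frozenset({"Problem"})),
    NotificationObject("service-test-recovery", frozenset({"Recovery"})),
]
```

In the split setup, service-test-recovery never sends a problem notification, so its own record of notified users stays empty and the recovery is dropped, which matches the log message about user 'russ'.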
Right, I was hoping there was some piece of configuration I was missing, or a misunderstanding on my part, that someone could help me with.
The first notification stops sending Not-OK messages after its time window, and the second one starts sending "Not-OK" messages with a different configuration (basically the exact escalation example in the docs). When the service recovers, each notification object reports the recovery, even though one of them is outside of its notification window.
So far I can't think of any way to ensure that a user only receives one recovery notification once a service has escalated. If a bunch of things go down at once, getting double the recovery notifications can be a big deal. We want all of our notifications to be timeboxed (so unimportant notifications don't wake folks at 3am) and for things to escalate when they have been down for a while without acknowledgement. Both seemed fairly straightforward from the docs, but this has been pernicious.
Is there some other way to configure escalating notifications that doesn't double-send recoveries?
Alternatively, is there some way to bypass the "don't send recoveries to folks who haven't been notified" restriction?
Hm, without any code changes I would use an external notification proxy which handles such duplicate recovery messages. I don't have any at hand, but for a quick solution this might be better than to look into a behavioural change (where I am not sure if that is even possible).
Recoveries are not suppressed by the times attribute, that's correct. Specifically, escalations in Icinga 2 are just additional notifications, nothing extra.
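As a rough illustration of the statement above, the sketch below treats the times attribute as a window that gates problem notifications only, while recoveries always pass. The function name, the minute-based units, and the inclusive window boundaries are assumptions made for the example; other notification types are outside this sketch.

```python
def passes_times_window(event_type: str, minutes_down: float,
                        begin: float, end: float) -> bool:
    """Model the times window: it gates problems, not recoveries."""
    if event_type == "Recovery":
        # Per the thread: recoveries are not suppressed by `times`.
        return True
    if event_type == "Problem":
        # Assumed inclusive boundaries; the real edge behavior may differ.
        return begin <= minutes_down <= end
    raise ValueError(f"event type not modelled in this sketch: {event_type}")
```

For a first-level notification with a 0 to 30 minute window, a problem at minute 45 is held back, but a recovery at minute 45 still goes out, which is why both escalation levels report the recovery.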
OK, so I have added a function to Icinga to record which notification was most recently sent to our message proxy:
globals.set_escalation = function(hostName, serviceName, escalation) {
  // Remember which notification object most recently sent a problem notification.
  if (!serviceName) {
    var host = get_host(hostName)
    host.vars.current_escalation = escalation
    log(LogInformation, "User", "SETTING Escalation " + hostName + " to " + host.vars.current_escalation)
  } else {
    var service = get_service(hostName, serviceName)
    service.vars.current_escalation = escalation
    log(LogInformation, "User", "SETTING Escalation " + hostName + "!" + serviceName + " to " + service.vars.current_escalation)
  }
  return escalation
}
# used in our notification command:
object NotificationCommand "dispatch-service" {
  import "pdispatch"
  arguments += {
    "--servicename" = "$service.display_name$"
    "--escalation" = {{
      if (macro("$notification.type$") == "PROBLEM") {
        return set_escalation(macro("$host.name$"), macro("$service.name$"), macro("$notification.name$"))
      }
    }}
  }
...
This at least lets me keep the state inside of Icinga, so that my message proxy can be stateless and still get rid of the duplicate notifications. For "OK" messages, the proxy can then check service.vars.current_escalation and match it against notification.name.
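The proxy-side check described above might look like the following. This is only a sketch under assumptions: the function name and parameters are hypothetical, the thread does not say how the proxy reads service.vars.current_escalation, and the "RECOVERY" string is assumed to be what $notification.type$ resolves to for OK messages.

```python
from typing import Optional


def should_forward(notification_type: str, notification_name: str,
                   current_escalation: Optional[str]) -> bool:
    """Decide whether the proxy should pass a notification on."""
    if notification_type != "RECOVERY":
        # Only recoveries are deduplicated; everything else goes through.
        return True
    if current_escalation is None:
        # Nothing recorded yet: fail open rather than drop the recovery.
        return True
    # Forward the recovery only from the level that last sent a problem.
    return notification_name == current_escalation
```

Failing open when no escalation was recorded is a design choice here: a duplicate recovery seems less harmful than a missing one.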
To me this seems like a very standard way of operating, and I am surprised that there is no built-in way to send only a single recovery notification in the face of escalations. Thanks for your feedback. Not sure if there is anything to do there, but at least there is an answer next time it drives someone nuts. Feel free to close if you think that's best.
Fancy config for debugging, I like that a lot 👍
I think you've got a point with escalations here, but I need to think about it in depth to find an answer or solution. I'll leave it open for discussion with devs & community.
In further debugging on this, I found that TimePeriods are respected, and OK notifications are not sent in TimePeriods where they should not be. This seems fairly inconsistent with the time boxing on a single notification (i.e., times.begin and times.end are ignored for recoveries, but TimePeriods are respected). I am just trying to get a consistent picture of when and which notifications will be sent.
Additionally, it seems that in the case of escalations you might be able to get away with sending OKs only at the base level, since they are sent to all levels, so that escalations beyond the first do not send OKs at all. I am a little wary of that and how it relates to TimePeriods. Example: if I escalate from BASE-timeperiod1 to E1-timeperiod1, and subsequently move into a different time period that continues with E1-timeperiod2 messages, then because I never went through BASE-timeperiod2, it seems it would never send an OK at all if E1 notifications do not include OK messages.
So I think the stateful representation makes the most sense.
I would handle those specific filters and aggregations with an external tool. Built-in isn't possible as can be seen in this issue.
Icinga2 version 2.6.3-1
In Icinga 2 monitoring, I want to be able to escalate problem notifications if the service has been down for a certain amount of time, or de-escalate when it stops being business hours. I want to get a single notification when the service comes back up.
When I have both "service-test-down-1" and "service-test-down-2" set to all types and states, I get two "OK" messages when the service becomes OK. When I set it up like below, separating the OK messages and the Not-OK messages, I never get any OKs. I feel like this should be straightforward, but I haven't been able to make any progress.
Relevant Doc Links: https://www.icinga.com/docs/icinga2/latest/doc/03-monitoring-basics/#notification-escalations
Printed from icinga as:
Expected Behavior
Get a single OK notification when a service recovers
Current Behavior
Get two OKs for escalated services, or none if I try to split OKs into a different notification
Steps to Reproduce (for bugs)
I think the configuration above should reproduce the error
Context
I really want to get a single notification about a service recovering
Your Environment
icinga2 feature list:
icinga2 daemon -C: Yep, it's valid
I also asked this on Stack Overflow: https://stackoverflow.com/questions/45572450/how-to-get-a-single-acknowledgement-ok-notification-in-icinga2-from-escalating