Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

How to get a single acknowledgement / OK notification in icinga2 from escalating notification templates? #5478

Closed: bobbysmith007 closed this issue 6 years ago

bobbysmith007 commented 7 years ago

Icinga2 version 2.6.3-1

In Icinga 2 monitoring, I want to be able to escalate problem notifications if the service has been down for a certain amount of time, or de-escalate when it stops being business hours. I want to get a single notification when the service comes back up.

When I have both "service-test-down-1" and "service-test-down-2" set to all types and states, I get two "OK" messages when the service becomes OK. When I set it up as below, separating the OK messages from the Not-OK messages, I never get any OKs. I feel like this should be straightforward, but I haven't been able to make any progress.

Relevant doc link: https://www.icinga.com/docs/icinga2/latest/doc/03-monitoring-basics/#notification-escalations

apply Notification "service-test-down-1" to Service {
  command = "dispatch-service"
  states = [ Warning, Critical, Unknown ]
  types = [ Problem, Custom, FlappingStart, FlappingEnd,
            DowntimeStart, DowntimeEnd, DowntimeRemoved ]
  users = ["russ"]
  period = "24x7"
  assign where "tests" in service.groups
  vars.priority = "medium"
  times.begin = 0m // first escalation level: active from 0 to 3 minutes after the problem starts
  times.end = 3m
  interval = 1m    // re-notify every minute within that window
}

apply Notification "service-test-down-2" to Service {
  command = "dispatch-service"
  states = [ Warning, Critical, Unknown ]
  types = [ Problem, Custom, FlappingStart, FlappingEnd,
            DowntimeStart, DowntimeEnd, DowntimeRemoved ]
  period = "24x7"
  users = ["russ"]
  assign where "tests" in service.groups
  vars.priority = "medium"
  times.begin = 3m // second escalation level: takes over from 3 minutes to 12 hours
  times.end = 12h
  interval = 2m    // re-notify every two minutes within that window
}
apply Notification "service-test-recovery" to Service {
  command = "dispatch-service"
  states = [ OK ]
  types = [ Acknowledgement, Recovery ]
  users = ["russ"]
  period = "24x7"
  vars.priority = "medium"
  assign where "tests" in service.groups
  interval = 0
}
apply Service "NotificationTest" {
  enable_active_checks = true
  check_command = "passive"

  ignore where host.vars.noservices == true
  groups += ["tests"]
  assign where host.name == "icinga2.acceleration.net"
  max_check_attempts = 5
  check_interval = 5m
  retry_interval = 5m
}

Printed from Icinga as:

~# icinga2 object list --name service-test-* 
Object 'icinga2.acceleration.net!NotificationTest!service-test-down-1' of type 'Notification':
  % declared in '/opt/icinga2lib/lib.conf.d//test.conf', lines 2:1-2:51
  * __name = "icinga2.acceleration.net!NotificationTest!service-test-down-1"
  * command = "dispatch-service"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 3:3-3:30
  * command_endpoint = ""
  * host_name = "icinga2.acceleration.net"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 2:1-2:51
  * interval = 60
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 13:3-13:15
  * name = "service-test-down-1"
  * package = "_etc"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 2:1-2:51
  * period = "24x7"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 8:3-8:17
  * service_name = "NotificationTest"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 2:1-2:51
  * states = [ "Warning", "Critical", "Unknown" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 4:3-4:41
  * templates = [ "service-test-down-1" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 2:1-2:51
  * times
    * begin = 0
      % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 11:3-11:18
    * end = 180
      % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 12:3-12:16
  * type = "Notification"
  * types = [ "Problem", "Custom", "FlappingStart", "FlappingEnd", "DowntimeStart", "DowntimeEnd", "DowntimeRemoved" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 5:3-6:57
  * user_groups = null
  * users = [ "russ" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 7:3-7:18
  * vars
    * priority = "medium"
      % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 10:3-10:26
  * zone = ""

Object 'icinga2.acceleration.net!NotificationTest!service-test-down-2' of type 'Notification':
  % declared in '/opt/icinga2lib/lib.conf.d//test.conf', lines 16:1-16:51
  * __name = "icinga2.acceleration.net!NotificationTest!service-test-down-2"
  * command = "dispatch-service"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 17:3-17:30
  * command_endpoint = ""
  * host_name = "icinga2.acceleration.net"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 16:1-16:51
  * interval = 120
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 27:3-27:15
  * name = "service-test-down-2"
  * package = "_etc"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 16:1-16:51
  * period = "24x7"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 21:3-21:17
  * service_name = "NotificationTest"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 16:1-16:51
  * states = [ "Warning", "Critical", "Unknown" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 18:3-18:41
  * templates = [ "service-test-down-2" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 16:1-16:51
  * times
    * begin = 180
      % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 25:3-25:18
    * end = 43200
      % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 26:3-26:17
  * type = "Notification"
  * types = [ "Problem", "Custom", "FlappingStart", "FlappingEnd", "DowntimeStart", "DowntimeEnd", "DowntimeRemoved" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 19:3-20:57
  * user_groups = null
  * users = [ "russ" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 22:3-22:18
  * vars
    * priority = "medium"
      % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 24:3-24:26
  * zone = ""

Object 'icinga2.acceleration.net!NotificationTest!service-test-recovery' of type 'Notification':
  % declared in '/opt/icinga2lib/lib.conf.d//test.conf', lines 29:1-29:53
  * __name = "icinga2.acceleration.net!NotificationTest!service-test-recovery"
  * command = "dispatch-service"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 30:3-30:30
  * command_endpoint = ""
  * host_name = "icinga2.acceleration.net"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 29:1-29:53
  * interval = 0
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 37:3-37:14
  * name = "service-test-recovery"
  * package = "_etc"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 29:1-29:53
  * period = "24x7"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 34:3-34:17
  * service_name = "NotificationTest"
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 29:1-29:53
  * states = [ "OK" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 31:3-31:17
  * templates = [ "service-test-recovery" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 29:1-29:53
  * times = null
  * type = "Notification"
  * types = [ "Acknowledgement", "Recovery" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 32:3-32:39
  * user_groups = null
  * users = [ "russ" ]
    % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 33:3-33:18
  * vars
    * priority = "medium"
      % = modified in '/opt/icinga2lib/lib.conf.d//test.conf', lines 35:3-35:26
  * zone = ""

Expected Behavior

Get a single OK notification when a service recovers

Current Behavior

Get two OKs for escalated services, or none if I try to split OKs into a separate notification object

Steps to Reproduce (for bugs)

I think the configuration above should reproduce the error

Context

I really want to get a single notification about a service recovering

Your Environment

I also asked this on Stack Overflow: https://stackoverflow.com/questions/45572450/how-to-get-a-single-acknowledgement-ok-notification-in-icinga2-from-escalating

bobbysmith007 commented 7 years ago

I found the debug log and relevant messages:

notice/ApiListener: Relaying 'event::SetForceNextNotification' message
[2017-08-08 12:30:42 -0400] notice/Notification: Attempting to send  notifications for notification object 'icinga2.acceleration.net!NotificationTest!service-test-recovery'.
[2017-08-08 12:30:42 -0400] debug/Notification: Type 'Recovery', TypeFilter: Acknowledgement and Recovery (FType=64, TypeFilter=80)
[2017-08-08 12:30:42 -0400] debug/Notification: User notification, Type 'Recovery', TypeFilter: Acknowledgement, Custom, DowntimeEnd, DowntimeRemoved, DowntimeStart, FlappingEnd, FlappingStart, Problem and Recovery (FType=64, TypeFilter=80)
[2017-08-08 12:30:42 -0400] notice/Notification: We did not notify user 'russ' for a problem before. Not sending recovery notification.

This seems to have been generated by the changes in https://github.com/Icinga/icinga2/issues/2197.

dnsmichi commented 7 years ago

Recoveries will always be sent to users (of a notification) who have been notified about a problem before. That behaviour is intentional.

bobbysmith007 commented 7 years ago

That's how I would want it, but only one copy, please. My problem is that I either get two copies of the exact same notification about the service recovery, one from each escalation level (service-test-down-1 and service-test-down-2), or I get none, because service-test-recovery never notified my user (the other two did).

Perhaps the bug is that it notifies outside of its window (which may be desired for recoveries in some cases, but in an escalation-type situation where another notification object is taking over, it seems like we are just going to get many extra notifications).

dnsmichi commented 7 years ago

Those are two notification objects (not just one), and each of them generates its own NOT-OK notification event for all assigned users, and likewise its own OK/Recovery event. This isn't a bug per se.

bobbysmith007 commented 7 years ago

Right, I was hoping there was some piece of configuration I was missing, or a misunderstanding on my part that someone could help me with.

The first notification stops sending Not-OK messages after its time window, and the second one starts sending Not-OK messages with a different configuration (basically the exact escalation example from the docs). When the service recovers, each notification object reports the recovery, even though one of them is outside its notification window.

So far I can't think of any way to ensure that a user receives only one recovery notification once a service has escalated. If a bunch of things go down at once, getting double the recovery notifications can be a big deal. We want all of our notifications to be timeboxed (so unimportant notifications don't wake folks at 3am) and for things to escalate when they have been down for a while without acknowledgement. Both of these seemed fairly straightforward from the docs, but this has been pernicious.

Is there some other way to configure escalating notifications that doesn't double-send recoveries?

Alternatively, is there some way to bypass the "don't send recoveries to users who haven't been notified" restriction?

dnsmichi commented 7 years ago

Hm, without any code changes I would use an external notification proxy which handles such duplicate recovery messages. I don't have one at hand, but as a quick solution this might be better than looking into a behavioural change (where I am not sure whether that is even possible).

Recoveries are not suppressed by the times attribute, that's correct. Specifically, escalations in Icinga 2 are just additional notifications, nothing extra.

bobbysmith007 commented 7 years ago

OK, so I have added a couple of functions to Icinga to record which notification was most recently sent to our message proxy:

globals.set_escalation = function(hostName, serviceName, escalation) {
    if (!serviceName) {
        var host = get_host(hostName)
        host.vars.current_escalation = escalation
        log(LogInformation, "User", "SETTING Escalation " + hostName + " to " + host.vars.current_escalation)
    } else {
        var service = get_service(hostName, serviceName)
        service.vars.current_escalation = escalation
        log(LogInformation, "User", "SETTING Escalation " + hostName + "!" + serviceName + " to " + service.vars.current_escalation)
    }

    return escalation
}

# Used in our notification command:
object NotificationCommand "dispatch-service" {
  import "pdispatch"

  arguments += {
    "--servicename" = "$service.display_name$"
    "--escalation" = {{
      // Record which notification object fired, but only for problem notifications
      if (macro("$notification.type$") == "PROBLEM") {
        return set_escalation(macro("$host.name$"), macro("$service.name$"), macro("$notification.name$"))
      }
    }}
  }
...

This at least lets me keep the state inside Icinga so that my message proxy can be stateless and still get rid of the duplicate notifications. My message proxy can then check service.vars.current_escalation and match it against notification.name for "OK" messages.
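
For completeness, here is a rough sketch of how the recovery side could hand that stored state to the proxy, so it can compare it against the notification that actually fired. The --current-escalation argument is hypothetical (not part of my config above), added to the same arguments block:

  // Hypothetical companion argument: on recovery, pass along the last
  // escalation level recorded by set_escalation() so the proxy can
  // drop duplicate OK messages itself.
  "--current-escalation" = {{
    if (macro("$notification.type$") == "RECOVERY") {
      var svc = get_service(macro("$host.name$"), macro("$service.name$"))
      if (svc) {
        return svc.vars.current_escalation
      }
    }
  }}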

To me this seems like a very standard way of operating, and I am surprised that there is no built-in way to send only a single recovery notification in the face of escalations. Thanks for your feedback. Not sure if there is anything to do here, but at least there is an answer the next time it drives someone nuts. Feel free to close if you think that's best.

dnsmichi commented 7 years ago

Fancy config for debugging, I like that a lot 👍

I think you've got a point with escalations here, but I need to think about it in depth to find an answer or solution. I'll leave it open for discussion with devs & community.

bobbysmith007 commented 7 years ago

In further debugging on this, I found that TimePeriods are respected, and OK notifications are not sent in TimePeriods where they should not be. This seems fairly inconsistent with the time boxing on a single notification (i.e. times.begin and times.end are ignored for recoveries, but TimePeriods are respected). I am just trying to get a consistent picture of when and which notifications will be sent.
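
A sketch of the inconsistency as I observed it ("business-hours" is just an assumed TimePeriod name for illustration):

apply Notification "period-vs-times-example" to Service {
  command = "dispatch-service"
  users = ["russ"]
  period = "business-hours" // respected: recoveries outside this TimePeriod are suppressed
  times.begin = 3m          // ignored for recoveries: OKs are sent even outside this window
  times.end = 12h
  assign where "tests" in service.groups
}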


Additionally, it seems that in the case of escalations you might be able to get away with sending OKs only from the base level, since recoveries are sent to all levels anyway, such that escalations beyond the first do not send OKs at all (sketched below). I am a little wary of that and how it relates to TimePeriods. Example: if I escalate from BASE-timeperiod1 to E1-timeperiod1, and subsequently move into a different time period that continues with E1-timeperiod2 messages, then because I never went through BASE-timeperiod2, it seems like no OK would ever be sent if E1 notifications do not include OK messages.
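
A rough sketch of that "OKs only at the base level" variant, using the names from my example above (attributes trimmed for brevity; this is the approach I am wary of):

apply Notification "BASE-timeperiod1" to Service {
  command = "dispatch-service"
  users = ["russ"]
  types = [ Problem, Recovery ] // only the base level sends OKs
  times.begin = 0m
  times.end = 3m
  assign where "tests" in service.groups
}
apply Notification "E1-timeperiod1" to Service {
  command = "dispatch-service"
  users = ["russ"]
  types = [ Problem ]           // escalation levels never send OKs
  times.begin = 3m
  assign where "tests" in service.groups
}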

So I think the stateful representation makes the most sense.

dnsmichi commented 6 years ago

I would handle those specific filters and aggregations with an external tool. Handling this built-in isn't possible, as can be seen in this issue.