Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Flapping and Downtime notifications don't respect `states` and `times` configuration #9842

Open davidwinterstein opened 1 year ago

davidwinterstein commented 1 year ago

Describe the bug

Downtimes

Downtime notifications don't seem to respect either the states or the times configuration of the notification object.

A notification object that has

will immediately trigger a notification when a Downtime starts, ends or is removed and the service state is OK, regardless of when the Downtime is created or ends.

Flapping

Flapping notifications don't seem to respect the notification object's times configuration.

A notification object that has

will immediately trigger a notification when the service is considered flapping.

To Reproduce

  1. Create a service whose state you can easily control (e.g. a custom script that reads the return code from a config file and immediately exits with that code) and that has enable_flapping = true.
  2. Create the previously mentioned notification objects assigned to the service.
  3. Downtime:
    • Create a Downtime on the service while it is in OK state (e.g. via Icinga Web 2).
    • Immediately receive a notification for the DowntimeStart.
    • Remove the Downtime on the service.
    • Immediately receive a notification for the DowntimeEnd.
  4. Flapping:
    • Let the service flap between OK and Warning states until it is considered flapping.
    • Immediately receive a notification for the FlappingStart.
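
For step 1, a controllable check could look roughly like the sketch below. It is only an illustration: the function name `controlled_check` and the convention of passing the state file as the first argument are assumptions, not part of the original setup. The script reads a plugin exit code (0 to 3) from the file and exits with it.

```shell
#!/bin/sh
# Hypothetical test plugin: exits with whatever code is stored in a state file.
# Usage (assumed): controlled_check.sh /path/to/state-file

controlled_check() {
    cc_state_file="$1"

    # A missing or unreadable file is reported as UNKNOWN (3).
    if [ ! -r "$cc_state_file" ]; then
        echo "UNKNOWN - cannot read $cc_state_file"
        return 3
    fi

    # Take the first line and strip all whitespace.
    cc_code="$(head -n 1 "$cc_state_file" | tr -d '[:space:]')"

    case "$cc_code" in
        0) echo "OK - state file requests OK"; return 0 ;;
        1) echo "WARNING - state file requests WARNING"; return 1 ;;
        2) echo "CRITICAL - state file requests CRITICAL"; return 2 ;;
        *) echo "UNKNOWN - unexpected value '$cc_code' in $cc_state_file"; return 3 ;;
    esac
}

# Act as a plugin only when a state file argument was passed.
if [ "$#" -gt 0 ]; then
    controlled_check "$1"
    exit $?
fi
```

Alternating the file content between `0` and `1` from one check run to the next makes the service flap between OK and Warning.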

Expected behavior

A notification should only be sent for Downtime events if the service state matches the notification object's states configuration.

While the service is OK, I do not wish to receive a notification if a Downtime starts or ends for the service. The purpose of creating a Downtime on the fly is to avoid a (problem) notification, e.g. during a planned reboot.
I do however wish to receive a DowntimeStart notification without any delay if a Problem notification has been sent for the service before, since it indicates that someone else is working on the problem already and is therefore comparable to an Acknowledged notification.
I also wish to receive a DowntimeEnd or DowntimeRemoved notification when the service is not OK at the time the Downtime ends or is removed ~and continues to be not OK for the duration of times.begin (or maybe another optionally configurable delay)~ since it indicates that the person who created the Downtime no longer seems to be working on the problem (or they should have extended the Downtime).

A notification should not immediately be sent for Flapping events but after a configurable duration.

I do not wish to receive a notification as soon as a service is considered flapping but I wish to receive one once it has been flapping for the percentage configured with flapping_threshold_low/flapping_threshold_high over the duration of times.begin (or maybe another optionally configurable delay).
For example, I do not wish to trigger a phone call in the middle of the night because an HTTP check is flapping for 5 minutes, but I do wish to trigger one if it is flapping for more than half an hour.
Maybe it would be enough to make the number of events considered for flapping detection configurable, although the events are rather random and depend entirely on the check interval. Alternatively, a combination of a duration and a number of events could be set per service if a notification delay is not desired.

Your Environment


$ icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.14.0-1)

Copyright (c) 2012-2023 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Debian GNU/Linux
  Platform version: 11 (bullseye)
  Kernel: Linux
  Kernel version: 5.10.0-23-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 10.2.1
  Build host: runner-hh8q3bz2-project-575-concurrent-0
  OpenSSL version: OpenSSL 1.1.1n  15 Mar 2022

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 journald livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-mysql influxdb mainlog notification
Icinga Web 2 Version: 2.11.4
Git commit: 11453bfa92a70a44efbf7f966f5e7f27e9300a28
Git commit date: 2023-01-26
PHP Version: 8.0.29
$ icinga2 daemon -C
[2023-08-07 12:52:18 +0200] information/cli: Icinga application loader (version: r2.14.0-1)
[2023-08-07 12:52:18 +0200] information/cli: Loading configuration file(s).
[2023-08-07 12:52:19 +0200] information/ConfigItem: Committing config item(s).
[2023-08-07 12:52:19 +0200] information/ApiListener: My API identity: icinga-master.cmpsrv.com
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 6 NotificationCommands.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 109299 Notifications.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 437 Hosts.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 1092 Downtimes.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 9 Comments.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 439 Zones.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 437 Endpoints.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 5 ApiUsers.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 262 CheckCommands.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 4 TimePeriods.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 2 UserGroups.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 10 Users.
[2023-08-07 12:52:25 +0200] information/ConfigItem: Instantiated 20979 Services.
[2023-08-07 12:52:25 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2023-08-07 12:52:25 +0200] information/cli: Finished validating the configuration file(s).

julianbrost commented 1 year ago

states and times only affect state notifications at the moment.

> I do however wish to receive a DowntimeStart notification without any delay if a Problem notification has been sent for the service before, since it indicates that someone else is working on the problem already and is therefore comparable to a Recovery notification.

This sounds more like a use case for acknowledgements to me, not for downtimes.

> I also wish to receive a DowntimeEnd or DowntimeRemoved notification when the service is not OK while the Downtime is removed and continues to be not OK for the duration of times.begin (or maybe another optionally configurable delay) since it indicates that the person who created the Downtime does not seem to still work on the problem (or they should have extended the Downtime).

How would this extra delay differ from just making the downtime longer?

davidwinterstein commented 11 months ago

> states and times only affect state notifications at the moment.

Is there a plan to include flapping and downtime notifications in the future? The current implementation renders those quite unusable for our environment.

> > I do however wish to receive a DowntimeStart notification without any delay if a Problem notification has been sent for the service before, since it indicates that someone else is working on the problem already and is therefore comparable to a Recovery notification.
>
> This sounds more like a use case for acknowledgements to me, not for downtimes.

Maybe I am using this wrong, but the main difference between a downtime and an acknowledgement seems to me to be that an acknowledgement is automatically removed when the check recovers while a downtime is not.
So if I am working on a problem and I know that the check might recover and break again a couple of times, I am using a downtime.
Is there a way to preserve an acknowledgement for the specified acknowledgement duration even when the check recovers? The sticky flag for acknowledgements does not currently do this for me.

> > I also wish to receive a DowntimeEnd or DowntimeRemoved notification when the service is not OK while the Downtime is removed and continues to be not OK for the duration of times.begin (or maybe another optionally configurable delay) since it indicates that the person who created the Downtime does not seem to still work on the problem (or they should have extended the Downtime).
>
> How would this extra delay differ from just making the downtime longer?

Now that I'm thinking about it, the delay is not really required (I edited the issue to reflect this). The important part is to only receive a DowntimeEnd or DowntimeRemoved notification when the service is not OK upon the downtime ending.

Al2Klimov commented 4 months ago

> The important part is to only receive a DowntimeEnd or DowntimeRemoved notification when the service is not OK upon the downtime ending.

Could you pass the current service state to your Downtime notification script and skip if OK?
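
A minimal sketch of such a filter, assuming the NotificationCommand exports the runtime macros `$notification.type$` and `$service.state$` to the script as environment variables. The variable names `NOTIFICATION_TYPE` and `SERVICE_STATE`, the function name, and the wrapper approach are hypothetical. The type and state strings are compared case-insensitively because their exact spelling should be checked against the actual setup.

```shell
#!/bin/sh
# Hypothetical wrapper around an existing notification script.
# Assumed environment (names are illustrative, not defined by Icinga):
#   NOTIFICATION_TYPE  <- $notification.type$
#   SERVICE_STATE      <- $service.state$

# Returns 0 (skip) for downtime notifications on a service that is OK,
# and 1 (send) for everything else.
should_skip_notification() {
    ssn_type="$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
    ssn_state="$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')"

    case "$ssn_type" in
        DOWNTIMESTART|DOWNTIMEEND|DOWNTIMEREMOVED)
            if [ "$ssn_state" = "OK" ]; then
                return 0
            fi
            ;;
    esac
    return 1
}

# Only act when invoked as a notification command, i.e. with the type set.
if [ -n "${NOTIFICATION_TYPE:-}" ]; then
    if should_skip_notification "$NOTIFICATION_TYPE" "${SERVICE_STATE:-}"; then
        exit 0   # drop the notification silently
    fi
    # Otherwise hand over to the wrapped notification script, if one was given.
    if [ "$#" -gt 0 ]; then
        exec "$@"
    fi
fi
```

The wrapper would be called with the original script and its arguments, for example the stock `mail-service-notification.sh`, so that all other notification types pass through unchanged.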

davidwinterstein commented 3 months ago

Sorry for the late answer.

I guess that would work. Currently I use the default mail notification scripts provided by Icinga, so I'll have to adjust those. But it's a good enough workaround, I guess. It would still be great to be able to configure this through Icinga in the long run.