We use scheduled downtimes on several of our checks to suppress alerts when the underlying systems are shut down during nights and weekends. Now and then, several of them skip the next period and schedule the one after that instead, which triggers a lot of alerts. It is rarely just one of them; usually three to five out of perhaps twenty are affected.
An example of such a configuration looks like this:
zones.d/master/scheduled_downtime_apply.conf
    apply ScheduledDowntime "<name of downtime>" to Service {
      author = "<author>"
      comment = "The service is shutdown between 23:50 and 06:30"
      fixed = true
      assign where service.name == "<service name>"
      ranges = {
        "friday" = "00:00-06:30,23:50-24:00"
        "monday" = "00:00-06:30,23:50-24:00"
        "saturday" = "00:00-24:00"
        "sunday" = "00:00-24:00"
        "thursday" = "00:00-06:30,23:50-24:00"
        "tuesday" = "00:00-06:30,23:50-24:00"
        "wednesday" = "00:00-06:30,23:50-24:00"
      }
    }
I do not have debug logging enabled in production, but I have managed to reproduce this in our test environment by using a lot of time periods per day in the ranges. See the example in the reproduction section below.
Reproduction and example from our test environment
Create a scheduled downtime with a lot of ranges. My example looks like this:
    apply ScheduledDowntime "Downtime bug tester touch 2" to Service {
      author = "<author>"
      comment = "This is just to test the rescheduling bug"
      fixed = true
      assign where service.name == "Downtime Alert" && host.name == "icinga1.test.<domain>"
      ranges = {
        "friday" = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "monday" = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "saturday" = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "sunday" = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "thursday" = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "tuesday" = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "wednesday" = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
      }
    }
Create a passive service (named Downtime Alert in my example) and put it in CRITICAL state. The notification should be suppressed due to the scheduled downtime.
Wait until the service is not in scheduled downtime and a notification is sent.
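For anyone who wants to repeat the passive-service step, here is a minimal sketch of how the CRITICAL result could be submitted through the Icinga 2 REST API action `process-check-result`. This is not what I necessarily ran myself; the helper names, the API user, the password, the CA file path and port 5665 (the API default) are assumptions for illustration only.

```python
import base64
import json
import ssl
import urllib.request


def build_check_result(host, service, exit_status=2,
                       output="Forced CRITICAL for downtime test"):
    """Build the JSON body for POST /v1/actions/process-check-result.

    For services, exit_status 2 means CRITICAL.
    """
    return {
        "type": "Service",
        "filter": "host.name == host_name && service.name == service_name",
        "filter_vars": {"host_name": host, "service_name": service},
        "exit_status": exit_status,
        "plugin_output": output,
    }


def send_check_result(base_url, user, password, body, ca_file):
    """Send the body to the API (hypothetical credentials and CA file)."""
    ctx = ssl.create_default_context(cafile=ca_file)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        base_url + "/v1/actions/process-check-result",
        data=json.dumps(body).encode(),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)
```

Usage would be along the lines of `send_check_result("https://localhost:5665", "<api user>", "<password>", build_check_result("icinga1.test.<domain>", "Downtime Alert"), "<path to ca.crt>")`, with the placeholders filled in for the environment at hand.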
Today, a Thursday, my test setup was supposed to schedule the next downtime at 10:00 CET. Instead it skipped the rest of the Thursday range and added the first Friday segment, starting at midnight, as the next downtime. This triggered the notification. The created downtime looked like this:
So it was created at 2023-08-31 10:00:00, but did not cover any of these periods in the Thursday range: 10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00. Instead it has the start time 2023-09-01 00:00:00 and end time 2023-09-01 06:00:00, the first from the Friday range.
Expected behaviour
The downtime for the immediately following range should be created, not the one after it.
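To make the expectation concrete, here is a small sketch of the behaviour I would expect. It is not Icinga's actual scheduling code, only an illustration. It assumes naive local timestamps (no DST handling) and that a segment starting exactly at the current time counts as the next one; `parse_segments` and `next_segment` are made-up helper names.

```python
from datetime import datetime, timedelta

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]


def parse_segments(day_start, spec):
    """Turn "HH:MM-HH:MM,..." into (start, end) datetimes for one day."""
    segments = []
    for part in spec.split(","):
        begin, end = part.split("-")
        bh, bm = map(int, begin.split(":"))
        eh, em = map(int, end.split(":"))
        # timedelta handles "24:00" by rolling over to the next midnight
        segments.append((day_start + timedelta(hours=bh, minutes=bm),
                         day_start + timedelta(hours=eh, minutes=em)))
    return segments


def next_segment(ranges, now):
    """Return the first segment starting at or after `now`, up to a week ahead."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    for offset in range(8):
        day_start = midnight + timedelta(days=offset)
        spec = ranges.get(DAYS[day_start.weekday()])
        if not spec:
            continue
        for start, end in sorted(parse_segments(day_start, spec)):
            if start >= now:
                return start, end
    return None


# Shortened version of the test ranges above
ranges = {
    "thursday": "09:00-10:00,10:00-11:00,11:00-11:30,20:00-24:00",
    "friday": "00:00-06:00,06:00-07:00",
}
print(next_segment(ranges, datetime(2023, 8, 31, 10, 0)))
# expected: 2023-08-31 10:00 to 2023-08-31 11:00, not the Friday 00:00-06:00 segment
```

With these assumptions, the Friday 00:00-06:00 segment should only be picked once no Thursday segment starts at or after the current time, for example at 20:30 on Thursday.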
My environment
Version used (icinga2 --version): r2.14.0-1
Operating System and version: CentOS 7, kernel version 3.10.0-1127.el7.x86_64
Enabled features (icinga2 feature list): api checker debuglog icingadb mainlog notification
Icinga Web 2 version and modules (System - About): Icinga Web 2 2.11.4, php-library 0.12.0, php-thirdparty 0.11.0, director 1.10.2, icingadb module 1.0.2, incubator 0.20.0. IcingaDB 1.1.1, IcingaDB Redis 7.0.5.
I cannot submit the whole debug log due to sensitive data, but here is an excerpt of it containing the debug/ScheduledDowntime entries.
debug_log_bug_caught.log.gz
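For others who want to produce a comparable excerpt, a rough sketch of how the debug log can be reduced to the ScheduledDowntime entries is shown below. The function names are made up for illustration, and the matching is a plain substring test on `debug/ScheduledDowntime`. The remaining lines can still contain host and service names, so they need a manual review before sharing.

```python
import gzip


def scheduled_downtime_lines(lines, marker="debug/ScheduledDowntime"):
    """Yield only the lines that contain the given marker."""
    for line in lines:
        if marker in line:
            yield line.rstrip("\n")


def filter_log(src_path, dst_path, marker="debug/ScheduledDowntime"):
    """Copy matching lines from a plain-text log into a gzip-compressed file."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in scheduled_downtime_lines(src, marker):
            dst.write(line + "\n")
```

A call such as `filter_log("/var/log/icinga2/debug.log", "excerpt.log.gz")` would do it, assuming the debuglog feature writes to its default location; the output file name is arbitrary.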