Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

ScheduleDowntime skips time periods #9858

Open minatoyama opened 1 year ago

minatoyama commented 1 year ago

Description

We use scheduled downtimes on several of our checks to suppress alerts while the underlying systems are shut down during nights and weekends. Now and then, several of them skip the next period and schedule the one after it instead, which triggers a lot of alerts. It is rarely just one of them; typically three to five out of roughly twenty are affected.

An example of such configuration looks like this:

zones.d/master/scheduled_downtime_apply.conf
apply ScheduledDowntime "<name of downtime>" to Service {
    author = "<author>"
    comment = "The service is shutdown between 23:50 and 06:30"
    fixed = true
    assign where service.name == "<service name>"
    ranges = {
        "friday"    = "00:00-06:30,23:50-24:00"
        "monday"    = "00:00-06:30,23:50-24:00"
        "saturday"  = "00:00-24:00"
        "sunday"    = "00:00-24:00"
        "thursday"  = "00:00-06:30,23:50-24:00"
        "tuesday"   = "00:00-06:30,23:50-24:00"
        "wednesday" = "00:00-06:30,23:50-24:00"
    }
} 

I do not have debug logging enabled in production, but I have managed to reproduce this in our test environment by using a lot of time periods per day in the ranges. See the example under "Reproduction and example from our test environment".

Reproduction and example from our test environment

Create a scheduled downtime with a lot of ranges. My example looks like this:

apply ScheduledDowntime "Downtime bug tester touch 2" to Service {
    author = "<author>"
    comment = "This is just to test the rescheduling bug"
    fixed = true
    assign where service.name == "Downtime Alert" && host.name == "icinga1.test.<domain>"
    ranges = {
        "friday"    = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "monday"    = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "saturday"  = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "sunday"    = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "thursday"  = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "tuesday"   = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
        "wednesday" = "00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00"
    }
}

Create a passive service (named Downtime Alert in my example) and put it in CRITICAL state. The notification should be suppressed due to the scheduled downtime.

Wait until the service is not in scheduled downtime and a notification is sent.

Today, a Thursday, my test setup skipped the rest of the Thursday ranges and scheduled the first Friday range (starting at midnight) as the next downtime, when it was supposed to schedule the next Thursday range at 10:00 CEST. This left the service without a downtime and triggered the notification. The created downtime looked like this:

object Downtime "f8203466-f753-4a2d-9382-49266f3f070f" ignore_on_error {
        author = "<author>"
        authoritative_zone = "master"
        comment = "This is just to test the rescheduling bug"
        config_owner = "icinga1.test.<domain>!Downtime Alert!Downtime bug tester touch 2"
        config_owner_hash = "1d25acc66530b563012bd9ca5b6389fb0c2451cfa3e3f8528766ba8afc740edb"
        duration = 0.000000
        end_time = 1693540800.000000
        entry_time = 1693468800.829512
        fixed = true
        host_name = "icinga1.test.<domain>"
        parent = ""
        scheduled_by = "icinga1.test.<domain>!Downtime Alert!Downtime bug tester touch 2"
        service_name = "Downtime Alert"
        start_time = 1693519200.000000
        triggered_by = ""
        version = 1693468800.830124
        zone = "master"
}

So the downtime was created at 2023-08-31 10:00:00, but it covers none of the remaining periods in the Thursday range: 10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,20:00-24:00. Instead, it has the start time 2023-09-01 00:00:00 and the end time 2023-09-01 06:00:00, which is the first period of the Friday range.
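For anyone who wants to double-check the timestamps, here is a minimal Python sketch that decodes the three epoch values from the Downtime object above. It assumes the test host is on Central European Summer Time (UTC+2); the report does not state the time zone explicitly, but that offset matches the local times quoted.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time on the test host is CEST (UTC+2).
CEST = timezone(timedelta(hours=2), "CEST")

# Values copied verbatim from the Downtime object in the report.
downtime = {
    "entry_time": 1693468800.829512,
    "start_time": 1693519200.0,
    "end_time": 1693540800.0,
}


def to_local(epoch_seconds):
    """Convert a Unix timestamp to a CEST wall-clock string (second precision)."""
    return datetime.fromtimestamp(epoch_seconds, tz=CEST).strftime("%Y-%m-%d %H:%M:%S")


decoded = {name: to_local(value) for name, value in downtime.items()}
for name, value in decoded.items():
    print(f"{name:<10} {value}")
# entry_time 2023-08-31 10:00:00
# start_time 2023-09-01 00:00:00
# end_time   2023-09-01 06:00:00
```

The gap between entry_time and start_time is 14 hours, which is exactly the part of Thursday that was left without a downtime.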

Expected behaviour

The downtime for the immediately following range should be created, not the one for the range after it.
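To make the expectation concrete, here is a small Python model of what I would expect to happen. It is only a sketch of the expected result, not how Icinga 2 computes it internally: `parse_segments` and `next_segment` are hypothetical helper names, only weekday keys are handled, and times are naive local times.

```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# The per-day spec from the reproduction config (identical for every weekday).
SPEC = ("00:00-06:00,06:00-07:00,07:00-08:00,08:00-09:00,09:00-10:00,"
        "10:00-11:00,11:00-11:30,11:30-12:00,12:00-12:30,12:30-13:00,"
        "13:00-13:30,13:30-14:00,14:00-14:30,14:30-15:00,15:00-15:30,"
        "15:30-16:00,16:00-17:00,17:00-18:00,18:00-19:00,19:00-20:00,"
        "20:00-24:00")
ranges = {day: SPEC for day in WEEKDAYS}


def parse_segments(spec, day):
    """Turn "HH:MM-HH:MM,..." into sorted (start, end) datetimes on `day`."""
    midnight = datetime(day.year, day.month, day.day)
    segments = []
    for part in spec.split(","):
        begin, end = part.strip().split("-")
        begin_h, begin_m = map(int, begin.split(":"))
        end_h, end_m = map(int, end.split(":"))
        # "24:00" becomes midnight of the following day via timedelta.
        segments.append((
            midnight + timedelta(hours=begin_h, minutes=begin_m),
            midnight + timedelta(hours=end_h, minutes=end_m),
        ))
    return sorted(segments)


def next_segment(ranges, now):
    """Return the earliest segment that ends after `now`, or None."""
    for offset in range(8):  # today plus the following seven days
        day = now.date() + timedelta(days=offset)
        spec = ranges.get(WEEKDAYS[day.weekday()])
        if not spec:
            continue
        for start, end in parse_segments(spec, day):
            if end > now:
                return start, end
    return None


# The moment the skipped downtime was created (entry_time in local time).
now = datetime(2023, 8, 31, 10, 0, 0, 829512)
expected = next_segment(ranges, now)
print(f"{expected[0]} -> {expected[1]}")
# 2023-08-31 10:00:00 -> 2023-08-31 11:00:00
```

With this model the next downtime at 10:00 on Thursday is 10:00-11:00 on the same day, whereas the downtime that was actually created starts at 00:00 on Friday.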

My environment

minatoyama commented 1 year ago

I cannot submit the whole debug log because it contains sensitive data, but here is an excerpt of it containing the debug/ScheduledDowntime entries: debug_log_bug_caught.log.gz

Al2Klimov commented 6 months ago