Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Icinga2 does not generate DOWNTIMESTART notifications for hosts with state down. #5202

Closed: jkroepke closed this issue 2 years ago

jkroepke commented 7 years ago

If we create a fixed downtime for a host that is already in a DOWN state, Icinga 2 does not generate a DOWNTIMESTART notification.

Flexible downtimes do send a DOWNTIMESTART notification.

Expected Behavior

All downtimes should send a DOWNTIMESTART notification.

Current Behavior

See description.

Possible Solution

Steps to Reproduce (for bugs)

  1. Set a downtime on a host that is in state DOWN.
  2. Check your mailbox: nothing arrives (see the event stream sketch below).
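
Instead of checking a mailbox, the missing notification can also be observed directly on the API event stream (the same technique used in the curl examples further down in this thread); a minimal sketch, assuming a local API user named root:

curl -H "Accept: application/json" -k -s -u root -X POST 'https://localhost:5665/v1/events?queue=debugnotifications&types=Notification'

For a host that is up, a DOWNTIMESTART event appears on this stream shortly after the downtime is scheduled; for a host that is already DOWN, nothing shows up.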

Context

We have an in-house SLA reporting and incident tool. All actions from Icinga are transferred to the reporting tool via notifications.

Your Environment

dnsmichi commented 7 years ago

What parameters do you pass to such a downtime (start/end time, etc.)? Best would be a curl request against the REST API so the issue can be reproduced easily.

jkroepke commented 7 years ago

Okay, here are the curl requests:

Host is online:

curl -H "Accept: application/json" -k -s -u root -X POST -d '{ "type": "Host", "filter": "host.name==\"batman\"", "start_time": '$(date +%s)', "end_time": '$(date +%s --date="+30 seconds")', "author": "root", "comment": "test", "fixed": true, "duration": 30 }' -k "https://localhost:5665/v1/actions/schedule-downtime"
{"results":[{"code":200.0,"legacy_id":47.0,"name":"batman!icinga-1493796888-10","status":"Successfully scheduled downtime 'batman!icinga-1493796888-10' for object 'batman'."}]}
curl -H "Accept: application/json" -k -s -u root -X POST 'https://localhost:5665/v1/events?queue=debugnotifications&types=Notification'
......
{"author":"root","check_result":{"active":true,"check_source":"icinga","command":["/usr/local/monitoring/libexec/default/check_ping","-H","10.204.7.69","-c","5000,100%","-w","3000,80%"],"execution_end":1493796798.4722080231,"execution_start":1493796794.4702019691,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 0.42 ms","performance_data":null,"schedule_end":1493796798.4722321033,"schedule_start":1493796794.4699997902,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"batman","notification_type":"DOWNTIMESTART","text":"test","timestamp":1493796892.8843939304,"type":"Notification","users":["Team_Middleware"]}

Host is offline (Soft 1/3):

curl -H "Accept: application/json" -k -s -u root -X POST -d '{ "type": "Host", "filter": "host.name==\"batman\"", "start_time": '$(date +%s)', "end_time": '$(date +%s --date="+30 seconds")', "author": "root", "comment": "test", "fixed": true, "duration": 30 }' -k "https://localhost:5665/v1/actions/schedule-downtime"
{"results":[{"code":200.0,"legacy_id":48.0,"name":"batman!icinga-1493797061-11","status":"Successfully scheduled downtime 'batman!icinga-1493797061-11' for object 'batman'."}]}
curl -H "Accept: application/json" -k -s -u root -X POST 'https://localhost:5665/v1/events?queue=debugnotifications&types=Notification'
......

Host is offline (Hard 3/3):

curl -H "Accept: application/json" -k -s -u root -X POST -d '{ "type": "Host", "filter": "host.name==\"batman\"", "start_time": '$(date +%s)', "end_time": '$(date +%s --date="+30 seconds")', "author": "root", "comment": "test", "fixed": true, "duration": 30 }' -k "https://localhost:5665/v1/actions/schedule-downtime"
{"results":[{"code":200.0,"legacy_id":49.0,"name":"batman!icinga-1493797144-12","status":"Successfully scheduled downtime 'batman!icinga-1493797144-12' for object 'batman'."}]}
curl -H "Accept: application/json" -k -s -u root -X POST 'https://localhost:5665/v1/events?queue=debugnotifications&types=Notification'
......

dnsmichi commented 7 years ago

Hm, I have an idea about the host's raw state, which influences the downtime trigger in lib/icinga/downtime.cpp:137. Can you extract the attribute last_check_result for the affected host via /v1/objects/hosts for all three of your tests? I suspect your host's last_check_result.state is set to 1 and not 0.
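
For reference, a sketch of such a query; the attrs URL parameter of the objects API (assuming it is available in this version) limits the response to the attribute in question, and root and batman are the API user and host name used elsewhere in this thread:

curl -H "Accept: application/json" -k -s -u root "https://localhost:5665/v1/objects/hosts/batman?attrs=last_check_result" | python -m json.tool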

jkroepke commented 7 years ago

hm. It's 2.

curl -H "Accept: application/json" -k -s -u root "https://localhost:5665/v1/objects/hosts/batman" | python -m json.tool
{
    "results": [
        {
            "attrs": {
                "__name": "batman",
                ...
               "last_check_result": {
                    "active": false,
                    "check_source": "icinga",
                    "command": null,
                    "execution_end": 1493812075.0,
                    "execution_start": 1493812075.0,
                    "exit_status": 0.0,
                    "output": "DOWN",
                    "performance_data": [],
                    "schedule_end": 1493812075.0,
                    "schedule_start": 1493812075.0,
                    "state": 2.0,
                    "type": "CheckResult",
                    "vars_after": {
                        "attempt": 1.0,
                        "reachable": true,
                        "state": 2.0,
                        "state_type": 0.0
                    },
                    "vars_before": {
                        "attempt": 1.0,
                        "reachable": true,
                        "state": 0.0,
                        "state_type": 1.0
                    }
                },
                "last_hard_state": 0.0,
                "last_hard_state_change": 1493797254.984923,
                "last_reachable": true,
                ...
            }
        }
    ]
}

It does not matter whether the check is executed actively by Icinga or the check result is submitted passively via the API / Icinga Web 2.
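
A sketch of the passive path mentioned here, using the process-check-result action (for hosts, exit_status 1 maps to DOWN); host name and credentials are the ones from the earlier examples:

curl -H "Accept: application/json" -k -s -u root -X POST -d '{ "type": "Host", "filter": "host.name==\"batman\"", "exit_status": 1, "plugin_output": "DOWN" }' "https://localhost:5665/v1/actions/process-check-result"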

jkroepke commented 7 years ago

@dnsmichi Any news? Do you need more information?

dnsmichi commented 7 years ago

I have a possible fix in my stash, but I have not reproduced the issue yet (working on other issues at the moment).

diff --git a/lib/icinga/downtime.cpp b/lib/icinga/downtime.cpp
index 909ba7e8f..056e78cdf 100644
--- a/lib/icinga/downtime.cpp
+++ b/lib/icinga/downtime.cpp
@@ -134,7 +134,7 @@ void Downtime::Start(bool runtimeCreated)
         * this downtime now *after* it has been added (important
         * for DB IDO, etc.)
         */
-       if (checkable->GetStateRaw() != ServiceOK) {
+       if (!checkable->IsStateOK(checkable->GetStateRaw())) {
                Log(LogNotice, "Downtime")
                    << "Checkable '" << checkable->GetName() << "' already in a NOT-OK state."
                    << " Triggering downtime now.";

jkroepke commented 7 years ago

We have this problem only with fixed downtimes. Flexible downtimes are fine.
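
For comparison, the flexible variant that does generate the notification would be the same schedule-downtime request with fixed set to false (a sketch reusing the parameters from the requests above; the duration is in seconds):

curl -H "Accept: application/json" -k -s -u root -X POST -d '{ "type": "Host", "filter": "host.name==\"batman\"", "start_time": '$(date +%s)', "end_time": '$(date +%s --date="+30 seconds")', "author": "root", "comment": "test", "fixed": false, "duration": 30 }' "https://localhost:5665/v1/actions/schedule-downtime"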

dnsmichi commented 7 years ago

Thanks, that helps with reproducing it.

jkroepke commented 7 years ago

@dnsmichi This problem still exists in 2.7.0-r1

fscaptain commented 7 years ago

@dnsmichi Same behavior for services that are currently not in an OK state. (Icinga 2 version: r2.7.1-1)

ghost commented 6 years ago

r2.8.2-1: same problem for services. If a service is in a failed state, putting it into downtime does not trigger a notification.

tktr commented 2 years ago

Same here with version r2.13.1.