Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 579 forks source link

Event Commands not actually executing #10094

Open offsides opened 5 months ago

offsides commented 5 months ago

Describe the bug

I have an event handler set up to restart a service when a particular check fails. It doesn't actually execute, no matter what I've tried. When I enable debuglog, I get log messages that it's executing the EventCommand, but the actual command never executes. I have used inotifywait to watch the directory with the script and nothing touches it, I have used strace to trace the Icinga2 daemon and it never tries to execute the command, it's as though the whole event handler system only exists as window-dressing with nothing underneath it. I even set up a test event handler to just log that it ran and nothing happens.

To Reproduce

  1. Create event handler script:
    # cat /usr/lib64/nagios/plugins/eventhandlers/test_event_handler 
    #!/usr/bin/bash
    LOGFILE=/var/log/icinga2/test_event_handler.log
    CMD="'$0'"
    for a in "$@"
    do
        CMD="${CMD} '$a'"
    done
    echo "$$: `date`" >> ${LOGFILE}
    echo "${CMD}" >> ${LOGFILE}
    # ls -al /usr/lib64/nagios/plugins/eventhandlers/test_event_handler
    -rwxr-xr-x. 1 root root 176 Jun 20 11:34 /usr/lib64/nagios/plugins/eventhandlers/test_event_handler
  2. Create the event command:

    object EventCommand "test_event" {
        command = [ PluginDir + "/eventhandlers/test_event_hander" ]
    
        arguments = {
                "-s" = {
                        value = "$service.state$"
                        description = "State of the service being handled."
                }
                "-t" = {
                        value = "$service.state_type$"
                        description = "State type of the service being handled."
                }
                "-a" = {
                        value = "$service.check_attempt$"
                        description = "Check attempt of the service being handled."
                }
        }
    }
  3. Add the relevant pieces to the service definition:
    event_command = "test_event"
    enable_event_handler = true # should not be needed, as it defaults to true
  4. Reload icinga2
  5. Manually cause the service check to change state then look for the log file

Expected behavior

The event handler should run (in the above case, it should create/append to /var/log/icinga2/test_event_handler.log.

Your Environment

Running on RHEL 8 using the Icinga2 packages from the Icinga software repository subscription.

# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.14.2-1)

Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Red Hat Enterprise Linux
  Platform version: 8.10 (Ootpa)
  Kernel: Linux
  Kernel version: 4.18.0-513.24.1.el8_9.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 8.5.0
  Build host: unknown
  OpenSSL version: OpenSSL 1.1.1k  FIPS 25 Mar 2021

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
# icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf influxdb influxdb2 journald opentsdb perfdata syslog
Enabled features: api checker graphite icingadb livestatus mainlog notification
# icinga2 daemon -C
[2024-06-20 12:14:03 -0400] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-06-20 12:14:03 -0400] information/cli: Loading configuration file(s).
[2024-06-20 12:14:03 -0400] information/ConfigItem: Committing config item(s).
[2024-06-20 12:14:03 -0400] information/ApiListener: My API identity: wamonitor2.iwco.com
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 LivestatusListener.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 GraphiteWriter.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 2697 Dependencies.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 15 Users.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 13 UserGroups.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 3 TimePeriods.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 ServiceGroup.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 2634 Services.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 84 Zones.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 4 NotificationCommands.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 9446 Notifications.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 FileLogger.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 431 Hosts.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 27 HostGroups.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 2 EventCommands.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 82 Endpoints.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 Comment.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 3 ApiUsers.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 ApiListener.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 264 CheckCommands.
[2024-06-20 12:14:05 -0400] information/ConfigItem: Instantiated 1 IcingaDB.
[2024-06-20 12:14:05 -0400] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2024-06-20 12:14:05 -0400] information/cli: Finished validating the configuration file(s).

I do run a Master/Satellite/Server setup, but the affected service and event handler are all on the master.

Al2Klimov commented 3 months ago

Hello and thank you for reporting!

When I enable debuglog, I get log messages that it's executing the EventCommand, but the actual command never executes.

Please share those log messages.

event_command = "test_event"
enable_event_handler = true # should not be needed, as it defaults to true

The complete service config would be interesting, especially whether there's a command endpoint.

Best, A/K

offsides commented 2 months ago

Here's the service definition as well as the log output:

apply Service "Linux Update Status" {
        check_command = "systemd_service"

        assign where host.vars.os == "Linux" && host.vars.agent_endpoint
        command_endpoint = host.vars.agent_endpoint

        vars.systemd_service_name = "linux_update"
        event_command = "test_event"
        enable_event_handler = true
}
[2024-09-06 10:58:29 -0400] debug/CheckerComponent: Scheduling info for checkable 'wamonitor2.iwco.com!Linux Update Status' (2024-09-06 10:58:29 -0400): Object 'wamonitor2.iwco.com!Linux Update Status', Next Check: 2024-09-06 10:58:29 -0400(1.72563e+09).
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::SetForceNextCheck' message
[2024-09-06 10:58:29 -0400] debug/CheckerComponent: Executing check for 'wamonitor2.iwco.com!Linux Update Status'
[2024-09-06 10:58:29 -0400] debug/Checkable: Update checkable 'wamonitor2.iwco.com!Linux Update Status' with check interval '300' from last check time at 2024-09-06 10:58:17 -0400 (1.72563e+09) to next check time at 2024-09-06 11:03:26 -0400 (1.72564e+09).
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::SetLastCheckStarted' message
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2024-09-06 10:58:29 -0400] notice/Process: Running command '/usr/lib64/nagios/plugins/check_systemd_service' 'linux_update': PID 2371701
[2024-09-06 10:58:29 -0400] debug/CheckerComponent: Check finished for object 'wamonitor2.iwco.com!Linux Update Status'
[2024-09-06 10:58:29 -0400] notice/Process: PID 2371701 ('/usr/lib64/nagios/plugins/check_systemd_service' 'linux_update') terminated with exit code 2
[2024-09-06 10:58:29 -0400] debug/Checkable: Update checkable 'wamonitor2.iwco.com!Linux Update Status' with check interval '300' from last check time at 2024-09-06 10:58:29 -0400 (1.72563e+09) to next check time at 2024-09-06 10:59:26 -0400 (1.72563e+09).
[2024-09-06 10:58:29 -0400] debug/IcingaDB: Firing and forgetting query: 'HSET' 'icinga:service:state' '6e36aeeaee810c6a7fe657347f2ead37359468b7' '{"check_attempt":1,"check_commandline":"'/usr/lib64/nagios/pl...'
[2024-09-06 10:58:29 -0400] debug/IcingaDB: Firing and forgetting query: 'HSET' 'icinga:checksum:service:state' '6e36aeeaee810c6a7fe657347f2ead37359468b7' '{"checksum":"e387ecba860ca95c1957f13aa8cef6e35b6cd959"}'
[2024-09-06 10:58:29 -0400] debug/IcingaDB: Firing and forgetting query: 'ZADD' 'icinga:nextupdate:service' '1725634826.871226' '6e36aeeaee810c6a7fe657347f2ead37359468b7'
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::CheckResult' message
[2024-09-06 10:58:29 -0400] debug/IcingaDB: Firing and forgetting query: 'XADD' 'icinga:runtime:state' 'MAXLEN' '~' '1000000' '*' 'runtime_type' ...
[2024-09-06 10:58:29 -0400] notice/Checkable: State Change: Checkable 'wamonitor2.iwco.com!Linux Update Status' soft state change from OK to CRITICAL detected.
[2024-09-06 10:58:29 -0400] notice/Checkable: Executing event handler 'test_event' for checkable 'wamonitor2.iwco.com!Linux Update Status'
[2024-09-06 10:58:29 -0400] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'wamonitor2.iwco.com'
offsides commented 2 months ago

Just another tidbit - I ran strace against my running icinga2 processes following forks, and don't find the string 'test_event' anywhere, making me think that it's never actually being sent. Looking at the source file lib/icinga/checkable-event.cpp, I suspect it may have something to do with the if statement on line 52:

if (endpoint && !GetExtension("agent_check")) {

If that conditional is somehow false, no message gets sent to the listener to perform the check. That said, based on my digging of the code, it looks like the "agent_check" extension only get set in lib/icinga/clusterevents-check.cpp:ClusterEvents::ExecuteCheckFromQueue, but I don't know enough about icinga2's internals to understand if that's involved here or not. For reference, I'm running a master/satellite/endpoint setup, but the issue in question is happening on the master itself. Interestingly, I just tried it on both one of my satellites and one of my endpoints, and it works on both of them, so it's only on the master that the event handlers are failing (and, coincidentally, that's where I need them to run the most...)

offsides commented 2 months ago

It looks like this may be related to #9259, which looks like it's designed to make sure commands don't run on both masters in a HA situation. But it may also be preventing event handlers from running locally on a single host master...