Open offsides opened 5 months ago
Hello and thank you for reporting!
When I enable debuglog, I get log messages that it's executing the EventCommand, but the actual command never executes.
Please share those log messages.
event_command = "test_event" enable_event_handler = true # should not be needed, as it defaults to true
The complete service config would be interesting, especially whether there's a command endpoint.
Best, A/K
Here's the service definition as well as the log output:
apply Service "Linux Update Status" {
check_command = "systemd_service"
assign where host.vars.os == "Linux" && host.vars.agent_endpoint
command_endpoint = host.vars.agent_endpoint
vars.systemd_service_name = "linux_update"
event_command = "test_event"
enable_event_handler = true
}
[2024-09-06 10:58:29 -0400] debug/CheckerComponent: Scheduling info for checkable 'wamonitor2.iwco.com!Linux Update Status' (2024-09-06 10:58:29 -0400): Object 'wamonitor2.iwco.com!Linux Update Status', Next Check: 2024-09-06 10:58:29 -0400(1.72563e+09).
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::SetForceNextCheck' message
[2024-09-06 10:58:29 -0400] debug/CheckerComponent: Executing check for 'wamonitor2.iwco.com!Linux Update Status'
[2024-09-06 10:58:29 -0400] debug/Checkable: Update checkable 'wamonitor2.iwco.com!Linux Update Status' with check interval '300' from last check time at 2024-09-06 10:58:17 -0400 (1.72563e+09) to next check time at 2024-09-06 11:03:26 -0400 (1.72564e+09).
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::SetLastCheckStarted' message
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2024-09-06 10:58:29 -0400] notice/Process: Running command '/usr/lib64/nagios/plugins/check_systemd_service' 'linux_update': PID 2371701
[2024-09-06 10:58:29 -0400] debug/CheckerComponent: Check finished for object 'wamonitor2.iwco.com!Linux Update Status'
[2024-09-06 10:58:29 -0400] notice/Process: PID 2371701 ('/usr/lib64/nagios/plugins/check_systemd_service' 'linux_update') terminated with exit code 2
[2024-09-06 10:58:29 -0400] debug/Checkable: Update checkable 'wamonitor2.iwco.com!Linux Update Status' with check interval '300' from last check time at 2024-09-06 10:58:29 -0400 (1.72563e+09) to next check time at 2024-09-06 10:59:26 -0400 (1.72563e+09).
[2024-09-06 10:58:29 -0400] debug/IcingaDB: Firing and forgetting query: 'HSET' 'icinga:service:state' '6e36aeeaee810c6a7fe657347f2ead37359468b7' '{"check_attempt":1,"check_commandline":"'/usr/lib64/nagios/pl...'
[2024-09-06 10:58:29 -0400] debug/IcingaDB: Firing and forgetting query: 'HSET' 'icinga:checksum:service:state' '6e36aeeaee810c6a7fe657347f2ead37359468b7' '{"checksum":"e387ecba860ca95c1957f13aa8cef6e35b6cd959"}'
[2024-09-06 10:58:29 -0400] debug/IcingaDB: Firing and forgetting query: 'ZADD' 'icinga:nextupdate:service' '1725634826.871226' '6e36aeeaee810c6a7fe657347f2ead37359468b7'
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2024-09-06 10:58:29 -0400] notice/ApiListener: Relaying 'event::CheckResult' message
[2024-09-06 10:58:29 -0400] debug/IcingaDB: Firing and forgetting query: 'XADD' 'icinga:runtime:state' 'MAXLEN' '~' '1000000' '*' 'runtime_type' ...
[2024-09-06 10:58:29 -0400] notice/Checkable: State Change: Checkable 'wamonitor2.iwco.com!Linux Update Status' soft state change from OK to CRITICAL detected.
[2024-09-06 10:58:29 -0400] notice/Checkable: Executing event handler 'test_event' for checkable 'wamonitor2.iwco.com!Linux Update Status'
[2024-09-06 10:58:29 -0400] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'wamonitor2.iwco.com'
Just another tidbit - I ran strace against my running icinga2 processes following forks, and don't find the string 'test_event' anywhere, making me think that it's never actually being sent. Looking at the source file lib/icinga/checkable-event.cpp, I suspect it may have something to do with the if statement on line 52:
if (endpoint && !GetExtension("agent_check")) {
If that conditional is somehow false, no message gets sent to the listener to perform the check. That said, based on my digging of the code, it looks like the "agent_check" extension only get set in lib/icinga/clusterevents-check.cpp:ClusterEvents::ExecuteCheckFromQueue, but I don't know enough about icinga2's internals to understand if that's involved here or not. For reference, I'm running a master/satellite/endpoint setup, but the issue in question is happening on the master itself. Interestingly, I just tried it on both one of my satellites and one of my endpoints, and it works on both of them, so it's only on the master that the event handlers are failing (and, coincidentally, that's where I need them to run the most...)
It looks like this may be related to #9259, which looks like it's designed to make sure commands don't run on both masters in a HA situation. But it may also be preventing event handlers from running locally on a single host master...
Describe the bug
I have an event handler set up to restart a service when a particular check fails. It doesn't actually execute, no matter what I've tried. When I enable debuglog, I get log messages that it's executing the EventCommand, but the actual command never executes. I have used inotifywait to watch the directory with the script and nothing touches it, I have used strace to trace the Icinga2 daemon and it never tries to execute the command, it's as though the whole event handler system only exists as window-dressing with nothing underneath it. I even set up a test event handler to just log that it ran and nothing happens.
To Reproduce
Create the event command:
Expected behavior
The event handler should run (in the above case, it should create/append to /var/log/icinga2/test_event_handler.log.
Your Environment
Running on RHEL 8 using the Icinga2 packages from the Icinga software repository subscription.
I do run a Master/Satellite/Server setup, but the affected service and event handler are all on the master.