Closed: leeclemens closed this issue 3 years ago
Did icinga2 log any messages that contain PIDs of these zombie processes?
Yes, but not for all of the PIDs that are currently defunct. Of the PIDs that are in the icinga2 logs, there are either 3 or 5 lines:
[2021-08-24 04:11:32 -0400] warning/Process: Killing process group 3953756 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_osd_df' '-C' '90' '-W' '80') after timeout of 60 seconds
[2021-08-24 04:11:32 -0400] warning/Process: Couldn't kill the process group 3953756 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_osd_df' '-C' '90' '-W' '80'): [errno 1] Operation not permitted
[2021-08-24 04:11:32 -0400] warning/PluginCheckTask: Check command for object 'localhost' (PID: 3953756, arguments: '/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_osd_df' '-C' '90' '-W' '80') terminated with exit code 128, output: <Timeout exceeded.>
[2021-08-24 08:02:48 -0400] warning/Process: Terminating process 503456 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost') after timeout of 60 seconds
[2021-08-24 08:02:48 -0400] warning/Process: Couldn't terminate the process 503456 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost'): [errno 1] Operation not permitted
[2021-08-24 08:02:54 -0400] warning/Process: Killing process group 503456 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost') after timeout of 66 seconds
[2021-08-24 08:02:54 -0400] warning/Process: Couldn't kill the process group 503456 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost'): [errno 1] Operation not permitted
[2021-08-24 08:02:54 -0400] warning/PluginCheckTask: Check command for object 'localhost!ceph_mds' (PID: 503456, arguments: '/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost') terminated with exit code 128, output: <Timeout exceeded.>
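To check which defunct PIDs never show up in the log, something like the following could be used. This is only a sketch: the regular expression is derived from the three message shapes in the excerpt above, and the helper names `logged_pids` and `unlogged_zombies` are made up for illustration.

```python
import re

# Matches the PID in the three message shapes from the excerpt:
#   "... process group <pid> ...", "... process <pid> ...", "(PID: <pid>, ..."
PID_PATTERN = re.compile(r"(?:process group|process|\(PID:)\s+(\d+)")


def logged_pids(log_lines):
    """Return the set of PIDs mentioned in the given icinga2 log lines."""
    pids = set()
    for line in log_lines:
        match = PID_PATTERN.search(line)
        if match:
            pids.add(int(match.group(1)))
    return pids


def unlogged_zombies(zombie_pids, log_lines):
    """Return the zombie PIDs that never appear in the log, sorted."""
    return sorted(set(zombie_pids) - logged_pids(log_lines))
```

Feeding it the PIDs of the currently defunct processes and the contents of the icinga2 log would show which zombies were never reported as timed out.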
That might explain what's going on here: when Icinga can't kill the process, it probably gives up waiting for the child. If I remember correctly, Icinga only calls `waitpid()` on specific PIDs, so if that happens, the zombie probably stays around forever.
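For anyone unfamiliar with the mechanism: a child that has exited stays a zombie until its parent waits on it. The Python sketch below only illustrates that general POSIX behaviour on Linux (it reads `/proc`); it is not Icinga's code, and all function names are made up for the example.

```python
import os


def proc_state(pid):
    """Return the state letter from /proc/<pid>/stat, or None if the PID is gone (Linux only)."""
    try:
        with open(f"/proc/{pid}/stat") as stat_file:
            # The state letter is the first field after the parenthesised command name.
            return stat_file.read().rsplit(")", 1)[1].split()[0]
    except (FileNotFoundError, ProcessLookupError):
        return None


def spawn_exiting_child():
    """Fork a child that exits at once; return its PID while it is still unreaped."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    # WNOWAIT blocks until the child has exited but leaves it as a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    return pid


def reap_all_exited():
    """Reap every already-exited child without blocking; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

A child returned by `spawn_exiting_child()` shows state `Z` until either `os.waitpid(pid, 0)` is called for that exact PID or a catch-all loop such as `reap_all_exited()` runs. If the parent only ever waits on specific PIDs and stops waiting on one of them, nothing else will collect it.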
@leeclemens Is https://github.com/Icinga/icinga2/issues/8723#issuecomment-822469119 a reasonable workaround for you? Icinga 2 then could kill the checks and they wouldn’t become zombies.
Btw: do they vanish on reload?
@Al2Klimov I'll test and report back. I would have the same concerns about it granting icinga too much permission to kill. In my case restarting the icinga2 service does cause the zombies to vanish. Using an Event Handler to restart icinga2 seems to be working 90% of the time.
I haven't had much luck getting this to occur since adding the Event Handler and subsequently removing it. Some other changes were made that likely fixed why it was occurring in the first place - but I think the concerns raised with the potential fix are significant.
Describe the bug
A number of defunct/zombie processes accumulate. There are 10-30 per day (checks run at 5-minute intervals), and they do not seem to be related to master reloads (as previous bugs/forum posts reference).
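A rough way to count the zombies and see which parent they belong to is to filter `ps` output on the `STAT` column. The sketch below assumes procps-style `ps -eo pid,ppid,stat,comm` output; the function names are hypothetical.

```python
import subprocess


def parse_zombies(ps_output, parent_pid=None):
    """Return (pid, ppid, command) for each zombie row, optionally for one parent only."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, command = fields
        # A STAT value starting with "Z" marks a defunct (zombie) process.
        if stat.startswith("Z") and (parent_pid is None or int(ppid) == parent_pid):
            zombies.append((int(pid), int(ppid), command.strip()))
    return zombies


def list_zombies(parent_pid=None):
    """Run ps and return the current zombies, optionally limited to one parent PID."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_zombies(result.stdout, parent_pid)
```

Calling `list_zombies(parent_pid=...)` with the PID of the icinga2 process would limit the result to zombies that icinga2 itself has not reaped.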
To Reproduce
I have only seen this issue with plugins where the `CheckCommand`'s `command` includes `sudo`.
Define a `CheckCommand` similar to:
`ps -efH`
Expected behavior
Icinga2 reaps child `CheckCommand`/plugin processes
Screenshots
N/A
Your Environment
CentOS 7.9 Icinga2 2.13.0-1
* Version used (`icinga2 --version`):
```
# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: 2.13.0-1)
Copyright (c) 2012-2021 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later
```
* Operating System and version:
```
# uname -a
Linux localhost 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```
```
# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
```
```
# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL=https://www.centos.org/
BUG_REPORT_URL=https://bugs.centos.org/
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
```
* Enabled features (`icinga2 feature list`):
```
# icinga2 feature list
Disabled features: compatlog debuglog elasticsearch gelf icingadb influxdb influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command graphite ido-mysql mainlog notification
```
* Config validation (`icinga2 daemon -C`):
* `zones.conf` file (or `icinga2 object list --type Endpoint` and `icinga2 object list --type Zone`) from all affected nodes. N/A

Additional context
Add any other context about the problem here.