Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Zombie CheckCommand processes #8981

Closed leeclemens closed 3 years ago

leeclemens commented 3 years ago

Describe the bug

A number of defunct/zombie processes accumulate. About 10-30 appear per day (checks run at 5-minute intervals), and they do not seem to be related to master reloads (which is what previous bug reports and forum threads point to).

To Reproduce

I have only seen this issue with plugins where the CheckCommand's command includes sudo.

  1. Define a CheckCommand similar to:

    object CheckCommand "ceph_mgr" {
        import "plugin-check-command"

        command = [
            "/usr/bin/sudo",
            PluginContribDir + "/check_ceph_mgr"
        ]
    }
  2. Wait for it to run. The problem seems to occur 10-30 times a day.
  3. Run ps -efH:
    icinga   3521095       1  0 Aug10 ?        00:03:44   /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e 
    icinga    929272 3521095  0 Aug16 ?        00:05:57     /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -
    icinga    929281  929272  0 Aug16 ?        00:00:25       /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio
    root     1124248  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1124412  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1124927  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1124998  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1125732  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1125752  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1126253  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1126370  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1126395  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1126877  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>
    root     1127118  929281  0 Aug16 ?        00:00:00         [sudo] <defunct>

Expected behavior

Icinga2 reaps child CheckCommand/plugin processes

Screenshots

N/A

Your Environment

CentOS 7.9 Icinga2 2.13.0-1

* Version used (`icinga2 --version`):

```
# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: 2.13.0-1)

Copyright (c) 2012-2021 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1160.36.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-322-concurrent-0
  OpenSSL version: OpenSSL 1.0.2k-fips 26 Jan 2017

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
```

* Operating System and version:

```
# uname -a
Linux localhost 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```

```
# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
```

```
# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL=https://www.centos.org/
BUG_REPORT_URL=https://bugs.centos.org/
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
```

* Enabled features (`icinga2 feature list`):

```
# icinga2 feature list
Disabled features: compatlog debuglog elasticsearch gelf icingadb influxdb influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command graphite ido-mysql mainlog notification
```
# icinga2 daemon -C
[2021-08-23 13:18:25 -0400] information/cli: Icinga application loader (version: 2.13.0-1)
[2021-08-23 13:18:25 -0400] information/cli: Loading configuration file(s).
[2021-08-23 13:18:26 -0400] information/ConfigItem: Committing config item(s).
[2021-08-23 13:18:26 -0400] information/ApiListener: My API identity: wsc-salt01.cyber-center.com
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 1 GraphiteWriter.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 1 NotificationComponent.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 488 Services.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 44 Zones.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 6 HostGroups.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 43 Hosts.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 42 Endpoints.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 1 FileLogger.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 3 ApiUsers.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 266 CheckCommands.
[2021-08-23 13:18:26 -0400] information/ConfigItem: Instantiated 1 ApiListener.
[2021-08-23 13:18:26 -0400] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-08-23 13:18:26 -0400] information/cli: Finished validating the configuration file(s).

julianbrost commented 3 years ago

Did icinga2 log any messages that contain PIDs of these zombie processes?

leeclemens commented 3 years ago

Yes, but not for all of the PIDs that are currently defunct. Of the PIDs that are in the icinga2 logs, there are either 3 or 5 lines:

[2021-08-24 04:11:32 -0400] warning/Process: Killing process group 3953756 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_osd_df' '-C' '90' '-W' '80') after timeout of 60 seconds
[2021-08-24 04:11:32 -0400] warning/Process: Couldn't kill the process group 3953756 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_osd_df' '-C' '90' '-W' '80'): [errno 1] Operation not permitted
[2021-08-24 04:11:32 -0400] warning/PluginCheckTask: Check command for object 'localhost' (PID: 3953756, arguments: '/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_osd_df' '-C' '90' '-W' '80') terminated with exit code 128, output: <Timeout exceeded.>
[2021-08-24 08:02:48 -0400] warning/Process: Terminating process 503456 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost') after timeout of 60 seconds
[2021-08-24 08:02:48 -0400] warning/Process: Couldn't terminate the process 503456 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost'): [errno 1] Operation not permitted
[2021-08-24 08:02:54 -0400] warning/Process: Killing process group 503456 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost') after timeout of 66 seconds
[2021-08-24 08:02:54 -0400] warning/Process: Couldn't kill the process group 503456 ('/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost'): [errno 1] Operation not permitted
[2021-08-24 08:02:54 -0400] warning/PluginCheckTask: Check command for object 'localhost!ceph_mds' (PID: 503456, arguments: '/usr/bin/sudo' '/usr/lib64/nagios/plugins/check_ceph_mds' '-f' 'myhost.nfs' '-n' 'localhost') terminated with exit code 128, output: <Timeout exceeded.>
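
For readers unfamiliar with the `[errno 1] Operation not permitted` message in these log lines: the `ps` output in the report lists the `sudo` processes as owned by root while the daemon runs as `icinga`, and the error is consistent with an unprivileged process trying to signal a process it has no permission to signal. The following is a minimal Python sketch of that failure mode, not Icinga's code; the helper name `try_signal` is hypothetical, and signal `0` is used so that nothing is actually killed.

```python
import errno
import os


def try_signal(pid, sig=0):
    """Send sig to pid (0 only checks permissions and existence).

    Returns None on success, or the symbolic errno name on failure.
    """
    try:
        os.kill(pid, sig)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


if __name__ == "__main__":
    # Signalling ourselves is always allowed.
    print("self:", try_signal(os.getpid()))
    # PID 1 belongs to root; an unprivileged caller would normally see EPERM here.
    print("pid 1:", try_signal(1))
```

On Linux, `EPERM` is errno 1, which matches the number in the log lines. When run as root, the second call would be expected to succeed instead.
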

julianbrost commented 3 years ago

That might explain what's going on here: when Icinga can't kill the process, it probably gives up waiting for the child. If I remember correctly, Icinga only calls waitpid() on specific PIDs, so if that happens, the zombie probably stays around forever.
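
To illustrate the mechanism described here: a child that has exited stays in the process table as a zombie until its parent collects its status with `waitpid()`. The sketch below is a generic Python demonstration under the assumption of a Linux system with `/proc` mounted; it does not reproduce Icinga's process handling, and all function names are hypothetical. It also shows the alternative of calling `waitpid(-1, WNOHANG)` in a loop, which reaps any exited child rather than one specific PID.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state field is the first token after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def spawn_short_lived_child():
    """Fork a child that exits immediately; the parent deliberately does not wait."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    return pid


def wait_until_zombie(pid, timeout=5.0):
    """Poll until the child shows state 'Z' (zombie) or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc_state(pid) == "Z":
            return True
        time.sleep(0.01)
    return False


def reap_all_children():
    """Reap every already-exited child without blocking; return the PIDs collected."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


if __name__ == "__main__":
    child = spawn_short_lived_child()
    print("zombie before reaping:", wait_until_zombie(child))
    print("reaped:", reap_all_children())
    print("state after reaping:", proc_state(child))
```

A parent that only ever waits on specific PIDs and stops waiting for one of them leaves that child in the `Z` state until the parent itself exits, which would fit the observation later in the thread that restarting the service makes the zombies disappear.
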

Al2Klimov commented 3 years ago

@leeclemens Is https://github.com/Icinga/icinga2/issues/8723#issuecomment-822469119 a reasonable workaround for you? Icinga 2 then could kill the checks and they wouldn’t become zombies.

Al2Klimov commented 3 years ago

Btw: do they vanish on reload?

leeclemens commented 3 years ago

@Al2Klimov I'll test and report back. I would have the same concerns about it granting icinga too much permission to kill. In my case restarting the icinga2 service does cause the zombies to vanish. Using an Event Handler to restart icinga2 seems to be working 90% of the time.

leeclemens commented 3 years ago

I haven't had much luck getting this to occur since adding the Event Handler and subsequently removing it. Some other changes were made that likely fixed why it was occurring in the first place - but I think the concerns raised with the potential fix are significant.