NagiosEnterprises / ncpa

Nagios Cross-Platform Agent

NCPA service failed after reboot - RHEL #1068

Closed ramseydave closed 3 months ago

ramseydave commented 7 months ago

In some cases, after a reboot of a RHEL server, the NCPA service fails to start with the error below:

systemctl status ncpa

● ncpa.service - NCPA
   Loaded: loaded (/usr/lib/systemd/system/ncpa.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2023-12-05 08:24:44 GMT; 1s ago
...
ncpa[1321]: 2023-12-05 08:17:42,515 parent WARNING Daemon - check_pid() - Another instance is already running (pid 1299)
...

# As per below, another process is using the PID. So it seems that the NCPA PID file still holds the old PID and is not refreshed on reboot.
ps -ef | grep 1299
root 1299 1 0 11:32 ? 00:00:00 /usr/lib/systemd/systemd-logind
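
The manual `ps` lookup above can also be scripted. The following is only a sketch for Linux hosts: it assumes the PID file lives at the path quoted in the override.conf further down this thread (/usr/local/ncpa/var/run/ncpa.pid) and that /proc/<pid>/comm is readable. The function name `describe_pid_owner` and its `proc_root` parameter are made up for illustration and are not part of NCPA.

```python
from pathlib import Path


def describe_pid_owner(pid_file, proc_root="/proc"):
    """Return (pid, command name) for the PID recorded in pid_file.

    The command name is None when no process with that PID exists.
    Linux only: relies on <proc_root>/<pid>/comm.
    """
    pid = int(Path(pid_file).read_text().strip())
    comm_path = Path(proc_root) / str(pid) / "comm"
    try:
        return pid, comm_path.read_text().strip()
    except (FileNotFoundError, ProcessLookupError):
        # No such process (or it exited while we were reading).
        return pid, None


# Example call (the path is an assumption, adjust it to your install):
#   pid, owner = describe_pid_owner("/usr/local/ncpa/var/run/ncpa.pid")
#   print(f"PID file records {pid}; owner: {owner or 'no such process'}")
```

If the reported owner is something other than ncpa (systemd-logind in the output above), the PID file is stale and the PID has been handed to an unrelated process.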

tannermsmith1 commented 7 months ago

Same failure mode for me after some automated reboots in our environment for CentOS 7 and CentOS 8 Stream hosts.

parent INFO Daemon - start() - Initialize and run the daemon
ncpa[1053]: 2023-12-07 04:01:46,463 parent WARNING Daemon - check_pid() - Another instance is already running (pid 1047)
ncpa[1053]: Daemon - check_pid() - Another instance is already running (pid 1047)
ncpa[1053]: ***** Starting NCPA version:  3.0.0
systemd[1]: ncpa.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: ncpa.service: Failed with result 'exit-code'.

ne-bbahn commented 7 months ago

Did any of you modify your ncpa.service to use --start instead of the default -n?

soxfor commented 5 months ago

I'm facing the same issue, but on SLES 15. The unit was not changed from the default -n argument.

ncpa.service - NCPA
     Loaded: loaded (/usr/lib/systemd/system/ncpa.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Fri 2024-01-19 08:55:44 UTC; 6h ago
       Docs: https://www.nagios.org/ncpa
    Process: 1493 ExecStart=/usr/local/ncpa/ncpa -n (code=exited, status=1/FAILURE)
   Main PID: 1493 (code=exited, status=1/FAILURE)

Jan 19 08:55:44 host ncpa[1493]: 2024-01-19 08:55:44,649 root INFO main - Python version: 3.11.6 (main, Nov 20 2023, 07:42:02) [GCC 10.2.1 20210130 (Red Hat 10.2.1-11)]
Jan 19 08:55:44 host ncpa[1493]: 2024-01-19 08:55:44,651 root INFO main - SSL version: OpenSSL 3.0.8 7 Feb 2023
Jan 19 08:55:44 host ncpa[1493]: 2024-01-19 08:55:44,652 root INFO main - ZLIB version: 1.3
Jan 19 08:55:44 host ncpa[1493]: 2024-01-19 08:55:44,652 parent INFO Daemon - start() - Initialize and run the daemon
Jan 19 08:55:44 host ncpa[1493]: 2024-01-19 08:55:44,657 parent WARNING Daemon - check_pid() - Another instance is already running (pid 1493)

I've now set an override that specifies the PID file created by NCPA:

host:~ # cat /usr/lib/systemd/system/ncpa.service
[Unit]
Description=NCPA
Documentation=https://www.nagios.org/ncpa
After=network.target local-fs.target

[Service]
ExecStart=/usr/local/ncpa/ncpa -n

[Install]
WantedBy=multi-user.target
host:~ # cat /etc/systemd/system/ncpa.service.d/override.conf
[Service]
PIDFile=/usr/local/ncpa/var/run/ncpa.pid

I haven't yet rebooted the server to confirm whether it starts OK with this.

GldRush98 commented 5 months ago

Thanks Soxfor. I just tested this on my Fedora 39 box. It fixed the ncpa starting problem for me; I can now reboot and ncpa starts up correctly. Thanks for that workaround :)

Edit: after thinking about this a while, I can't confirm this is a fix actually. This problem seems to be infrequent. Most of the time everything works fine. Maybe some clean up isn't happening every time at shutdown or something?

soxfor commented 5 months ago

@GldRush98 no problem. As a workaround it works: the server booted with no error on the NCPA service. Confirmed on my side as well.

This could be an error in the NCPA stop/start steps, so although this isn't a fix per se, it does provide a way of getting a successful service start in the meantime.

mbbv commented 5 months ago

My understanding is that NCPA fails to start when the PID file is present and the PID in it corresponds to an existing process. If the file exists but the PID in the file is not in use, the process starts fine. This explains why most of the time the service starts up fine, but sometimes it doesn't.
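
The behaviour described above can be illustrated with a small sketch. This is not NCPA's actual check_pid() code, only an assumed simplification of a check that trusts the PID file alone; the names `pid_in_use` and `start_blocked` are hypothetical.

```python
import os


def pid_in_use(pid):
    """True if any process currently has this PID (signal 0 only probes)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not a PID
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def start_blocked(pid_file):
    """Mimic an 'already running' check based only on the PID file."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no usable PID file, nothing blocks the start
    # Any live process with this PID blocks the start, even an unrelated
    # one that was assigned the same number after a reboot.
    return pid_in_use(pid)
```

Under this assumption, a leftover PID file is harmless when its PID is free and fatal when the number has been reused, which matches the intermittent failures reported in this thread. A stricter check would presumably also compare the owning process's name before refusing to start.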

Adding this to /usr/lib/systemd/system/ncpa.service seems to fix the issue: ExecStop=/usr/local/ncpa/ncpa --stop

ne-bbahn commented 4 months ago

Fixed in:
https://github.com/NagiosEnterprises/ncpa/pull/1107
https://github.com/NagiosEnterprises/ncpa/pull/1115
https://github.com/NagiosEnterprises/ncpa/pull/1121