NagiosEnterprises / ncpa

Nagios Cross-Platform Agent
Other
177 stars 95 forks source link

NCPA doesn't properly track PID for stop/restart #921

Open MrPippin66 opened 1 year ago

MrPippin66 commented 1 year ago

NCPA agent = 2.3.1 OS = RHEL 7.9

In the case where NCPA wasn't cleanly shutdown on Unix/Linux, NCPA doesn't properly check the status for stopping/restarting the agent from the the data in /usr/local/ncpa/var/run.

In our case, when trying to stop/restart the agent, which had not properly been shutdown to a system crash, caused it to kill the pacemaker process, which happened to use the same PID after system restart.

This caused a failover event.

The agents in this case need to validate the process is actually the referenced agent before killing it.

However, for systemd based systems on Linux (which is practically all Linux distributions at this point), it should be using systemd integration to manage this, instead of using a PID file.

MrPippin66 commented 1 year ago

Though not a complete fix, I think most of this can be resolved for systemd based Linuxes by leveraging "tmpfiles", which is how the NRPE package manages part of there setup.

So...essentially, add the file "ncpa.conf" to /usr/lib/tmpfiles.d with below entries:

D /usr/local/ncpa/var/run 0775 root nagios -

pittagurneyi commented 1 year ago

While we are at it, we should probably take a hard look at the lack of hardening of the systemd unit files:

https://github.com/NagiosEnterprises/ncpa/blob/master/agent/build_resources/ncpa_passive.service https://github.com/NagiosEnterprises/ncpa/blob/master/agent/build_resources/ncpa_listener.service

One of the first search results when looking for systemd hardening was:

https://www.ctrl.blog/entry/systemd-service-hardening.html

As it has been a year since the last time I wrote systemd unit files, I would have to take a closer look myself, but the options

PrivateDevices=true
ProtectControlGroups=true
ProtectHome=true
ProtectKernelTunables=true
ProtectSystem=full
RestrictSUIDSGID=true

seem to be a good starting point. ProtectSystem makes /boot, /etc, and /usr all read-only. We'll now have to figure out if we can't make ncpa use default tmp and run directories, which would then be private as well, and not placed under /usr/local .

From that link:

You can take it one step further and manually specify ReadWritePaths, ReadOnlyPaths, and InaccessiblePaths. A utility like strace -e trace=%file program can help you identify what paths a program requires. Requirements will change over time so a setup like this requires ongoing monitoring.

You can go even further and change the root directory with the RootDirectory directive. This works like the traditional chroot command and lets you set up a mock file system and launch the service inside it.

So if ncpa is fully self-contained then perhaps we can use RootDirectory to restrict everything to the /usr/local/ncpa folder. The tmp/run directories should still be private and tmpfs.

I don't really believe it would need access to other services outside of /usr/local/ncpa, that might have sockets in global /run .

EDIT: or /var/run

But the monitoring stuff would then need access again I think, at least device access, so perhaps PrivateDevices would have to be turned off or sanely restricted. For getting this sorted I'd need much more knowledge of how exactly ncpa does its work. However, I don't really like unrestricted root-running processes, like at all.

The moment the run directory becomes tmpfs mounted, the pid file won't be present upon reboot.

In general I don't really like the way ncpa fits into the filesystem. So for ncpa 3.0 I'd ask that someone takes a look at these questions and finds a solution.

ericloyd commented 1 year ago

Can I just say that I'm almost in tears with the new dedication on behalf of Nagios Enterprises to address long-standing issues with NCPA? Thanks!

pittagurneyi commented 1 year ago

Um, just to clear something up, I have nothing to do with Nagios Enterprises, like at all.

ericloyd commented 1 year ago

I know, @pittagurneyi. I'm just saying "in general, how happy I am that Nagios Enterprises has stepped up to start taking care of this piece of software." I'll agree that this was not the best place to say that. :-)

pittagurneyi commented 1 year ago

All good, I was just confused for a second there and wanted to clear it up, should that be necessary.

I'm quite happy that there is development again, so I'm poking here and there and hope they get it into a form where I don't have to worry about running it on my systems :-)

MrPippin66 commented 1 year ago

@pittagurneyi I'd suggest creating new issue/feature request for your comments above

pittagurneyi commented 1 year ago

I understand and will think about it, but in the end, the tracking of the PID is an essential part of the init/systemd handling of NCPA. I placed it here, because instead of only resolving the specific issue you encountered, it should probably be fixed by solving the underlying problem, which is, from my understanding, bad process management.

So, what do you think? Should this issue be renamed to solving the general issue or do you want me to create a new issue?

MrPippin66 commented 1 year ago

I already referenced that the correct way for systemd managed systems is use the underlying systemd infrastructure for managing daemons. Using "pid" files is not the proper way in that case.

MrPippin66 commented 1 year ago

I've not seen any updates on this, nor any idea if a solution is being pursued.

I would suggest the most expedient solution for systemd managed systems is to just add a tmpfiles configuration file to remove the NCPA pid files on boot.

Example:

/usr/lib/tmpfiles.d/ncpa.conf

# Remove NCPA PID files
r! /usr/local/ncpa/var/run/ncpa_listener.pid
r! /usr/local/ncpa/var/run/ncpa_passive.pid

This isn't a full solution, since you can still have problems if during runtime, the service goes down, and by the time you re-cycle it another process also has by chance used the same PID.