Closed ne-jmichaelson closed 4 months ago
A couple of things so far:
I was having doubts about requiring an external application, as NCPA ships with "everything you need" to monitor whatever NCPA gives you. This would break from that. On top of it, permissions need to be mucked with so that the nagios user can run smartctl. It's not so much that it needs permission to run smartctl, but it needs the setuid (which you did call out) so it can access the disks. But, at least on CentOS, the default permissions for smartctl give access to everyone to read and execute. Without changing that, we'd effectively be giving anyone on the system permission to become root for running that one application, which seems scary to me.
I don't want to dump on this idea. So a possible solution to all 3 issues here might be to use pySMART (insert "shop S-Mart" joke here). The project looks to be backed by the TrueNAS team, so likely it will continue to receive updates. Assuming it is not just going out and executing smartctl wherever it is found, it might give us an OS agnostic way to get SMART metrics. https://pypi.org/project/pySMART/
Tests on a physical host look good. A test VM didn't show any SMART data, but that's not too surprising. We had talked IRL regarding wanting to add a few more features to this:
Did you want to add those as part of this PR, or add those features in a following PR?
output = subprocess.check_output("ls -l /dev/disk/by-id/ | grep -v wwn | grep -v '\-part' | tr -s ' ' | sed 's/\.\.\///g'", shell=True, universal_newlines=True).split('\n')
This is a bit problematic, don't you think?
You'll have to find a better way to filter for physical disks, otherwise you'll catch device-mapper devices, like LUKS decrypted devices, in the list.
Instead of parsing /sys yourself, maybe look into a Python library that does it for you?
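If the listing stays hand-rolled for now, here is a rough sketch of the same filtering done in Python instead of a `shell=True` pipeline. The function names are made up for illustration. It also rests on an assumption I have not verified on every distro: that device-mapper nodes (LUKS, LVM) resolve to `/dev/dm-N`. The `wwn-` prefix check is slightly narrower than the original `grep -v wwn`.

```python
import os

BY_ID_DIR = "/dev/disk/by-id"  # same directory the shell pipeline lists


def is_physical_candidate(link_name, target):
    """Return True if a by-id entry looks like a whole physical disk.

    link_name: symlink name inside /dev/disk/by-id
    target:    resolved device path the symlink points to
    """
    if link_name.startswith("wwn-"):   # WWN aliases duplicate other entries
        return False
    if "-part" in link_name:           # partition entries, not whole disks
        return False
    # Assumption: device-mapper nodes (LUKS, LVM) resolve to /dev/dm-N.
    if os.path.basename(target).startswith("dm-"):
        return False
    return True


def list_physical_disks(by_id_dir=BY_ID_DIR):
    """Map by-id link names to resolved device paths, without a shell."""
    disks = {}
    for name in sorted(os.listdir(by_id_dir)):
        target = os.path.realpath(os.path.join(by_id_dir, name))
        if is_physical_candidate(name, target):
            disks[name] = target
    return disks
```

Keeping the predicate separate from the directory walk means the filter rules can be unit-tested without real disks.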
I just found https://github.com/truenas/py-SMART/blob/master/pySMART/smartctl.py#L191 which uses:
$ smartctl --scan-open
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device
Maybe implement that as backend instead?
However, what I didn't like was that they use --scan-open when --scan exists, as the former opens each device, thereby probably bringing it out of a sleep state - I haven't tested that.
From the man page of smartctl:
--scan Scans for devices and prints each device name, device type and protocol ([ATA] or [SCSI]) info. May be used in conjunction with '-d TYPE' to restrict the scan to a specific TYPE. See also info about platform specific device scan and the DEVICESCAN directive on smartd(8) man page.
--scan-open
Same as --scan, but also tries to open each device before printing device info. The device open may change the device type due to autodetection (see also '-d test'). This option can be used to create a draft smartd.conf file. All options after '--' are appended to each output line. For example: smartctl --scan-open -- -a -W 4,45,50 -m admin@work > smartd.conf
Multiple '-d TYPE' options may be specified with '--scan[-open]' to combine the scan results of more than one TYPE.
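A minimal sketch of what a `--scan`-based backend could look like. It assumes `--scan` prints lines in the same `<device> -d <type> # comment` shape as the `--scan-open` sample above, which the man page suggests but which I have not checked on every platform. `parse_scan_output` and `scan_devices` are hypothetical names, not anything in NCPA or pySMART.

```python
import subprocess


def parse_scan_output(text):
    """Parse smartctl scan lines into (device, type) tuples."""
    devices = []
    for line in text.splitlines():
        # Drop the trailing "# ..." comment, then split the remaining fields.
        fields = line.split("#", 1)[0].split()
        # Expected shape: <device> -d <type>; anything else is skipped.
        if len(fields) >= 3 and fields[1] == "-d":
            devices.append((fields[0], fields[2]))
    return devices


def scan_devices(smartctl="smartctl"):
    """Run `smartctl --scan` (which does not open the devices) and parse it."""
    out = subprocess.check_output([smartctl, "--scan"], universal_newlines=True)
    return parse_scan_output(out)
```

Passing the arguments as a list avoids `shell=True`, and the parser can be tested against captured output without smartctl installed.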
One caveat at the moment is that the smartctl executable has to be executable by the nagios user that NCPA is running as. This is most easily done by adding the nagios user to a group, giving that group execute permissions on /usr/sbin/smartctl, and setting the setuid bit on the executable.
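To guard against the "anyone can run it as root" case raised earlier in the thread, a small check along these lines could run at startup or in a test. This is only a sketch: the function names are made up, the default path is the one mentioned above, and it only inspects mode bits (it does not check who owns the file or which group has access).

```python
import os
import stat


def setuid_exposure(mode):
    """Return True if the mode has setuid set AND is executable by everyone."""
    is_setuid = bool(mode & stat.S_ISUID)
    world_exec = bool(mode & stat.S_IXOTH)
    return is_setuid and world_exec


def check_smartctl(path="/usr/sbin/smartctl"):
    """Check the smartctl binary on disk; the path is an assumption."""
    return setuid_exposure(os.stat(path).st_mode)
```

For example, mode `4755` would be flagged, while `4750` (setuid, group-only execute) would not.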