canonical / charm-nrpe

A subordinate charm used to configure nrpe (Nagios Remote Plugin Executor)
Apache License 2.0
1 stars 6 forks source link

check space generate only part of the checks definition. #71

Closed sudeephb closed 10 months ago

sudeephb commented 10 months ago

Having two nested children sections in lsblk output caused charm to skip several definitions.

Here is the json, with this json only 6 check generated.

{ "blockdevices": [ {"name": "loop0", "maj:min": "7:0", "rm": "0", "size": "9M", "ro": "1", "type": "loop", "mountpoint": "/snap/canonical-livepatch/146"}, {"name": "loop2", "maj:min": "7:2", "rm": "0", "size": "113.9M", "ro": "1", "type": "loop", "mountpoint": "/snap/core/13308"}, {"name": "loop3", "maj:min": "7:3", "rm": "0", "size": "9M", "ro": "1", "type": "loop", "mountpoint": "/snap/canonical-livepatch/138"}, {"name": "loop4", "maj:min": "7:4", "rm": "0", "size": "114M", "ro": "1", "type": "loop", "mountpoint": "/snap/core/13425"}, {"name": "sda", "maj:min": "8:0", "rm": "0", "size": "372.6G", "ro": "0", "type": "disk", "mountpoint": null, "children": [ {"name": "sda1", "maj:min": "8:1", "rm": "0", "size": "572M", "ro": "0", "type": "part", "mountpoint": "/boot/efi"}, {"name": "sda2", "maj:min": "8:2", "rm": "0", "size": "18.6G", "ro": "0", "type": "part", "mountpoint": "/"}, {"name": "sda3", "maj:min": "8:3", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/home"}, {"name": "sda4", "maj:min": "8:4", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/tmp"}, {"name": "sda5", "maj:min": "8:5", "rm": "0", "size": "37.3G", "ro": "0", "type": "part", "mountpoint": "/var"}, {"name": "sda6", "maj:min": "8:6", "rm": "0", "size": "93.1G", "ro": "0", "type": "part", "mountpoint": "/var/log"}, {"name": "sda7", "maj:min": "8:7", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/var/tmp"}, {"name": "sda8", "maj:min": "8:8", "rm": "0", "size": "14G", "ro": "0", "type": "part", "mountpoint": "/var/log/audit"}, {"name": "sda9", "maj:min": "8:9", "rm": "0", "size": "186.3G", "ro": "0", "type": "part", "mountpoint": null, "children": [ {"name": "bcache0", "maj:min": "252:0", "rm": "0", "size": "7.3T", "ro": "0", "type": "disk", "mountpoint": "/var/lib"} ] } ] }, {"name": "sdb", "maj:min": "8:16", "rm": "0", "size": "7.3T", "ro": "0", "type": "disk", "mountpoint": null, "children": [ {"name": "bcache0", "maj:min": "252:0", "rm": "0", "size": "7.3T", "ro": "0", "type": "disk", "mountpoint": "/var/lib"} ] } ] }


Imported from Launchpad using lp2gh.

sudeephb commented 10 months ago

(by nkoltsov) Second children section is not handled correctly so None is returned as mountpoint which cause executing replace agains None object in https://git.launchpad.net/charm-nrpe/tree/hooks/nrpe_helpers.py#n570

sudeephb commented 10 months ago

(by andreashamacher) likely related to 1979860

sudeephb commented 10 months ago

(by aieri) hi Nikita, I have tried running your json lsblk output through the get_partitions_to_check() code and I cannot reproduce the issue you describe. .get("children", []) is run for every blockdevice, and I thus get 9 mountpoints by line 842 (although "/boot/efi" is subsequently skipped):

In [9]: partitions_to_check
Out[9]: 
{'/',
 '/boot/efi',
 '/home',
 '/tmp',
 '/var',
 '/var/lib',
 '/var/log',
 '/var/log/audit',
 '/var/tmp'}

I suggest running the config-changed hook through debug-hooks or debug-code to identify where the failure occurs in your environment.

Please mark the bug back to new once you have more information.

Also jfyi this upcoming merge proposal will slightly alter the listing of mountpoints that need to be checked: https://code.launchpad.net/~nikolay.vinogradov/charm-nrpe/+git/charm-nrpe/+merge/438754

sudeephb commented 10 months ago

(by aieri) Also please ensure you are running the latest stable charm (rev 97 as of this writing) in case this bug has already been fixed.

sudeephb commented 10 months ago

(by vultaire) I may be hitting the same issue here. I'm on revision 96 (deliberately for the sake of alignment with another cloud), but I've pulled the deployed sources for rev 96 and compared to a downloaded rev 97 nrpe charm and the sources appear the same.

"lsblk -J" output is as follows:

lsblk -J

{ "blockdevices": [ {"name": "loop0", "maj:min": "7:0", "rm": "0", "size": "55.6M", "ro": "1", "type": "loop", "mountpoint": "/snap/core18/2714"}, {"name": "loop1", "maj:min": "7:1", "rm": "0", "size": "6M", "ro": "1", "type": "loop", "mountpoint": "/snap/prometheus-grok-exporter/1"}, {"name": "loop2", "maj:min": "7:2", "rm": "0", "size": "116.8M", "ro": "1", "type": "loop", "mountpoint": "/snap/core/14784"}, {"name": "loop3", "maj:min": "7:3", "rm": "0", "size": "21.8M", "ro": "1", "type": "loop", "mountpoint": "/snap/prometheus-libvirt-exporter/36"}, {"name": "loop4", "maj:min": "7:4", "rm": "0", "size": "9M", "ro": "1", "type": "loop", "mountpoint": "/snap/canonical-livepatch/164"}, {"name": "loop5", "maj:min": "7:5", "rm": "0", "size": "116.8M", "ro": "1", "type": "loop", "mountpoint": "/snap/core/14946"}, {"name": "loop6", "maj:min": "7:6", "rm": "0", "size": "55.6M", "ro": "1", "type": "loop", "mountpoint": "/snap/core18/2721"}, {"name": "sda", "maj:min": "8:0", "rm": "0", "size": "894.2G", "ro": "0", "type": "disk", "mountpoint": null, "children": [ {"name": "crypt-", "maj:min": "253:0", "rm": "0", "size": "894.2G", "ro": "0", "type": "crypt", "mountpoint": "/var/lib/nova/instances"} ] }, {"name": "sdb", "maj:min": "8:16", "rm": "0", "size": "447.1G", "ro": "0", "type": "disk", "mountpoint": null, "children": [ {"name": "sdb1", "maj:min": "8:17", "rm": "0", "size": "572M", "ro": "0", "type": "part", "mountpoint": "/boot/efi"}, {"name": "sdb2", "maj:min": "8:18", "rm": "0", "size": "18.6G", "ro": "0", "type": "part", "mountpoint": "/"}, {"name": "sdb3", "maj:min": "8:19", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/home"}, {"name": "sdb4", "maj:min": "8:20", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/tmp"}, {"name": "sdb5", "maj:min": "8:21", "rm": "0", "size": "223.5G", "ro": "0", "type": "part", "mountpoint": "/var"}, {"name": "sdb6", "maj:min": "8:22", "rm": "0", "size": "93.1G", "ro": "0", "type": "part", "mountpoint": "/var/log"}, {"name": "sdb7", "maj:min": "8:23", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/var/tmp"}, {"name": "sdb8", "maj:min": "8:24", "rm": "0", "size": "93.1G", "ro": "0", "type": "part", "mountpoint": "/var/log/audit"} ] } ] }

Upon upgrading to this version of the nrpe charm, I have 4 check_space alerts in nagios which return an UNKNOWN status with status message "NRPE: Command 'checkspace$(path)' not defined.

space_check is set to the default, which is:

default: |-
  check: auto
  auto_params: -w 25% -c 20% -K 5%

With that, I get these checks coming through OK:

check_space_home (/home) check_space_root (/) check_space_tmp (/tmp) check_space_var (/var)

While these checks appear to not be defined properly:

check_space_var_lib_nova_instances check_space_var_log check_space_var_log_audit check_space_var_tmp

sudeephb commented 10 months ago

(by vultaire) Hrm... I'm not sure this exactly matches the description after re-reading the original bug. But maybe it's pointing at the same root cause? I'm not sure.

sudeephb commented 10 months ago

(by vultaire) Noting the above - all my missing checks are sub-checks of an existing parent. And to be clear - the checks are generated and shared via relation data - and I've confirmed that get_partitions_to_check would be returning the right set of values - but the files on disk for the subdirs are either not being created or are being cleaned up by mistake, perhaps after being created previously.

sudeephb commented 10 months ago

(by vultaire) I'm going to go out on a limb and suggest the problem is caused by the behavior of render_nrped_files in hooks/nrpe_utils.py. What it does is, prior to creating files for a check, it removes all "matching files" the check may have had.

That list of matching files is created in hooks/nrpe_helpers.py in SubordinateCheckDefinitions.init(). For each check created, a set of "matching files" is created using these patterns:

This means that, for the check_space checks, a check on /var (i.e. check_space_var), would match on files /etc/nagios/nrpe.d/check_space_var.cfg as well as the glob /etc/nagios/nrpe.d/check_spacevar*.cfg, which would also match the files used by the subdir checks.

At that point, it basically becomes a quirk of ordering. If we want this type of cleanup behavior, then we need to ensure that the checks are properly ordered so that if one check deletes the files from other checks, the other checks will then re-create them afterwards.

I believe this is where the bug is happening: get_partitions_to_check() is returning a set, which is inherently unordered. In order for this to work reliably, this needs to be reworked to return a list, with subordinate checks listed after the parent checks. ...If I'm not mistaken, this may be as simple as just wrapping the result of that function in list(sorted()).

sudeephb commented 10 months ago

(by aieri) Thanks, this is enough for us to set up a reproducer and a regression test case. FTR, I think relying on ordering would be quite fragile, I'd rather look into making the removal of old checks more precise, possibly by altering the string substitution that causes character "_" to have multiple meanings.

sudeephb commented 10 months ago

(by vultaire) I'm all in favor for that; I agree with your sentiment.

sudeephb commented 10 months ago

(by aieri) One thing of note is that with the release of the grafana-agent machine charm to the edge channel (which exposes filesystem metrics via node-exporter by default), development work should concentrate on adding suitable alerts to our metrics-based solution. I'll keep this bug as medium priority within the context of the nrpe charm, although it should rather be considered a low in the wider scope of charmed monitoring solutions

sudeephb commented 10 months ago

(by aieri) I've filed https://github.com/canonical/grafana-agent-operator/issues/72 as a related issue as the grafana-agent charm does not currently ship an alert rule that covers space checks.

sudeephb commented 10 months ago

(by przemeklal) I'm on revision 97, also affected by this bug.

sudeephb commented 10 months ago

(by marcusboden) To anyone who needs a quick fix immediately (can't wait for an upgrade, or can't upgrade), the following will fix it:

juju run -a nrpe -- sed -i 's/services.manager.reconfigure_services("nrpe-config")/services.get_manager().reconfigure_services("nrpe-config")/' actions/ack-reboot*

sudeephb commented 10 months ago

(by peterctl) I believe the above message should be moved to bug LP#2008046 https://bugs.launchpad.net/charm-nrpe/+bug/2008046

sudeephb commented 10 months ago

(by aieri) This fix is now available in revision 110 (channel edge for now)

sudeephb commented 10 months ago

(by nkoltsov) This bug is not fixed actually, in the discussion we located another issue and fixed it. However nested children sections are still not handled correctly

I noticed this bug in the code, but yes it was not the root cause in the reported case, but it is a valid issue. Here is the link to the branch with unit test which could be used for reproduce the issue. https://code.launchpad.net/~nkoltsov/charm-nrpe/+git/charm-nrpe/+ref/lp1986654

And output was captured from the real customer environment

sudeephb commented 10 months ago

(by aieri) Available in edge (rev 111)