canonical / charm-openstack-service-checks

Collection of Nagios checks and other utilities that can be used to verify the operation of an OpenStack cluster

removing the unit does not unregister checks from nagios #88

Closed: sudeephb closed 7 months ago

sudeephb commented 7 months ago

Not much to say: removing an openstack-service-checks unit does not clean up the NRPE checks it deployed on the Nagios unit.


Imported from Launchpad using lp2gh.

sudeephb commented 7 months ago

(by aluria) This behavior is common in the OpenStack charms, too. The expectation is that removing this charm would also remove the machine (container?) where it is running.

What would be the expected behavior? Remove the checks deployed by this charm but leave the ones deployed by the nrpe-charm? Should deploying checks remove the old ones? (Having .cfg files in /etc/nagios/nrpe.d doesn't mean Nagios is using the commands defined in them.)

sudeephb commented 7 months ago

(by aieri) For comparison, while developing the hw-health-charm I tried to make all actions reversible: install/uninstall tools, add/remove nrpe checks. But yeah, I can see the assumption that 1 charm == 1 instance: removing the charm equals throwing everything away, so we don't necessarily have to care.

The problem here, however, is that removing (for example) openstack-service-checks/0 leaves openstack-service-checks-0-* checks in Nagios itself (which will fail from then on). Redeploying as openstack-service-checks/1 (presumably) registers new checks in Nagios, but of course does not remove the old ones.
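To make the stale-check problem concrete, here is a minimal Python sketch that picks out checks belonging to units that no longer exist. It assumes check names follow the `<application>-<unit number>-<check>` pattern implied by `openstack-service-checks-0-*` above; the function name, its parameters, and the sample check names are hypothetical and not part of the charm.

```python
import re


def stale_checks(check_names, app, live_units):
    """Return the check names that belong to units of `app` which are gone.

    check_names: iterable of check names registered in Nagios (hypothetical input)
    app:         application name, e.g. "openstack-service-checks"
    live_units:  set of unit numbers that still exist, e.g. {1}
    """
    # Assumed naming scheme: "<app>-<unit number>-<check name>"
    pattern = re.compile(rf"^{re.escape(app)}-(\d+)-")
    stale = []
    for name in check_names:
        match = pattern.match(name)
        if match and int(match.group(1)) not in live_units:
            stale.append(name)
    return stale


if __name__ == "__main__":
    names = [
        "openstack-service-checks-0-check_example",  # left over from unit /0
        "openstack-service-checks-1-check_example",  # registered by unit /1
        "nrpe-0-check_load",                         # different application
    ]
    print(stale_checks(names, "openstack-service-checks", {1}))
```

Checks from other applications are ignored, so a cleanup built on this would not touch the ones deployed by the nrpe-charm.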

sudeephb commented 7 months ago

(by aluria) Hello Andrea. I have tested this by running the following:

    juju deploy cs:openstack-service-checks
    juju add-relation openstack-service-checks nrpe
    juju add-relation openstack-service-checks keystone
    # assumption is that nrpe<->nagios:monitors relation already exists, as well as a running OS env

I then ran:

    juju deploy cs:ubuntu --to openstack-service-checks/0
    # wait until ubuntu/0 is active/idle
    juju remove-unit openstack-service-checks/0

I saw that the nrpe checks got removed from nagios, but ubuntu/0 still had all the nrpe configurations at /etc/nagios/nrpe.d/*.cfg.

Although not optimal, the nrpe server on ubuntu/0 only provides "check_*" commands but doesn't configure them in nagios.

I think the expectation from charm-openstack-service-checks would be to: 1) remove all files managed by the charm, via the nrpe-external-master relation; 2) reload the nrpe server to unregister the "check_*" commands (so that if someone (nagios) tried to check them, check_nrpe would return "undefined").
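The two steps above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the charm's actual code: it assumes each check lives in a file named `<check name>.cfg` under /etc/nagios/nrpe.d, and that the NRPE daemon runs as the `nagios-nrpe-server` systemd unit. The function names and the `run` parameter are hypothetical.

```python
import subprocess
from pathlib import Path

NRPE_CONF_DIR = Path("/etc/nagios/nrpe.d")  # directory mentioned in this thread


def remove_charm_checks(check_names, conf_dir=NRPE_CONF_DIR):
    """Step 1: delete the .cfg files managed by the charm; return removed paths."""
    removed = []
    for name in check_names:
        cfg = Path(conf_dir) / f"{name}.cfg"  # assumed file naming
        if cfg.is_file():
            cfg.unlink()
            removed.append(cfg)
    return removed


def restart_nrpe(run=subprocess.check_call):
    """Step 2: restart NRPE so the removed check_* commands are unregistered."""
    # Assumption: the service is called nagios-nrpe-server on the unit.
    run(["systemctl", "restart", "nagios-nrpe-server"])


def cleanup(check_names, conf_dir=NRPE_CONF_DIR, run=subprocess.check_call):
    """Remove the charm's check files and restart NRPE only if something changed."""
    removed = remove_charm_checks(check_names, conf_dir)
    if removed:
        restart_nrpe(run)
    return removed
```

Passing an explicit list of check names, rather than deleting everything in the directory, leaves the generic checks owned by the nrpe-charm untouched. In a charm this would typically be called from the hook that fires when the unit or the relation is removed.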

I think this bug also affects the nrpe-charm, which should likewise take care of the generic configured checks (check_load, check_procs, check_disk, etc.).

sudeephb commented 7 months ago

(by xavpaice) Using charm rev 30, I removed a unit and it did remove the checks.