Openstack Integrator should have nrpe checks that monitor status of openstack components supporting k8s workloads

canonical / charm-openstack-service-checks

Collection of Nagios checks and other utilities that can be used to verify the operation of an OpenStack cluster


Openstack Integrator should have nrpe checks that monitor status of openstack components supporting k8s workloads #50

Closed sudeephb closed 9 months ago

sudeephb commented 9 months ago

As mentioned in lp#1853668, issues on the backend of the openstack underlay can cause odd or failing service access for kubernetes workloads.

The openstack integrator charm should have monitoring hooks added for nrpe-external-master that provide checks of loadbalancers, FIPs, networks, and anything else that is managed by the openstack-integrator on behalf of kubernetes, so that any openstack components configured by the integrator are monitored and their status is reported via nagios.

For instance, if there is a loadbalancer running in support of a service endpoint, its status and the statuses of its loadbalancer pool members should be monitored and reported up to kubernetes and/or nagios in some way that can be exposed to operators of multi-tiered clouds.
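A minimal sketch of what such a loadbalancer check could evaluate, not taken from any existing charm. It assumes the loadbalancer has already been fetched and flattened into a plain dict; the nested `pools`/`members` layout and the function name `find_lb_problems` are hypothetical. The `provisioning_status` and `operating_status` field names follow the Octavia API, and treating `NO_MONITOR` as healthy is an assumption.

```python
# Operating statuses treated as healthy in this sketch (assumption:
# NO_MONITOR means no health monitor is configured, not a failure).
HEALTHY_OPERATING = {"ONLINE", "NO_MONITOR"}


def find_lb_problems(loadbalancer):
    """Return a list of human-readable problems for one loadbalancer dict.

    The dict layout (name, pools -> members) is hypothetical and only
    serves this illustration.
    """
    problems = []
    lb_name = loadbalancer.get("name", "<unnamed>")

    # The loadbalancer itself must be fully provisioned and serving traffic.
    if loadbalancer.get("provisioning_status") != "ACTIVE":
        problems.append(
            "loadbalancer {} provisioning_status is {}".format(
                lb_name, loadbalancer.get("provisioning_status")
            )
        )
    if loadbalancer.get("operating_status") not in HEALTHY_OPERATING:
        problems.append(
            "loadbalancer {} operating_status is {}".format(
                lb_name, loadbalancer.get("operating_status")
            )
        )

    # Every pool member is checked as well, as the issue requests.
    for pool in loadbalancer.get("pools", []):
        for member in pool.get("members", []):
            status = member.get("operating_status")
            if status not in HEALTHY_OPERATING:
                problems.append(
                    "member {} in pool {} is {}".format(
                        member.get("name", "<unnamed>"),
                        pool.get("name", "<unnamed>"),
                        status,
                    )
                )
    return problems
```

An empty list would map to a nagios OK, and any entry to an alert whose severity is left to the operator.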

Imported from Launchpad using lp2gh.

date created: 2022-06-20T09:12:30Z
owner: rgildein
assignee: rgildein
the launchpad url

sudeephb commented 9 months ago

(by rgildein) I think the best approach to fix this bug would be to add layer:nagios and create python scripts in the templates folder. Each py file will contain one of the checks.

Here is my idea of what the nrpe check for all OpenStack networks should look like. When the nrpe-external-master.available flag is set, the check_openstack_networks.py file will be installed as a nagios plugin. This file checks all OpenStack networks to see if they are in the ACTIVE state. If a network is in the DOWN state, the check raises a warning, and if another problem occurs (problem with parsing networks from OpenStack output, etc.), it raises a critical error.
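The status mapping described above could be sketched roughly as follows. This is an illustration only, not the contents of check_openstack_networks.py: the function name `evaluate_networks` and the input shape (a list of dicts with `name` and `status` keys) are assumptions, and treating any status other than ACTIVE or DOWN as critical is one reading of "another problem occurs".

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_networks(networks):
    """Map a list of network dicts to a (exit_code, message) pair.

    `networks` is assumed to look like [{"name": "net1", "status": "ACTIVE"}];
    how the list is obtained from OpenStack is out of scope here.
    """
    try:
        down = [n["name"] for n in networks if n["status"] == "DOWN"]
        unexpected = [
            n["name"] for n in networks if n["status"] not in ("ACTIVE", "DOWN")
        ]
    except (KeyError, TypeError) as error:
        # Parsing problems are reported as critical, as proposed above.
        return CRITICAL, "CRITICAL: could not parse networks: {!r}".format(error)

    if unexpected:
        # Assumption: statuses other than ACTIVE/DOWN count as "another problem".
        return CRITICAL, "CRITICAL: unexpected status for: {}".format(
            ", ".join(unexpected)
        )
    if down:
        return WARNING, "WARNING: networks DOWN: {}".format(", ".join(down))
    return OK, "OK: {} network(s) ACTIVE".format(len(networks))
```

A plugin wrapper would print the message and call `sys.exit()` with the returned code so that nagios picks up the state.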

After verifying the correctness of my approach, I will provide more information about other checks.

sudeephb commented 9 months ago

(by rgildein) WIP PR at https://github.com/juju-solutions/charm-openstack-integrator/pull/43

In my design, I changed only one thing, and that was creating individual py files for checks to reuse functions.

sudeephb commented 9 months ago

(by aluria) The approach described in #1 has been slightly modified. When a Neutron port reports "DOWN", the nagios alert raised is CRITICAL, not warning.

The PR mentioned in #2 is ready for review. As mentioned in my last comment [1], I think the use of python-openstackclient, installed via layer.yaml, will need to be reviewed. The nrpe script(s) are able to use native python libs, but that would break the approach taken until now (use a snap from the snapstore, or one deployed via Juju resources in case no Internet access exists).

https://github.com/juju-solutions/charm-openstack-integrator/pull/43#issuecomment-778240754

sudeephb commented 9 months ago

(by aluria) Even if the previous PR is approved, the fix for this bug would not be complete, because a LB nrpe check is still missing (WIP).

sudeephb commented 9 months ago

(by rgildein) After the discussion, we decided that NRPE checks should not be a part of charm-openstack-integrator. Instead, charm-openstack-service-checks should be used. Since it already has a check for Octavia LB, I will add only a check for OpenStack resources.