Closed sudeephb closed 9 months ago
(by rgildein)
I think the best approach to fix this bug would be to add layer:nagios and create Python scripts in the templates folder, with each .py file containing one of the checks.
Here is my idea of what the NRPE check for all OpenStack networks should look like.
When the nrpe-external-master.available flag exists, the check_openstack_networks.py file will be installed as a Nagios plugin. This file checks all OpenStack networks to see if they are in the ACTIVE state. If a network is in the DOWN state, it raises a warning, and if another problem occurs (a problem parsing the networks from the OpenStack output, etc.), it raises a critical error.
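The check logic described above could be sketched roughly as follows. This is only an illustration, not the code from the PR: the function name check_networks, the down_severity parameter, and the JSON input shape (a list of objects with "Name" and "Status" keys) are assumptions, as is treating any state other than ACTIVE or DOWN as critical. The exit codes 0, 1 and 2 are the standard Nagios plugin codes for OK, WARNING and CRITICAL.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_networks(raw_json, down_severity=WARNING):
    """Return (exit_code, message) for a JSON list of networks.

    Assumed (hypothetical) input shape:
    [{"Name": "net1", "Status": "ACTIVE"}, ...]
    """
    try:
        networks = json.loads(raw_json)
        states = {net["Name"]: net["Status"] for net in networks}
    except (ValueError, KeyError, TypeError) as error:
        # Any parsing problem is reported as CRITICAL.
        return CRITICAL, "CRITICAL: cannot parse networks ({})".format(error)

    down = sorted(str(n) for n, s in states.items() if s == "DOWN")
    # Assumption: states other than ACTIVE/DOWN count as "another problem".
    unexpected = sorted(
        str(n) for n, s in states.items() if s not in ("ACTIVE", "DOWN")
    )
    if unexpected:
        return CRITICAL, "CRITICAL: networks in unexpected state: {}".format(
            ", ".join(unexpected)
        )
    if down:
        return down_severity, "{}: networks DOWN: {}".format(
            LABELS[down_severity], ", ".join(down)
        )
    return OK, "OK: all {} networks are ACTIVE".format(len(states))
```

A real plugin would print the returned message and call sys.exit() with the returned code, so that NRPE can pick up both.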
After verifying the correctness of my approach, I will provide more information about other checks.
(by rgildein) WIP PR at https://github.com/juju-solutions/charm-openstack-integrator/pull/43
In my design, I changed only one thing, and that was creating individual py files for checks to reuse functions.
(by aluria) The approach described in #1 has been slightly modified. When a Neutron port reports "DOWN", the nagios alert raised is CRITICAL, not warning.
The PR mentioned in #2 is ready for review. As mentioned in my last comment [1], I think the use of python-openstackclient, installed via layer.yaml, will need to be reviewed. The nrpe script(s) are able to use native Python libs, but that would break the approach taken until now (using a snap from the Snap Store, or one deployed via Juju resources in case no Internet access exists).
As mentioned in lp#1853668, issues on the backend of the OpenStack underlay can cause odd or failing service access for Kubernetes workloads.
The openstack-integrator charm should have monitoring hooks added for nrpe-external-master that provide checks of loadbalancers/FIPs/networks/etc., that is, anything managed by the openstack-integrator on behalf of Kubernetes, to ensure that any OpenStack components configured by the integrator are monitored and their status is reported via Nagios.
For instance, if there is a loadbalancer running in support of a service endpoint, its status and the statuses of its pool members should be monitored and reported up to Kubernetes and/or Nagios in some way that can be exposed to operators of multi-tiered clouds.
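As a rough illustration of the loadbalancer example, a check could combine the loadbalancer's own status with those of its pool members into a single Nagios result. Everything here is hypothetical: the function name check_loadbalancer, the dict layout, the "ONLINE" healthy value, and the policy of mapping an unhealthy loadbalancer to CRITICAL and unhealthy members to WARNING are assumptions made for the sketch, not something the charm defines.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_loadbalancer(lb, healthy="ONLINE"):
    """Return (exit_code, message) for one loadbalancer description.

    `lb` is a hypothetical dict such as
    {"name": "lb1", "status": "ONLINE",
     "members": [{"name": "m1", "status": "ONLINE"}]}.
    """
    name = lb.get("name", "<unnamed>")
    if lb.get("status") != healthy:
        # Assumed policy: an unhealthy loadbalancer is CRITICAL.
        code = CRITICAL
        detail = "loadbalancer {} is {}".format(name, lb.get("status"))
    else:
        bad = sorted(
            "{}={}".format(m.get("name"), m.get("status"))
            for m in lb.get("members", [])
            if m.get("status") != healthy
        )
        if bad:
            # Assumed policy: unhealthy members alone are a WARNING.
            code = WARNING
            detail = "loadbalancer {} has unhealthy members: {}".format(
                name, ", ".join(bad)
            )
        else:
            code = OK
            detail = "loadbalancer {} and all its members are {}".format(
                name, healthy
            )
    return code, "{}: {}".format(LABELS[code], detail)
```

How the loadbalancer data is collected (OpenStack CLI, native Python libraries, or something else) is left open here, since that choice is still under discussion in the comments.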
Imported from Launchpad using lp2gh.
date created: 2022-06-20T09:12:30Z
owner: rgildein
assignee: rgildein