canonical / charm-openstack-service-checks

Collection of Nagios checks and other utilities that can be used to verify the operation of an OpenStack cluster

Add checks for baremetal node health for ironic #131

Open sudeephb opened 6 months ago

sudeephb commented 6 months ago

OpenStack Ironic "baremetal nodes" should be monitored for nodes in the "Maintenance=True" state, as well as for nodes with provisioning_state=failed or error.

For instance, all nodes should have a provisioning state that is one of the following:

- active
- available
- manageable (this should probably provoke a warning state, as the machine is not consumable by the cloud users)
- cleaning
- wait (such as clean wait, callback wait, etc.)

If the status is "error", "clean failed", or "manageable", we should set an alertable state.

Also, if Maintenance = True, the machine is not available for cloud user consumption, so it should also set an alertable state.

The command to query is "openstack baremetal node list". The checks should be added only if the output of "openstack endpoint list" includes a service with service_name=ironic or service_type=baremetal.
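As a rough sketch of the detection step, the following assumes the endpoint list has been fetched with `openstack endpoint list -f json` and that each entry carries "Service Name" and "Service Type" keys. The function name `has_ironic_endpoint` and the sample data are hypothetical, not part of the charm.

```python
import json


def has_ironic_endpoint(endpoints):
    """Return True if any endpoint belongs to the Ironic (baremetal) service.

    `endpoints` is assumed to be the parsed output of
    `openstack endpoint list -f json` (a list of dicts).
    """
    return any(
        ep.get("Service Name") == "ironic" or ep.get("Service Type") == "baremetal"
        for ep in endpoints
    )


# Hypothetical sample data, for illustration only.
sample_endpoints = json.loads(
    """
    [
      {"Service Name": "keystone", "Service Type": "identity"},
      {"Service Name": "ironic", "Service Type": "baremetal"}
    ]
    """
)

if has_ironic_endpoint(sample_endpoints):
    print("ironic detected: baremetal node checks should be installed")
```

Matching on either field means the checks would still be enabled if the service were registered under a different name but with the baremetal type.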

It might be nice to have two checks: one for maintenance mode, which can be silenced, and one that keeps alerting on baremetal nodes that go into 'error' or 'clean failed' for provisioning_state.
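A minimal sketch of how the two checks could be kept separate is shown below. It assumes the node list comes from `openstack baremetal node list -f json` with "Name", "UUID", "Provisioning State" and "Maintenance" keys. The function names, the choice of WARNING for maintenance, and the rule that any state ending in "failed" is critical are assumptions made for illustration, not the charm's implementation.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumption: "manageable" only warrants a warning, as suggested in the issue.
WARNING_STATES = {"manageable"}
CRITICAL_STATES = {"error"}


def provision_severity(state):
    """Map a provisioning state string to a Nagios severity."""
    # Assumption: every "... failed" state (e.g. "clean failed") is critical.
    if state in CRITICAL_STATES or state.endswith("failed"):
        return CRITICAL
    if state in WARNING_STATES:
        return WARNING
    return OK


def node_label(node):
    """Prefer the node name and fall back to the UUID for unnamed nodes."""
    return node.get("Name") or node.get("UUID")


def check_provisioning(nodes):
    """Return (worst severity, offending nodes) for provisioning states."""
    worst, offenders = OK, []
    for node in nodes:
        severity = provision_severity(node["Provisioning State"])
        if severity != OK:
            offenders.append((node_label(node), node["Provisioning State"]))
        worst = max(worst, severity)
    return worst, offenders


def check_maintenance(nodes):
    """Return (severity, nodes in maintenance); silenceable on its own."""
    in_maintenance = [node_label(n) for n in nodes if n.get("Maintenance")]
    return (WARNING if in_maintenance else OK), in_maintenance


# Hypothetical sample data, for illustration only.
sample_nodes = [
    {"Name": "node-1", "UUID": "u1", "Provisioning State": "active", "Maintenance": False},
    {"Name": "node-2", "UUID": "u2", "Provisioning State": "clean failed", "Maintenance": False},
    {"Name": None, "UUID": "u3", "Provisioning State": "manageable", "Maintenance": True},
]
```

Because each function returns its own severity, the maintenance check can be silenced in Nagios without hiding nodes that enter 'error' or 'clean failed'.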


Imported from Launchpad using lp2gh.