canonical / charm-openstack-service-checks

Collection of Nagios checks and other utilities that can be used to verify the operation of an OpenStack cluster

CRITICAL alert when an instance is shut down and its port is in DOWN status #179

Closed jrodrigu-canonical closed 2 weeks ago

jrodrigu-canonical commented 1 month ago

Similarly to LP#2021509, when an instance is shut down, its allocated port in OVN goes DOWN, and check_ports.cfg (/usr/local/lib/nagios/plugins/check_resources.py port --all) raises a CRITICAL alert. An instance that is shut down with its allocated port DOWN is common in day-to-day operations and should not be treated as CRITICAL. A WARNING alert would probably be more appropriate.

This issue differs from the case fixed for LP#2021509, because that fix only skips a port when port.binding_vif_type is "unbound":

+            if port.status == "DOWN" and port.binding_vif_type == "unbound":
+                skip_ids.append(port.id)

In the situation described here, however, binding_vif_type is always set to "ovs", so the port is not skipped and triggers the CRITICAL alert.
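To illustrate the gap, here is a minimal sketch of the skip condition quoted above applied to three sample ports. The Port dataclass, the ports_to_skip helper, and the port IDs are hypothetical stand-ins for illustration, not code from check_resources.py; only the condition itself comes from the LP#2021509 fix.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """Hypothetical stand-in for the port objects the check iterates over."""
    id: str
    status: str
    binding_vif_type: str


def ports_to_skip(ports):
    """Return the IDs of ports matched by the LP#2021509 skip condition."""
    skip_ids = []
    for port in ports:
        # Same condition as the quoted diff: only unbound DOWN ports are skipped.
        if port.status == "DOWN" and port.binding_vif_type == "unbound":
            skip_ids.append(port.id)
    return skip_ids


ports = [
    Port("port-unbound", "DOWN", "unbound"),   # covered by the earlier fix
    Port("port-stopped-vm", "DOWN", "ovs"),    # the case reported in this issue
    Port("port-active", "ACTIVE", "ovs"),      # healthy port, never skipped
]

skipped = ports_to_skip(ports)
print(skipped)
```

The DOWN port with binding_vif_type "ovs" is not in the skip list, so it is still counted by the check and reported as CRITICAL.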

pponnuvel commented 1 month ago

How do we differentiate between (1) "an instance was shut down gracefully" and (2) "an instance was supposed to be up but died"? If that's not possible, it's probably better to keep this as is.

(For the latter case, it probably makes sense to keep it as CRITICAL anyway.)

jrodrigu-canonical commented 4 weeks ago

Hi Pon, as discussed in MM, maybe we could retrieve the status of the VM that the port is attached to (e.g. openstack server show <id>)? One of the status fields should indicate why the VM is down (OS-EXT-STS:power_state, OS-EXT-STS:task_state, OS-EXT-STS:vm_state, ...).
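A rough sketch of how that lookup could feed the alert level is shown below. It is only an illustration under several assumptions: the classify_down_port helper and the sample objects are hypothetical, the attribute names (device_id, status, vm_state, task_state) mirror what the Neutron and Nova APIs expose, and the rule "SHUTOFF + stopped + no task in progress means WARNING" is a guess. Whether these fields really separate a deliberate shutdown from a crash is exactly the open question above.

```python
from types import SimpleNamespace

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_down_port(port, servers_by_id):
    """Map a port to a Nagios level using the state of the attached server.

    `servers_by_id` is a dict of server ID -> server object (hypothetical
    input; a real check would have to fetch the servers from Nova first).
    """
    if port.status != "DOWN":
        return OK
    server = servers_by_id.get(port.device_id)
    if server is None:
        # No matching instance found: keep the current behaviour.
        return CRITICAL
    # Assumed rule: a cleanly stopped instance only warrants a WARNING.
    if (
        server.status == "SHUTOFF"
        and server.vm_state == "stopped"
        and server.task_state is None
    ):
        return WARNING
    return CRITICAL


servers = {
    "vm-1": SimpleNamespace(status="SHUTOFF", vm_state="stopped", task_state=None),
    "vm-2": SimpleNamespace(status="ERROR", vm_state="error", task_state=None),
}
port_stopped = SimpleNamespace(id="p1", status="DOWN", device_id="vm-1")
port_error = SimpleNamespace(id="p2", status="DOWN", device_id="vm-2")
port_orphan = SimpleNamespace(id="p3", status="DOWN", device_id="vm-missing")

levels = [classify_down_port(p, servers) for p in (port_stopped, port_error, port_orphan)]
print(levels)
```

Anything that does not match the assumed "cleanly stopped" pattern stays CRITICAL, so the sketch never reports less than the existing check except in the one case this issue is about.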