canonical / charm-openstack-service-checks

Collection of Nagios checks and other utilities that can be used to verify the operation of an OpenStack cluster

Opinion: o-s-c update defaults for nova service checks #101

Closed sudeephb closed 7 months ago

sudeephb commented 7 months ago

I'm growing more and more wary about the defaults for the nova service checks in o-s-c.

Checking for number of units in a host aggregate (nova_warn and nova_crit options) strikes me as non-actionable. The number of hosts in an aggregate is up to the customer, and there are valid use cases for having 0 hosts in an aggregate, e.g. using a hostagg as a spare pool of machines.

Similarly, having hosts disabled in the compute service list is a valid use case, e.g. for having hosts in hardware maintenance, and alerting about this is of little value.

Therefore I'd propose to change the defaults:

nova_warn, nova_crit = -1  # disable
skip-disabled = True  # skip the disabled compute check by default
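
To illustrate why -1 would act as "disabled", here is a minimal sketch. It assumes the check alerts when the number of hosts up in an aggregate is at or below the threshold; that comparison, the `Status` enum and the `aggregate_status` helper are hypothetical names for illustration, not the charm's actual code.

```python
from enum import IntEnum


class Status(IntEnum):
    """Nagios-style exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def aggregate_status(hosts_up, warn, crit):
    """Hypothetical threshold check for one host aggregate.

    Assumption: an alert fires when hosts_up <= threshold. A host count
    is never negative, so a threshold of -1 can never match.
    """
    if hosts_up <= crit:
        return Status.CRITICAL
    if hosts_up <= warn:
        return Status.WARNING
    return Status.OK


# With both thresholds at -1, an empty aggregate (a spare pool) stays OK.
spare_pool = aggregate_status(hosts_up=0, warn=-1, crit=-1)
```

Under that assumption, an aggregate with 0 hosts no longer alerts once both thresholds are -1, which is the behaviour proposed above.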


Imported from Launchpad using lp2gh.

sudeephb commented 7 months ago

(by aluria) Hi! There is another Juju config parameter (skipped_host_aggregates) which is empty by default but can hold a comma-separated list of the host aggregates you don't want to monitor against the nova_warn and nova_crit thresholds.
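
As a rough sketch of how a comma-separated skip list like this could be applied (the helper names `parse_skipped` and `aggregates_to_check` are made up for illustration, and the charm's real parsing may differ, e.g. in case sensitivity):

```python
def parse_skipped(raw):
    """Split a comma-separated option value into a set of names.

    Surrounding whitespace and empty entries are dropped, so the
    empty default yields an empty set.
    """
    return {name.strip() for name in raw.split(",") if name.strip()}


def aggregates_to_check(all_aggregates, raw_skipped):
    """Return the aggregates that still get threshold checks, in order."""
    skipped = parse_skipped(raw_skipped)
    return [agg for agg in all_aggregates if agg not in skipped]


# Example: exclude a spare pool from monitoring.
checked = aggregates_to_check(["prod", "spare", "gpu"], "spare, ")
```

Here `checked` keeps `prod` and `gpu`, while an empty option value leaves every aggregate monitored.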

The rationale behind monitoring host aggregates is to avoid running out of resources due to hardware issues. Setting -1 by default would essentially make those options ignored. When a managed service is not in sync with customer operations, it makes sense to disable them: as you said, there is no action for the undercloud operators to take, and false positives may occur because HA overcloud services may not be implemented.

Similarly, skip-disabled (a warning is triggered when the flag is off) was implemented so that hardware left out of service for too long does not go unnoticed. Making True the default would be much the same as not having the option implemented at all. Alternative means of tracking the list of out-of-service nodes may be a better option, though. In this case, I agree the skip-disabled default value should be True.
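
A minimal sketch of the flag's effect, assuming service entries shaped like rows of the compute service list (`host` and `status` fields); the `disabled_hosts_warning` function is hypothetical and not taken from the charm:

```python
def disabled_hosts_warning(services, skip_disabled):
    """Return a warning message for disabled compute hosts, or None.

    Assumption: each entry is a dict with "host" and "status" keys,
    where status is "enabled" or "disabled".
    """
    if skip_disabled:
        # Flag on: disabled hosts are ignored entirely.
        return None
    disabled = sorted(s["host"] for s in services if s["status"] == "disabled")
    if not disabled:
        return None
    return "WARNING: disabled compute hosts: " + ", ".join(disabled)


services = [
    {"host": "compute-1", "status": "enabled"},
    {"host": "compute-2", "status": "disabled"},
]
message_when_checked = disabled_hosts_warning(services, skip_disabled=False)
message_when_skipped = disabled_hosts_warning(services, skip_disabled=True)
```

With the flag off, a host in maintenance produces a warning on every run; with it on, the same host is silently ignored.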

sudeephb commented 7 months ago

(by peter-sabaini) Hey Alvaro,

I get why those checks are there, it's just that I'm finding them less useful than I had hoped :-)

Wrt monitoring the hostagg node count, IMHO one of the problems is also that resource tracking at this level is rather coarse, and Grafana/Prometheus do a better job of gauging capacity.

Wrt the skip-disabled warning, I feel that since it's non-actionable as well (on most clouds there is some hardware in maintenance some of the time), it soon falls prey to alert fatigue, and helps build it too.

cheers, peter.

sudeephb commented 7 months ago

(by eric-chen) This charm is no longer being actively maintained. Please consider using the new Canonical Observability Stack instead (https://charmhub.io/topics/canonical-observability-stack). I will close this feature request.