canonical / charm-openstack-service-checks

Collection of Nagios checks and other utilities that can be used to verify the operation of an OpenStack cluster

Please allow for more sophisticated exceptions to Octavia-related alerts #168

Open · Vultaire opened this issue 1 month ago

Vultaire commented 1 month ago

A common problem we've encountered is o-s-c alerting us on any load balancer failure, regardless of whether the load balancer belongs to a customer workload. To be clear, this is a good, sane default, but it can produce excessive alerting in some cases. For example, customers can deploy projects and load balancers of their own, and alerts may fire because their targets are down or not yet working as expected - i.e. due to customer error rather than a failure of the undercloud. This can be useful information, but in some cases it may be preferable to ignore it.

There are many ways this issue could be addressed, and I don't want to prescribe one specific solution. However, here are a few use cases to consider:

  1. I have an OpenStack deployment used by end users. Load balancers created by end users shouldn't trigger alerts by default; customers can raise tickets if they have "real" problems with their load balancers that they need us to investigate. However, there are specific load balancers I want to monitor, or specific projects for which I want to monitor all load balancers.

  2. A customer is spinning up a series of test environments with a common prefix in their names. We normally want to monitor for any load balancer failure, but the customer uses that naming convention (a prefix, suffix, glob, or similar) to indicate a test load balancer that should not trigger alerts.

In other words:

Please consider how we might make the various Octavia checks more flexible to enable at least some of these use cases. Thank you.
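
To illustrate one possible direction (not a proposal for the final interface), here is a minimal sketch of how name-glob and per-project filtering might be layered onto a load balancer check. The option names (`ignore_name_globs`, `monitored_projects`), the `check_load_balancers` helper, and the data shapes are hypothetical and not part of the current charm; in the real check the load balancer records would come from the Octavia API rather than an inline list.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: suppress Octavia load balancer alerts by name glob
and restrict them to selected projects.  Names and data shapes are
illustrative only, not the charm's actual interface."""

import fnmatch

# Standard Nagios exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def should_alert(lb, ignore_name_globs, monitored_projects):
    """Decide whether a failed load balancer should raise an alert.

    lb                 -- dict with at least 'name' and 'project_id'
    ignore_name_globs  -- glob patterns (e.g. 'test-*') that suppress alerts
    monitored_projects -- if non-empty, only these project IDs are alerted on
    """
    name = lb.get("name") or ""
    if any(fnmatch.fnmatch(name, pattern) for pattern in ignore_name_globs):
        return False
    if monitored_projects and lb.get("project_id") not in monitored_projects:
        return False
    return True


def check_load_balancers(load_balancers, ignore_name_globs=(), monitored_projects=()):
    """Return (exit_code, message) in the usual Nagios style."""
    failed = [
        lb for lb in load_balancers
        if lb.get("operating_status") == "ERROR"
        and should_alert(lb, ignore_name_globs, monitored_projects)
    ]
    if failed:
        names = ", ".join(lb.get("name") or lb.get("id", "?") for lb in failed)
        return CRITICAL, "CRITICAL: load balancers in ERROR: {}".format(names)
    return OK, "OK: no alertable load balancer failures"


if __name__ == "__main__":
    # Example data standing in for the Octavia API response.
    lbs = [
        {"id": "1", "name": "test-env-a-lb", "project_id": "cust1",
         "operating_status": "ERROR"},
        {"id": "2", "name": "prod-api-lb", "project_id": "infra",
         "operating_status": "ERROR"},
    ]
    code, msg = check_load_balancers(
        lbs,
        ignore_name_globs=["test-*"],       # use case 2: customer test environments
        monitored_projects=["infra"],       # use case 1: only alert on chosen projects
    )
    print(msg)
    raise SystemExit(code)
```

In this sketch the two mechanisms compose: glob patterns opt individual load balancers out, while a non-empty project list opts whole projects in. Whether the charm would expose these as charm config options, per-project tags, or something else is left open, per the use cases above.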