canonical / charm-openstack-service-checks

Collection of Nagios checks and other utilities that can be used to verify the operation of an OpenStack cluster

check_octavia.py should provide more information on nagios status line, or should log errors to a log file #135

Open sudeephb opened 6 months ago

sudeephb commented 6 months ago

I frequently see Octavia alerts reporting that something is amiss, but by the time I can take a look, the issue has sometimes self-resolved and I can no longer run the associated check by hand to determine what went wrong. Similarly, when reviewing events that occurred previously, the alerts raised in Nagios lack enough information to allow for meaningful action.

I haven't looked deeply enough, but this may be especially the case when something is being ignored. I get a Nagios message which looks like this:

CRITICAL: total_alarms[1], total_crit[1], total_ignored[0], ignoring r'(?:)

...Unfortunately, this doesn't give me anything meaningful in event history in Nagios to look at. I don't even know what load balancer or pool had the critical error; I just know that something was wrong.

I see in the script that we construct a message object by joining multiple strings together with newlines. We may want to consider a different method which results in longer but more useful strings, or we may want to consider having this script also write to a log file so as to allow for longer responses in a way which would be captured by Graylog, or at the very least have something on disk that we can look at after the fact.
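
To make the suggestion concrete, here is a rough sketch of both ideas together: a status line that names the affected resources, plus a rotating log file on disk. This is not the current check_octavia.py code. The alarm structure (a list of `(level, description)` tuples), the function names, the logger name, and the log path are all hypothetical, chosen only for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler

# Standard Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_WARNING = 1
NAGIOS_CRITICAL = 2


def build_status_line(alarms):
    """Return (exit_code, single-line status) naming each critical alarm.

    `alarms` is a hypothetical list of (level, description) tuples.
    """
    crit = [desc for level, desc in alarms if level == NAGIOS_CRITICAL]
    if not crit:
        return NAGIOS_OK, "OK: no Octavia alarms"
    # Join on "; " so the details stay on the first line of plugin output.
    return NAGIOS_CRITICAL, "CRITICAL: total_crit[{}]: {}".format(
        len(crit), "; ".join(crit)
    )


def get_logger(path):
    """Return a logger writing to a size-rotated file at `path`."""
    logger = logging.getLogger("check_octavia_sketch")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = RotatingFileHandler(path, maxBytes=1000000, backupCount=3)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def report(alarms, log_path):
    """Log every alarm to disk, print the status line, return the exit code."""
    status, line = build_status_line(alarms)
    logger = get_logger(log_path)
    for level, desc in alarms:
        py_level = logging.ERROR if level == NAGIOS_CRITICAL else logging.WARNING
        logger.log(py_level, desc)
    print(line)
    return status
```

With something along these lines, the Nagios event history would carry the load balancer or pool name, and the full detail would remain on disk (where a log shipper could forward it to Graylog) even after the condition clears.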


Imported from Launchpad using lp2gh.