Juniper / contrail-charms

Juju charms for Contrail services.
Apache License 2.0

NRPE checks should be split into individual checks #158

Open digitalrane opened 4 years ago

digitalrane commented 4 years ago

Hello,

For the NRPE checks, having one monolithic check per role can lead to issues when performing incident management for Contrail services. It is common, when an alert fires, to add downtime or mute the alert if it is non-actionable, so if a bug is found in a check, that check is silenced.

With the way the Nagios checks are currently organised, if a single service has a problem that is not service-impacting and needs further investigation, or a fix to the monitoring itself, that service cannot be silenced individually. Silencing it also silences the checks for every other service in the same role on those hosts (for example, all analytics services, of which there are many).

The recommended way to handle this would be to have an NRPE check for each service. This can still be handled by a single script that checks the state of the services in contrail-status; however, a separate NRPE command should be registered for each service to allow separate alerting. To avoid hammering contrail-status, its output could be cached locally and regenerated once the cache is more than a set number of seconds old.
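A rough sketch of what such a script could look like, to make the idea concrete. Everything named here is an assumption for illustration: `load_status`, `check_service`, the cache path and maximum age passed in by the caller, and the set of states treated as healthy. How contrail-status is actually invoked and parsed is left to the caller-supplied `fetch` function, which is expected to return a plain mapping of service name to state.

```python
import json
import os
import tempfile
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: states that should not raise an alert. Adjust per deployment.
HEALTHY_STATES = {"active", "backup"}


def load_status(fetch, cache_path, max_age):
    """Return {service: state}, calling fetch() only when the cache is stale."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age:
            with open(cache_path) as cache:
                return json.load(cache)
    except (OSError, ValueError):
        pass  # Missing or unreadable cache: regenerate it below.

    status = fetch()
    # Write to a temporary file and rename it into place, so a check running
    # at the same moment never reads a half-written cache.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path) or ".")
    with os.fdopen(fd, "w") as tmp:
        json.dump(status, tmp)
    os.replace(tmp_path, cache_path)
    return status


def check_service(status, service):
    """Map one service's state to a (Nagios exit code, message) pair."""
    state = status.get(service)
    if state is None:
        return UNKNOWN, f"UNKNOWN: {service} not reported by contrail-status"
    if state in HEALTHY_STATES:
        return OK, f"OK: {service} is {state}"
    return CRITICAL, f"CRITICAL: {service} is {state}"
```

Each NRPE command would then run the same script with a different service name, print the message, and exit with the returned code, so a single service can be given downtime without touching the others.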

I and others would be happy to help implement this if required; I feel it would be a good enhancement that follows good alerting and monitoring practice.

Thanks, James

Andrey-mp commented 4 years ago

Hi James,

It would be very good and helpful if you or someone else could implement this requirement. Also, contrail-status can return data in JSON format, which can be used here to simplify parsing.

But it's not so clear to me how it can be implemented when the set of services differs depending on the installed version.

Regards.
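One possible way to cope with a service set that varies by version, sketched under assumptions: rather than hard-coding service names, the NRPE command definitions could be generated from whatever services are reported on the host at configuration time. The function name `nrpe_command_lines`, the `check_contrail_` prefix, and the script path are hypothetical; only the `command[name]=command line` form is standard NRPE configuration syntax.

```python
import re
import shlex


def nrpe_command_lines(services, script):
    """Build one NRPE command definition per discovered service name."""
    lines = []
    for name in sorted(services):
        # NRPE command names are kept to letters, digits and underscores here.
        safe = re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")
        # Quote the argument in case a service name contains shell metacharacters.
        lines.append(
            f"command[check_contrail_{safe}]={script} {shlex.quote(name)}"
        )
    return lines


# Example with made-up service names and a hypothetical script path.
for line in nrpe_command_lines(["control", "analytics-api"], "/path/to/check.py"):
    print(line)
```

Regenerating these definitions whenever the charm is upgraded or reconfigured would keep the registered checks in step with the installed version, at the cost of checks appearing or disappearing in Nagios across upgrades.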