NRPE checks should be split into individual checks

Hello,

For the NRPE checks, having one monolithic check per role can cause problems during incident management for Contrail services. When an alert fires and turns out to be non-actionable, it is common to schedule downtime for it or mute it; for example, if a bug is found in a check, that check is silenced.

With the way the Nagios checks are currently organised, a problem with a single service cannot be silenced on its own. If the issue is not service-impacting and needs further investigation, or a fix to the monitoring itself, silencing it also silences the checks for every other service in the same role on those hosts (for example, all analytics services, of which there are many).

The recommended approach is to have an NRPE check for each service. This can still be handled by a single script that reads the service states from contrail-status, but a separate NRPE command should be registered for each service to allow separate alerting. To avoid hammering contrail-status, the output could be cached locally and regenerated once it is more than a set number of seconds old.
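To make the proposal concrete, here is a rough sketch of what such a script could look like. It is only an illustration, not the charm's current code. It assumes contrail-status prints lines of the form `service: state` under `== Section ==` headers, and it treats only `active` as healthy; both would need checking against the real output. The cache path, the 30-second maximum age, and the script name are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Sketch: one script, one NRPE command per service, shared cached output."""
import os
import subprocess
import sys
import tempfile
import time

# Hypothetical values chosen for illustration, not taken from the charm.
CACHE_PATH = "/var/tmp/contrail-status.cache"
MAX_AGE_SECONDS = 30

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def run_contrail_status():
    """Run contrail-status once and return its raw output."""
    result = subprocess.run(
        ["contrail-status"], capture_output=True, text=True, check=False
    )
    return result.stdout


def cached_status(path=CACHE_PATH, max_age=MAX_AGE_SECONDS,
                  runner=run_contrail_status):
    """Return cached output, regenerating it when older than max_age seconds."""
    try:
        if time.time() - os.path.getmtime(path) < max_age:
            with open(path) as handle:
                return handle.read()
    except OSError:
        pass  # no cache yet, or unreadable: fall through and regenerate
    output = runner()
    # Write to a temp file and rename it, so a concurrent check never
    # reads a partially written cache.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as handle:
        handle.write(output)
    os.replace(tmp, path)
    return output


def parse_status(text):
    """Parse assumed 'service: state' lines; skip blank and '== ... ==' lines."""
    states = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("=="):
            continue
        name, sep, state = line.partition(":")
        if sep:
            states[name.strip()] = state.strip()
    return states


def check_service(states, service):
    """Return (exit_code, message) for a single service."""
    state = states.get(service)
    if state is None:
        return UNKNOWN, f"UNKNOWN: {service} not found in contrail-status output"
    if state == "active":
        return OK, f"OK: {service} is active"
    return CRITICAL, f"CRITICAL: {service} is {state}"


# Only act when a service name was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = check_service(parse_status(cached_status()), sys.argv[1])
    print(message)
    sys.exit(code)
```

Each service would then get its own NRPE command pointing at the same script, for example `command[check_contrail_control]=/usr/local/lib/nagios/plugins/check_contrail_service.py control` (the command name and path are again placeholders). Because every command reads the same cache file, contrail-status runs at most once per cache interval regardless of how many services are checked, and each service can be silenced independently in Nagios.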

I and others would be happy to help implement this if required. I feel it would be a good enhancement that follows good alerting and monitoring practice.

Thanks, James

Juniper / contrail-charms