cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0

Implement additional metrics for counting failed HealthChecks #953

Open vlast3k opened 1 month ago

vlast3k commented 1 month ago

Proposed Change

As an operator
I want to monitor the rate of failing liveness healthchecks via metrics
So that I can get alerted in case there are irregularities

Problem Details

We have observed that during diego-cell evacuation an exceptionally large number of liveness healthchecks time out in some cases. After closer investigation we discovered the following:

Currently we are monitoring the CPU Wait as reported by BOSH, but this is sometimes misleading because:

So it is hard to define a consistent metric that tells us when to trigger an alert that something needs to be scaled.

On the other hand, monitoring the failing healthchecks, and especially sharp increases (e.g. from 10 to 1000 per minute), is a very consistent indicator.

Currently we do this by counting the occurrences of the following log message

rep.executing-container-operation.ordinary-lrp-processor.process-reserved-container.run-container.containerstore-run.node-run.liveness-check.run-step.run-step-failed-with-nonzero-status-code

in a Kibana dashboard, but triggering alerts from Kibana has other operational challenges.
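To make the "sharp increase" signal above concrete, here is a minimal sketch in Go of how such a check could be expressed on per-minute failure counts. The function name `IsSpike`, the `factor` multiplier, and the `floor` threshold are all assumptions made for illustration; they are not part of Diego, the executor, or any monitoring tool mentioned in this issue.

```go
package main

import "fmt"

// IsSpike is a hypothetical helper: it reports whether the failure count of
// the current minute (curr) is a sharp increase over the previous minute (prev).
// floor suppresses alerts while the absolute count is still low, and factor is
// the multiplier that counts as "sharp". Both values are assumptions.
func IsSpike(prev, curr uint64, factor float64, floor uint64) bool {
	if curr < floor {
		// Too few failures in absolute terms to be worth alerting on.
		return false
	}
	if prev == 0 {
		// No failures before, and now at or above the floor.
		return true
	}
	return float64(curr) >= factor*float64(prev)
}

func main() {
	// The example from the issue text: 10 failures per minute jumping to 1000.
	fmt.Println(IsSpike(10, 1000, 10, 100))
	// A small fluctuation that should not trigger an alert.
	fmt.Println(IsSpike(10, 12, 10, 100))
}
```

Comparing against the previous interval rather than a fixed threshold keeps the check independent of the normal background failure rate of a given landscape, which is the property that makes this indicator more consistent than CPU Wait.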

Solution Proposal

Therefore our proposal is to modify the executor so that it emits a Counter metric with the number of failed healthchecks. This way an alert (e.g. via Riemann) can be configured for exceptionally high values.
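As a rough illustration of the idea, the sketch below shows a counter that is incremented on each failed check and only counts the check kinds it was enabled for. It is not the code from the POC and does not use the executor's real metrics client; `CheckKind`, `FailedCheckCounter`, `RecordFailure`, and `Count` are hypothetical names, and the counts are kept in memory instead of being emitted to a metrics pipeline.

```go
package main

import (
	"fmt"
	"sync"
)

// CheckKind is a hypothetical label for the type of health check.
type CheckKind string

const (
	Liveness  CheckKind = "liveness"
	Readiness CheckKind = "readiness"
)

// FailedCheckCounter is a hypothetical in-memory stand-in for a metrics
// counter. It only counts failures for the check kinds it was enabled for.
type FailedCheckCounter struct {
	mu      sync.Mutex
	enabled map[CheckKind]bool
	counts  map[CheckKind]uint64
}

// NewFailedCheckCounter enables counting for the given check kinds.
func NewFailedCheckCounter(kinds ...CheckKind) *FailedCheckCounter {
	c := &FailedCheckCounter{
		enabled: make(map[CheckKind]bool),
		counts:  make(map[CheckKind]uint64),
	}
	for _, k := range kinds {
		c.enabled[k] = true
	}
	return c
}

// RecordFailure increments the counter for kind if that kind is enabled.
// It returns true when the failure was counted.
func (c *FailedCheckCounter) RecordFailure(kind CheckKind) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.enabled[kind] {
		return false
	}
	c.counts[kind]++
	return true
}

// Count returns the number of failures recorded so far for kind.
func (c *FailedCheckCounter) Count(kind CheckKind) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[kind]
}

func main() {
	c := NewFailedCheckCounter(Liveness)
	c.RecordFailure(Liveness)
	c.RecordFailure(Liveness)
	c.RecordFailure(Readiness) // ignored: readiness is not enabled
	fmt.Println(c.Count(Liveness), c.Count(Readiness))
}
```

In a real implementation the increment would presumably be forwarded to the platform's metrics pipeline so that an alerting tool can compute rates over time; the mutex is there because checks for many containers run concurrently on a cell.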

For discussion, we have done a POC in the PR https://github.com/cloudfoundry/executor/pull/102 that solves the problem and allows us to monitor the healthchecks. It allows choosing for which checks the counter is emitted. So far this choice is not configurable, because for most of the checks it does not make sense.

Depending on our discussions here, we may also extend it or change it so that it suits the community.

Acceptance criteria

Scenario: Diego cell update is performed
Given I have enabled emitting metrics for failing healthchecks
When the performance of the Diego EBS volumes is not sufficient
Then I receive the metrics in the monitoring stack and can act on them

Related links

ebroberson commented 4 weeks ago

I think this looks like a good change, but definitely want someone else on the team to review and comment.