Implement additional metrics for counting failed HealthChecks

Proposed Change

As an operator, I want to monitor the rate of failing liveness healthchecks via metrics, so that I can get alerted when there are irregularities.

Problem Details

We have observed that during diego-cell evacuation, in some cases an exceptionally large number of liveness healthchecks time out. After closer investigation we discovered the following:

This happens when the diego-cells are updated: with each batch (10% of the workload), the remaining 90% of the cells have to start the replacement LRPs.
Starting each LRP results in high disk I/O, since the droplets / docker-layers are being downloaded.
If the disk performance (EBS volumes) is not high enough, this leads to high CPU wait time.
CPU wait tends to block a core from executing instructions from other threads.
As a result, liveness healthchecks configured for 1-5 seconds tend to time out, even if the container is idling.
This happens mostly on overloaded landscapes, and increasing the disk performance from the default 125 MB/s to 500 MB/s solves the problem.

Currently we are monitoring the CPU wait as reported by BOSH, but this is sometimes misleading because:

on VMs with few cores, e.g. 4, one core waiting is 25% CPU wait
on VMs with 128+ cores, one core waiting is < 1% CPU wait

So it is hard to define a consistent threshold for when to trigger an alert that something needs to be scaled.
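
The core-count dependence can be illustrated with a small calculation. This is only a sketch: it assumes the VM-wide CPU wait figure is a plain average over all cores, with exactly one core fully blocked and the rest idle, and the function name is made up for illustration.

```go
package main

import "fmt"

// singleCoreWaitShare returns the VM-wide CPU wait percentage that would be
// reported when exactly one core is fully blocked on I/O and all other cores
// are idle (assumption: the reported value is an average over all cores).
func singleCoreWaitShare(cores int) float64 {
	return 100.0 / float64(cores)
}

func main() {
	for _, cores := range []int{4, 16, 128} {
		fmt.Printf("%d cores: one blocked core = %.2f%% CPU wait\n",
			cores, singleCoreWaitShare(cores))
	}
}
```

The same blocked core is a large signal on a 4-core VM and almost invisible on a 128-core VM, so a single percentage threshold cannot fit both.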

On the other hand, monitoring the failing healthchecks, and especially sharp increases (e.g. from 10 to 1000 per minute), is a very consistent indicator.

Currently we are doing this by counting the occurrences of the log message

rep.executing-container-operation.ordinary-lrp-processor.process-reserved-container.run-container.containerstore-run.node-run.liveness-check.run-step.run-step-failed-with-nonzero-status-code

in a Kibana dashboard, but triggering alerts from Kibana has other operational challenges.

Solution Proposal

Therefore our proposal is to modify the executor so that it emits a counter for the number of failed healthchecks. This way an alert (e.g. via Riemann) can be configured for exceptionally high values.

For discussion we have done a POC in this PR: https://github.com/cloudfoundry/executor/pull/102. It solves the problem and allows us to monitor the healthchecks. It allows choosing for which checks the counter should be emitted. So far it is not configurable, because for most of the checks it does not make sense.

Depending on our discussions here, we may also extend or change it so that it suits the community.

Acceptance criteria

Scenario: Diego cell update is performed
Given I have enabled emitting metrics for failing healthchecks
When the performance of the Diego EBS volumes is not sufficient
Then I receive the metrics in the monitoring stack and can act on them
