feat: perform health checks via changes and tasks

benhoyt commented 2 months ago

Per spec JU073, this reimplements the health check manager to to use Changes and Tasks do drive the check operations. There are two change kinds, each of which has a single task of the same kind:

perform-check: used for driving the check while it's "up". The change (and task) finish when the number of failures hits the threshold, at which point it goes into Error status. Each check failure records a task log.
recover-check: used for driving the check while it's "down". The change (and task) finish when the check starts succeeding again, at which point it goes into Done status. Again, each check failure records a task log.

We also add a new change-id field to the /v1/checks responses, allowing a client and the CLI to look up the change for each check (primarily to get the task logs for failing checks). The pebble checks command displays this as follows:

Check  Level  Status  Failures  Change
chk1   -      up      0/3       1
chk2   -      down    1/1       2 (cannot perform check: blah blah error)
chk3   alive  down    42/3      3 (this is a long truncated error messag... run "pebble tasks 3" for more)

We (@hpidcock, @jameinel, and I) debated whether it should switch from perform-check to recover-check after the first failure or only when it hits the threshold (which is how I had the code originally). We decided in the end that it's better and more obvious to switch when it hits the threshold, as otherwise the check status is out of sync with the change kind, and you need a third state. We figured in a Juju context, for example, your alive/ready checks would normally be separate from the change-update checks you want the charm to be notified about. And if you do want them to be the same check, you can easily add a similar check with a different threshold.

Here's an example of that flow (taken from the spec):

initial state:
  change 1 - Doing
task 1: perform-check - Doing

on first error:
  change 1 - Error
task 1: perform-check - Error (contains first failure log(s) - up to 10)
  change 2 - Doing
task 2: recover-check - Doing

on second (or subsequent) errors:
  change 1 - Error
task 1: perform-check - Error
  change 2 - Doing
task 2: recover-check - Doing (contains last failure log(s) - up to 10)

now on success:
  change 1 - Error
task 1: perform-check - Error
  change 2 - Done
task 2: recover-check - Done (keeps last failure log(s) - up to 10)
  change 3 - Doing
task 3: perform-check - Doing

As far as the checks system concerns, this is an implementation detail. However, the use of changes and tasks mean that the status of a check operation and check failures can be introspected (via the existing changes API).

Fixes #103.

flotter commented 2 months ago

@benhoyt still busy reviewing ... will continue asap.

benhoyt commented 2 months ago

@flotter I've reworked this quite a bit now. In particular, I tried to store all the data in change/task data to avoid a separate "m.checks" map and trying to keep that in sync. I almost had it ... and then I remembered (actually, noticed in testing!) the GET /v1/health endpoint isn't allowed to acquire the state lock, so it couldn't call Checks. I ended up creating a kind of cache m.health and a new method Health that returns a cut-back version of the data just for the health endpoint. I'm not entirely happy with it, but at least the health-related stuff is fairly isolated from the main "checks engine". Happy to consider alternative approaches.

canonical / pebble

feat: perform health checks via changes and tasks #409