Closed benhoyt closed 2 months ago
@benhoyt still busy reviewing ... will continue asap.
@flotter I've reworked this quite a bit now. In particular, I tried to store all the data in change/task data to avoid a separate "m.checks" map and trying to keep that in sync. I almost had it ... and then I remembered (actually, noticed in testing!) the GET /v1/health
endpoint isn't allowed to acquire the state lock, so it couldn't call Checks
. I ended up creating a kind of cache m.health
and a new method Health
that returns a cut-back version of the data just for the health endpoint. I'm not entirely happy with it, but at least the health-related stuff is fairly isolated from the main "checks engine". Happy to consider alternative approaches.
Per spec JU073, this reimplements the health check manager to to use Changes and Tasks do drive the check operations. There are two change kinds, each of which has a single task of the same kind:
perform-check
: used for driving the check while it's "up". The change (and task) finish when the number of failures hits the threshold, at which point it goes into Error status. Each check failure records a task log.recover-check
: used for driving the check while it's "down". The change (and task) finish when the check starts succeeding again, at which point it goes into Done status. Again, each check failure records a task log.We also add a new
change-id
field to the /v1/checks responses, allowing a client and the CLI to look up the change for each check (primarily to get the task logs for failing checks). Thepebble checks
command displays this as follows:We (@hpidcock, @jameinel, and I) debated whether it should switch from perform-check to recover-check after the first failure or only when it hits the threshold (which is how I had the code originally). We decided in the end that it's better and more obvious to switch when it hits the threshold, as otherwise the check status is out of sync with the change kind, and you need a third state. We figured in a Juju context, for example, your alive/ready checks would normally be separate from the change-update checks you want the charm to be notified about. And if you do want them to be the same check, you can easily add a similar check with a different threshold.
Here's an example of that flow (taken from the spec):
As far as the checks system concerns, this is an implementation detail. However, the use of changes and tasks mean that the status of a check operation and check failures can be introspected (via the existing changes API).
Fixes #103.