cjdcordeiro opened this issue 1 year ago
This is a good point. It will mean that before Pebble starts, the `/v1/health` endpoint (used as a K8s probe endpoint by Juju) will return an error, as it should; then when Pebble starts it'll report success (`up`); then if the check is actually down, it'll report `down` correctly after N failures. But in the meantime we have a "blip" of up. I originally designed it to default to `up` as a kind of "give it the benefit of the doubt".
I'm not sure we should just change the default to `down`, because we don't really want it to stay down for a full 10s (the default check period). We could add a third state, `unknown`, and default to that; however, that still has issues for `/v1/health`, which returns an overall true/false binary result, and again I'm not sure you'd want that to report health=false for a full 10 seconds. Or maybe that is okay? For Juju we set the K8s probes' `InitialDelaySeconds` to 30s, so it would work fine for that.
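For reference, a readiness probe configured along those lines might look like the following sketch. Only the 30-second initial delay reflects what's described above; the path, port, and period are illustrative assumptions, not Juju's actual values.

```yaml
# Hypothetical K8s probe hitting Pebble's health endpoint.
# Only initialDelaySeconds: 30 is taken from the discussion;
# path, port, and periodSeconds are illustrative.
readinessProbe:
  httpGet:
    path: /v1/health
    port: 4000
  initialDelaySeconds: 30
  periodSeconds: 10
```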
We could make the check manager run the checks right away ... but that isn't right either, because the service/workload being monitored probably isn't up yet anyway.
@hpidcock Thoughts?
I don't think "up" is correct initially, but I'm also not sure defaulting to "down" is correct either. My instinct is to make the default state configurable with "down" as the default.
```yaml
checks:
  chk1:
    override: replace
    level: ready
    initial-status: down
    http:
      url: http://23234sdffsdfsadf.com/
```
The alternative is that we don't return any result until all checks have evaluated. I.e. the API request can long-poll if it's indeterminate.
Right, if the expected behaviour is to return a boolean, then yes, both `up` and `down` would always be misleading. If True|False are the only two options, then this is very open to interpretation and I don't think there's a correct answer. I personally would expect, in this case, the system to be considered "up" until proven otherwise. Ideally I'd like to see either `unknown` or no state at all.
Just as a reference for inspiration, similar healthcheck mechanisms are also employed by Docker and Kubernetes, as you know. For Docker, the accepted health statuses are UNHEALTHY, HEALTHY and STARTING.
@hpidcock has some interesting suggestions (an `initial-status` config field, or the API request waiting till checks have started); however, I think they're both a bit heavy for what we need here. The API request waiting till checks have started raises several questions -- do we wait for all checks? What if the periods are quite long? And so on.
I suggest we do this: add a `CheckStatusUnknown` (`unknown`) in addition to `up` and `down`, and report that status in the `/v1/checks` API and the `pebble checks` CLI command for checks that haven't run yet at all.
However, I suggest we don't change the behaviour of `/v1/health`, which needs to return a binary success/failure response. You could argue that either way, but if we default it to down we get the issue I mentioned above: "I'm not sure you'd want that to report health=false for a full 10 seconds". So I suggest we leave the `/v1/health` response as it is today -- in other words, map the statuses like so:

| Check status | /v1/health response |
|---|---|
| `unknown` | success |
| `up` | success |
| `down` | failure |
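The mapping above can be sketched as a small predicate: only `down` causes `/v1/health` to report failure, while both `unknown` and `up` map to success. This is an illustrative sketch, not Pebble's actual implementation; the function and status names here are assumptions.

```go
package main

import "fmt"

// healthy sketches the proposed mapping from a check's status to the
// boolean /v1/health result: "unknown" and "up" -> success, "down" -> failure.
// Plain strings stand in for Pebble's real status type.
func healthy(status string) bool {
	return status != "down"
}

func main() {
	for _, s := range []string{"unknown", "up", "down"} {
		fmt.Printf("%s -> healthy=%v\n", s, healthy(s))
	}
}
```

With this mapping, a freshly started Pebble whose checks haven't run yet keeps reporting success, preserving today's `/v1/health` behaviour while `pebble checks` can still surface `unknown`.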
Problem
If we list the checks very early in the lifetime of the Pebble plan, the output will show a misleading default "up" status, even though we aren't yet capable of making such an assessment.
Reproduce
Use an intentionally faulty check, like:
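For example, a check pointing at an unresolvable URL (a sketch; the check name and URL are illustrative):

```yaml
# A deliberately failing check: the URL can never resolve,
# so the check can never succeed.
checks:
  bad-check:
    override: replace
    http:
      url: http://this-host-does-not-exist.example/
```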
Start the server:
pebble run
Immediately list the checks with `pebble checks`, and notice the `Status = up` in the output, which is misleading; it will only eventually switch to "down".