canonical / pebble

Pebble is a lightweight Linux service manager with layered configuration and an HTTP API.
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0

Reconsider checks reporting "up" status before being executed #164

Open cjdcordeiro opened 1 year ago

cjdcordeiro commented 1 year ago

Problem

If we list the checks very early in the lifetime of the Pebble plan, the output returns a default "up" status, which is misleading: no check has run yet, so we aren't actually capable of making that assessment.

Reproduce

Use an intentionally faulty check, like:

checks:
    chk1:
        override: replace
        level: ready
        http:
            url: http://23234sdffsdfsadf.com/

Start the server: pebble run

Immediately list the checks with pebble checks, and the output will be

Check  Level  Status  Failures
chk1   ready  up      0/3

Notice the Status = up, which is misleading, and will eventually switch to "down".

benhoyt commented 1 year ago

This is a good point. It means that before Pebble starts, the /v1/health endpoint (used as a K8s probe endpoint by Juju) will return an error, as it should; when Pebble starts it'll report success (up); and then, if the check is actually down, it'll report down correctly after N failures. But in the meantime we have a "blip" of up. I originally designed it to default to up as a kind of "give it the benefit of the doubt".

I'm not sure we should just change the default to down, because we don't really want it to stay down for a full 10s (the default check period). We could add a third state, unknown, and default to that; however, that still has issues for /v1/health, which returns an overall true/false binary result, and again I'm not sure you'd want that to report health=false for a full 10 seconds. Or maybe that is okay? For Juju we set the K8s probes' InitialDelaySeconds to 30s, so it would work fine for that.
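For context, a rough sketch of the kind of Kubernetes probe configuration being referred to (only the 30s initial delay comes from the discussion above; the probe type, port, and other timings are illustrative assumptions, not Juju's actual pod spec). With initialDelaySeconds: 30, the kubelet doesn't look at /v1/health until well after the checks have had a chance to run, so a short startup "blip" would never be observed:

readinessProbe:
    httpGet:
        path: /v1/health      # Pebble health endpoint used as the probe target
        port: 38813           # illustrative port; depends on how Pebble's HTTP API is exposed
    initialDelaySeconds: 30   # probe only starts after 30s
    periodSeconds: 5
    failureThreshold: 3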

We could make the check manager run the checks right away ... but that isn't right either, because the service/workload being monitored probably isn't up yet anyway.

@hpidcock Thoughts?

hpidcock commented 1 year ago

I don't think "up" is correct initially, but I'm also not sure defaulting to "down" is correct either. My instinct is to make the default state configurable with "down" as the default.

hpidcock commented 1 year ago

checks:
    chk1:
        override: replace
        level: ready
        initial-status: down
        http:
            url: http://23234sdffsdfsadf.com/

hpidcock commented 1 year ago

The alternative is that we don't return any result until all checks have been evaluated, i.e. the API request can long-poll if the result is indeterminate.

cjdcordeiro commented 1 year ago

Right, if the expected behaviour is to return a boolean, then yes, both up and down would always be misleading. If True|False are the only two options, then this is very open to interpretation and I don't think there's a correct answer. In that case I would personally expect the system to be considered "up" until proven otherwise. Ideally, though, I'd like to see either unknown or no state at all.

Just as a reference for inspiration, similar healthcheck mechanisms are employed by Docker and Kubernetes, as you know. For Docker, the accepted health statuses are UNHEALTHY, HEALTHY, and STARTING.
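For reference, a minimal Docker Compose sketch of that mechanism (the service name, image, and timings are made up for illustration). A freshly started container reports health status "starting" until its first check succeeds, and failures during start_period don't count toward retries, which is roughly the "no verdict yet" idea being discussed here:

services:
    web:
        image: my-app:latest          # hypothetical image with curl available
        healthcheck:
            test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
            interval: 10s
            timeout: 3s
            retries: 3
            start_period: 30s         # grace window: failures here don't count toward retries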

benhoyt commented 1 year ago

@hpidcock has some interesting suggestions (an initial-status config field, or having the API request wait until checks have started); however, I think they're both a bit heavy for what we need here. Having the API request wait until checks have started raises several questions: do we wait for all checks? What if the periods are quite long? And so on.

Here's what I suggest we do: add a CheckStatusUnknown ("unknown") status in addition to up and down, and report that status in the /v1/checks API and the pebble checks CLI command for checks that haven't run at all yet.
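With that change, immediately listing the checks in the scenario from the issue description would presumably show something like this (hypothetical output, mirroring the example above):

Check  Level  Status   Failures
chk1   ready  unknown  0/3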

However, I suggest we don't change the behaviour of /v1/health, which needs to return a binary success/failure response. You could argue it either way, but if we default it to down we get the issue I mentioned above: "I'm not sure you'd want that to report health=false for a full 10 seconds". So I suggest we leave the /v1/health response as it is today -- in other words, map the statuses like so:

Check status  /v1/health response
unknown       success
up            success
down          failure