canonical / pebble

Pebble is a lightweight Linux service manager with layered configuration and an HTTP API.
https://canonical-pebble.readthedocs-hosted.com/
GNU General Public License v3.0

Pebble alive and ready checks #439

Closed (amandahla closed 3 months ago)

amandahla commented 3 months ago

Hi,

From the docs:

        # (Optional) Defines what happens when each of the named health checks
        # fail. Possible values are:
        #
        # - restart (default): restart the service once
        # - shutdown: shut down and exit the Pebble daemon (with exit code 11)
        # - success-shutdown: shut down and exit Pebble with exit code 0
        # - ignore: do nothing further
        on-check-failure:
            <check name>: restart | shutdown | success-shutdown | ignore

And

        # (Optional) Check level, which can be used for filtering checks when
        # calling the checks API or health endpoint.
        #
        # For the health endpoint, ready implies alive. In other words, if all
        # the "ready" checks are succeeding and there are no "alive" checks,
        # the /v1/health API will return success for level=alive.
        level: alive | ready

I assumed:

But actually:

Is this the expected behavior, with the documentation just being unclear about it, or should the behavior be changed to match what one would assume from the docs?

benhoyt commented 3 months ago

Hi @amandahla, thanks for the report. There is actually a small bug in the docs -- see this PR for the fix. Basically, there's no such thing as a default for on-check-failure: it's a map of check names to actions, so it doesn't have a "default" the way the on-success and on-failure fields (which are just strings) do. So I've just removed the word "default" there in the layer definition docs.

However, the other thing is working as expected. The health endpoint section of the README clarifies this:

Ready implies alive, and not-alive implies not-ready. If you've configured an "alive" check but no "ready" check, and the "alive" check is unhealthy, /v1/health?level=ready will report unhealthy as well, and the Kubernetes readiness probe will act on that.

So ready implies alive, but not-ready does not necessarily mean not-alive ("alive" roughly means the network is up, whereas "ready" means the service is ready to serve). That's why /v1/health?level=alive or pebble health --level=alive returns healthy/true even when the ready check is failing. If you want finer-grained control over ready vs alive, you'll need to define both "ready" and "alive" level checks.
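
To make the level rule concrete, here is a minimal Python sketch of the semantics as described in this thread. It is an illustrative model only, not Pebble's implementation: the `Check` type and the `level_matches` and `health` functions are hypothetical names, and the treatment of checks with no level (they match neither query) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Check:
    name: str
    level: Optional[str]  # "alive", "ready", or None when no level is set
    healthy: bool


def level_matches(requested: str, check_level: Optional[str]) -> bool:
    """Return True if a check takes part in a health query for `requested`."""
    if requested == "alive":
        # Only "alive" checks count; "ready" checks are not consulted.
        return check_level == "alive"
    if requested == "ready":
        # Ready implies alive, so "alive" checks also count toward "ready".
        return check_level in ("alive", "ready")
    raise ValueError(f"unknown level: {requested!r}")


def health(checks: Iterable[Check], level: str) -> bool:
    """Healthy when every check matching the requested level is healthy."""
    # all() over an empty selection is True, which is why a plan with no
    # "alive" checks reports healthy for level=alive.
    return all(c.healthy for c in checks if level_matches(level, c.level))


# Only a failing "ready" check is configured, as in the report above.
only_ready_failing = [Check("db", "ready", healthy=False)]
alive_result = health(only_ready_failing, "alive")  # no alive checks to fail
ready_result = health(only_ready_failing, "ready")  # the ready check fails
```

Under this model, the reported behavior follows directly: with no "alive" checks defined, nothing can make the level=alive query unhealthy, whatever the "ready" checks report.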

amandahla commented 3 months ago

Thanks for the clarification.

A few questions:

1) Can on-check-failure be set for both kinds (ready and alive)? If it is not set, and even without the default, an alive check that fails will result in a restart, right?

2) WDYT about changing this:

        # For the health endpoint, ready implies alive. In other words, if all
        # the "ready" checks are succeeding and there are no "alive" checks,
        # the /v1/health API will return success for level=alive.

to something that clarifies that if there are no alive checks, level=alive will return healthy/true no matter what the ready checks report?

benhoyt commented 3 months ago

Can on-check-failure be set for both kinds (ready and alive)?

Yes, on-check-failure can be set for any check: level=alive checks, level=ready checks, and checks with no level at all.

If it is not set, and even without the default, an alive check that fails will result in a restart, right?

No, that's not how it works. Checks, without further configuration, are independent of services. If a check fails, nothing will happen by default (well, only the state of /v1/checks and /v1/health may change). You have to "wire it up" to a service manually using on-check-failure: {<check_name>: restart} to make a specific service restart when that check fails.
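
The wiring described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not Pebble code: the `Services` mapping and the `actions_for_failed_check` function are hypothetical, and the service and check names (`web`, `worker`, `web-alive`, `web-ready`) are made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory view of a plan's services: service name mapped to
# its on-check-failure map (check name -> action).
Services = Dict[str, Dict[str, str]]


def actions_for_failed_check(services: Services, check_name: str) -> List[Tuple[str, str]]:
    """List the (service, action) pairs triggered when `check_name` fails."""
    triggered = []
    for service, on_check_failure in services.items():
        action = on_check_failure.get(check_name)
        # A service with no entry for this check is left alone, and
        # "ignore" means do nothing further.
        if action is not None and action != "ignore":
            triggered.append((service, action))
    return triggered


services: Services = {
    "web": {"web-alive": "restart"},  # explicitly wired to one check
    "worker": {},                     # no wiring at all
}

wired = actions_for_failed_check(services, "web-alive")
unwired = actions_for_failed_check(services, "web-ready")
```

Here `wired` contains only the `web` service with the `restart` action, while `unwired` is empty: a failing check that no service references triggers nothing, which matches the "nothing will happen by default" behavior.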

I'll take another look at the doc wording, to see if that needs to be more explicit.

benhoyt commented 3 months ago

I've tried to clarify the wording a bit more here: https://github.com/canonical/pebble/pull/442. The layer configuration also points people to the longer section for more detail.