Add support for a delay before starting health checks

arturo-seijas commented 2 years ago

It would be useful to have ability to specify a delay before start executing them analogous to kubernetes' initialDelaySeconds:

checks:
    up:
        override: replace
        level: alive
        period: 10s
        timeout: 3s
        threshold: 3
        delay: 0s # this would be the new parameter
        exec:
            command: service nginx status

benhoyt commented 2 years ago

This is a reasonable request, thanks. We did consider it when initially implementing checks, but decided it generally wouldn't be needed -- the period (10s in this case) would almost always cover startup time. Just to get more info: what's your use case, or how long does the container take to start up (i.e., what would your delay value be here)? Could you use a slightly longer period instead?

arturo-seijas commented 2 years ago

We're deploying a web application. On startup it might run some database migrations and spin off an application server. This can take up to a couple minutes, which seems like a lot of time to be detected by adjusting the existing checks.

If we did, in the worst case scenario, we'd risk having the container down for two minutes before triggering a restart, which seems like sacrificing a lot of availability.

We did increase the period to mitigate our issues, but for the reasons above, the way it's configured still triggers some restarts while spinning up.

benhoyt commented 2 years ago

Just copying my quick-n-dirty proof of concept code for the record, before I delete it locally:

    delay := 100 * time.Millisecond // TODO: from plan
    logger.Debugf("Check %q starting with period %v, delay %v", c.config.Name, c.config.Period.Value, delay)

    if delay != 0 {
        // Before starting check, wait for initial delay time.
        select {
        case <-time.After(delay):
        case <-c.ctx.Done():
            logger.Debugf("Check %q stopped: %v", c.config.Name, c.ctx.Err())
            return
        }
    }

mthaddon commented 2 years ago

Btw, just to note that we already see this defined in the liveness checks for containers:

http-get http://:38816/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1

So there is already an initialDelay of 30 seconds on things. Are we talking about changing this, or changing how pebble responds here? I think what's happening is that whatever we define in pebble is effectively a layer on top of this. For example, per https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe this check means you end up with the liveness check failing (therefore rescheduling containers) in initialDelaySeconds + failureThreshold × periodSeconds, in this case 35 seconds. But actually we can kind of ignore this because pebble will respond with "OK" until the checks we've defined in the pebble layer as "alive" fail under the conditions we've defined there.

But if pebble itself fails to respond within 35 seconds the pod would be killed?

benhoyt commented 2 years ago

But if pebble itself fails to respond within 35 seconds the pod would be killed?

Yes, I think that analysis is correct, though I think very unlikely that Pebble itself would fail to respond in that timeframe.

When we set up that Juju liveness check to have delay=30s, we were thinking something along the lines of "30 seconds startup time ought to be enough for anybody".

amandahla commented 4 weeks ago

We recently faced a situation where Synapse needed more time to reconcile with the database before starting, but it was restarted due to a pebble check. Is there any plan for this issue?

benhoyt commented 4 weeks ago

Hi @amandahla, we can try to prioritise this and squeeze it into our 24.10 work. Does the startup time of Synapse depend on the data size? In other words, is it variable or fixed?

amandahla commented 3 weeks ago

It's variable, so we considered adding an initial delay to be safe.

IronCore864 commented 3 weeks ago

Hi @merkata, thanks very much for the PR! However, we are still in the process of spec reviewing this feature. Please wait for a while. I'll share the spec doc later and keep you posted :)

IronCore864 commented 3 weeks ago

Here's the spec to be reviewed, hopefully, next week: https://docs.google.com/document/d/1_U92ssI-DGAojkMJUULyur23NkjCRu6gSM0dkr7PuGY/edit.

benhoyt commented 2 weeks ago

Per review discussion with Harry, Tony, and Tiexin, this is more complicated than just adding "initial-delay" to the health check, because there's both the K8s probes and the Pebble checks involved. Also, we consider initial-delay to be a bit of an anti-pattern, because it hard-codes the application startup time, and we'd rather not bake that in.

We have a recommendation of what to do here, and we're planning to make a change to Juju to improve a related area:

1) Recommendation: consider the K8s liveness (level=alive check) success to mean "Pebble is alive" rather than "the application is fully alive" (and failure to mean "this container needs to die"). This means that for charms that take a long time to start up, you should not have a level=alive check (if Pebble's running it will report alive to K8s), and instead use an ordinary Pebble check (without a level) in conjunction with on-check-failure: restart. That way Pebble itself has full control over restarting the service in question.

2) The Juju improvement is to replace the probes' initialDelaySeconds=30 with a startup probe (pointing to the same place) with periodSeconds=1, failureThreshold=30, timeoutSeconds=2. If we do this it will have the effect of the liveness and readiness probes running more quickly (but the overall failure time will be the same). This won't solve your issue, but it will help make applications be respond faster (be considered ready by K8s). Harry has said they can probably do this in Juju 3.6. Juju issue

In conclusion, we're not going to implement "initial-delay" configuration and I'll close this issue. Please reach out on the Matrix charm development channel if you'd like help with the recommendation to use on-check-failure.

benhoyt commented 2 weeks ago

Oops, closing as "not planned".

canonical / pebble

Add support for a delay before starting health checks #145