Operator health checks

barkbay commented 2 years ago

This issue is to discuss whether we want to implement and expose custom health checks. They could then be used to define liveness, readiness, or startup probes.

While it might be tempting to use the webhook server, I'm not sure that having the webhook started reflects the status of the operator. Other components, like the cached client for example, are started asynchronously and might be required to consider the operator "healthy" or "ready". Also, using the webhook pollutes the operator logs with messages like the following:

2021/11/09 12:27:49 http: TLS handshake error from 10.124.96.1:52196: EOF

The leader election mechanism should not affect the probes. Once the expected components (webhook and metrics server, cached client, ...) are started and ready, the readiness probe should succeed. The same goes for liveness: even if the instance is not elected leader, that should not mean it is unhealthy. As a side note, it might be a workaround for #5025 if we consider that having a started cached client is a requirement before allowing the elastic-webhook-server service to send traffic to the operator's Pods.
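
To make the idea concrete, here is a minimal standard-library sketch of a readiness endpoint that only succeeds once every expected component reports ready, and that never looks at leadership. The `component` type, `readyzHandler`, and `probe` are hypothetical names invented for this illustration; they are not part of the operator or of controller-runtime, where `AddReadyzCheck` would play the equivalent role.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// component is a hypothetical stand-in for anything the operator starts
// asynchronously (webhook server, metrics server, cached client, ...).
type component struct {
	name  string
	ready atomic.Bool
}

// check returns an error until the component has been marked ready.
func (c *component) check() error {
	if !c.ready.Load() {
		return errors.New(c.name + " not ready")
	}
	return nil
}

// readyzHandler answers 200 only when every component check passes.
// Leadership is deliberately not consulted.
func readyzHandler(components ...*component) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		for _, c := range components {
			if err := c.check(); err != nil {
				http.Error(w, err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprintln(w, "ok")
	}
}

// probe calls the handler in-process, roughly as an HTTP probe would,
// and returns the response status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	webhook := &component{name: "webhook"}
	cache := &component{name: "clientCache"}
	h := readyzHandler(webhook, cache)

	fmt.Println(probe(h)) // 503: nothing has started yet
	webhook.ready.Store(true)
	fmt.Println(probe(h)) // 503: the cache is still missing
	cache.ready.Store(true)
	fmt.Println(probe(h)) // 200: all components are ready
}
```

Because the handler depends only on component state, a non-leader replica would report ready as soon as its own components are up, which matches the behaviour described above.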

However, I'm not sure there is a good way to define what a "healthy" operator is. Maybe adding a test to detect when the leader election has been lost?

The implementation could rely on the existing Manager methods AddReadyzCheck and AddHealthzCheck, for example:

if err := mgr.AddReadyzCheck("clientCacheStarted", func(req *http.Request) error {
    // Wait at most 1 second so that the kubelet gets feedback in a timely fashion.
    ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
    defer cancel()
    if !mgr.GetCache().WaitForCacheSync(ctx) {
        return errors.New("client cache not synced yet")
    }
    return nil
}); err != nil {
    log.Error(err, "Failed to add readiness check controller")
    return err
}
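
The bounded-wait pattern used in that check can be tried in isolation with the standard library only. In the sketch below, `waitForSync`, `newSyncCheck`, and `probeTimeout` are hypothetical names; `waitForSync` merely imitates a blocking call like `WaitForCacheSync` with a channel, and the returned check takes no request argument for brevity, whereas a real checker receives an `*http.Request`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeTimeout is an assumed upper bound on how long one probe may block.
const probeTimeout = 50 * time.Millisecond

// waitForSync is a hypothetical stand-in for a blocking sync call: it
// returns true once synced is closed, or false if ctx ends first.
func waitForSync(ctx context.Context, synced <-chan struct{}) bool {
	select {
	case <-synced:
		return true
	case <-ctx.Done():
		return false
	}
}

// newSyncCheck builds a check that waits at most timeout for the sync,
// so the probe handler always answers quickly.
func newSyncCheck(synced <-chan struct{}, timeout time.Duration) func() error {
	return func() error {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		if !waitForSync(ctx, synced) {
			return errors.New("client cache not synced yet")
		}
		return nil
	}
}

func main() {
	synced := make(chan struct{})
	check := newSyncCheck(synced, probeTimeout)

	fmt.Println(check()) // fails after the timeout: not synced yet
	close(synced)        // simulate the cache finishing its initial sync
	fmt.Println(check()) // succeeds immediately
}
```

One consequence of this design is that each failing probe holds its handler for the full timeout, so the timeout should stay below the probe's own timeout configured on the Pod.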

krichter722 commented 2 months ago

Liveness probes have been discussed in https://github.com/elastic/cloud-on-k8s/issues/2513. I'm not deep enough into the topic to tell if that covers this proposal as well.

barkbay commented 2 months ago

> Liveness probes have been discussed in https://github.com/elastic/cloud-on-k8s/issues/2513. I'm not deep enough into the topic to tell if that covers this proposal as well.

I believe https://github.com/elastic/cloud-on-k8s/issues/2513 is about Elasticsearch and Kibana; this one is about the operator.

elastic / cloud-on-k8s

Operator health checks #5032