Open barkbay opened 2 years ago
Liveness probes have been discussed in https://github.com/elastic/cloud-on-k8s/issues/2513. I'm not deep enough into the topic to tell if that covers this proposal as well.
I believe https://github.com/elastic/cloud-on-k8s/issues/2513 is about Elasticsearch and Kibana; this one is about the operator.
This issue is to discuss whether we want to implement and expose custom health checks. They could then be used to define liveness, readiness or startup probes.
While it might be tempting to use the webhook server, I'm not sure that having the webhook started reflects the status of the operator. Other components, like the cached client for example, are started asynchronously and might be required to consider the operator "healthy" or "ready". Also, using the webhook pollutes the operator logs with the following messages:
2021/11/09 12:27:49 http: TLS handshake error from 10.124.96.1:52196: EOF
The leader election mechanism should not affect the probes. Once the expected components (webhook and metrics server, cached client ...) are started and ready, the readiness probe should succeed. The same goes for liveness: an instance that is not elected should not be considered unhealthy for that reason. As a side note, it might be a workaround for #5025 if we consider that having a started cached client is a requirement before allowing the `elastic-webhook-server` service to send traffic to the operator's Pods.
I'm not sure, however, that there is a good way to define what a "healthy" operator is. Maybe adding a test to detect when the leader election has been lost?
The implementation could rely on the existing `Manager` receivers `AddReadyzCheck` and `AddHealthzCheck`, for example:
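A minimal sketch of what such a readiness check could look like, using only the Go standard library so it runs on its own. The `Checker` type mirrors the shape of controller-runtime's `healthz.Checker` (`func(req *http.Request) error`). The `componentState` struct, its fields and the `handler` and `probe` helpers are hypothetical names introduced for illustration; they are not part of the operator or of controller-runtime. The check ignores leader election on purpose and only looks at whether the expected components have started.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Checker has the same shape as controller-runtime's healthz.Checker.
type Checker func(req *http.Request) error

// componentState is a hypothetical holder for the startup status of the
// components the operator needs before it can be considered ready.
type componentState struct {
	cacheSynced    atomic.Bool // set once the cached client has started
	webhookServing atomic.Bool // set once the webhook server has started
}

// readyCheck fails until every tracked component has started.
// Leader election is deliberately not part of the check.
func (s *componentState) readyCheck(_ *http.Request) error {
	if !s.cacheSynced.Load() {
		return errors.New("cached client not started")
	}
	if !s.webhookServing.Load() {
		return errors.New("webhook server not started")
	}
	return nil
}

// handler adapts a Checker to an http.Handler:
// 200 when the check passes, 503 otherwise.
func handler(check Checker) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if err := check(r); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	})
}

// probe sends one in-memory GET request to the handler, the way a kubelet
// HTTP probe would, and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	state := &componentState{}
	h := handler(state.readyCheck)

	fmt.Println(probe(h)) // 503: nothing has started yet

	state.cacheSynced.Store(true)
	state.webhookServing.Store(true)
	fmt.Println(probe(h)) // 200: all tracked components have started

	// With controller-runtime, the same function could be registered as:
	//   mgr.AddReadyzCheck("components", state.readyCheck)
	//   mgr.AddHealthzCheck("ping", healthz.Ping)
}
```

With controller-runtime the manager serves these checks itself, provided a health probe bind address is configured in the manager options, so the `handler` adapter above would not be needed. Which components belong in the readiness check, and whether liveness should be more than a simple ping, is exactly what remains open in this issue.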