elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Readiness check #50187

Open jpountz opened 4 years ago

jpountz commented 4 years ago

It is a common need to check whether a node has started up. Common approaches today include calling existing APIs, like GET /, and checking the HTTP response code. However, @jasontedor noted that since none of our APIs is designed with this goal in mind, we might break this use-case inadvertently, which wouldn't be the case if we had a dedicated API.
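For illustration, the "call an existing API and check the status code" approach sketched above might look like the following. This is a hypothetical sketch, not an official client: the endpoint and port are assumptions, and a stub HTTP server stands in for Elasticsearch so the example is self-contained.

```python
# Sketch of the common approach: treat any HTTP 200 from GET / as "node is up".
# A local stub server stands in for Elasticsearch (assumption, for demo only).
import http.server
import threading
import urllib.error
import urllib.request

def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <url> answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

class _Stub(http.server.BaseHTTPRequestHandler):
    """Minimal stand-in for GET / on an Elasticsearch node."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'{"tagline": "You Know, for Search"}')
    def log_message(self, *args):
        pass  # silence request logging

server = http.server.HTTPServer(("127.0.0.1", 0), _Stub)
threading.Thread(target=server.serve_forever, daemon=True).start()
ready = is_ready(f"http://127.0.0.1:{server.server_port}/")
print(ready)  # True
server.shutdown()
```

As the issue points out, nothing guarantees GET / keeps these semantics, which is exactly why a dedicated readiness API is being discussed.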

We'd need to first understand what exact semantics are needed for a readiness check. For instance do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

Then should we make it its own dedicated API, or should we recommend using an existing API for this, like GET /? If the latter, then we will need to document it.

elasticmachine commented 4 years ago

Pinging @elastic/es-core-infra (:Core/Infra/Core)

dakrone commented 4 years ago

> should we recommend using an existing API for this, like GET /?

My preference would be a separate API. I think it's a feature that GET / is the lightest-weight API we have, and I've used it in the past as a way to tell whether a node is really in trouble (i.e., does it respond to GET / okay).

rjernst commented 4 years ago

> do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

I think we need to have connected to/formed a cluster, and also async initialization like security/watcher services need to have completed once they get cluster state.

Additionally, IMO there is no need for an API. We can connect to and form a cluster with only the transport port bound, and then only bind to http once this is complete. This would allow users/tests to wait on the http port being bound. I started experimenting with this last year to improve how integ tests wait for ES to be ready, and would be happy to pick this back up. Last I remember, there were some issues in security, but I think those may now be solved with transport client being removed from master.
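The "wait on the HTTP port being bound" idea above could be sketched as a simple poll loop. This is an illustrative sketch, not the actual test harness: the host, port, and timeouts are assumptions, and a plain TCP listener stands in for the node's HTTP port.

```python
# Sketch of "readiness = the HTTP port is bound": poll until a TCP connect
# succeeds. Any TCP listener works for the demo (assumption, for demo only).
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connect to (host, port) succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.1)
    return False

# Demo: bind a listener, then wait for it as a test harness would.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
ok = wait_for_port("127.0.0.1", listener.getsockname()[1])
print(ok)  # True
listener.close()
```

The appeal of this design is that the check needs no API at all; the trade-off, raised in the next comment, is the loss of diagnostic output while the port is still unbound.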

ywelsch commented 4 years ago

> I think we need to have connected to/formed a cluster, and also async initialization like security/watcher services need to have completed once they get cluster state.

I wonder how the gateway settings such as gateway.expected_data_nodes should be treated (i.e. whether the cluster should be treated as ready / not ready in that case).

> Additionally, IMO there is no need for an API. We can connect to and form a cluster with only the transport port bound, and then only bind to http once this is complete.

The main issue I see with this is that you can't get any diagnostic output from the cluster through the APIs as to why the cluster is not forming or why some of the components did not initialize.

jasontedor commented 4 years ago

In the context of Kubernetes, we need to distinguish a liveness check from a readiness check.

A liveness check is used to determine when to restart a container because it's sick.

A readiness check is used to determine when a container is ready to receive requests.

The distinction is important because we don't necessarily want to restart a container that is merely temporarily partitioned from the cluster. A failing liveness check triggers a restart of the container; a failing readiness check stops traffic from being routed to it.
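In Kubernetes terms, the two checks above map to two distinct probe stanzas on the container. The sketch below is illustrative only: the port is the Elasticsearch default, the timings are arbitrary, and GET / merely stands in for whichever endpoints end up being recommended (a dedicated readiness path is precisely what this issue proposes).

```yaml
# Hypothetical probe configuration; paths and timings are assumptions.
livenessProbe:        # failing -> kubelet restarts the container
  httpGet:
    path: /           # stand-in endpoint
    port: 9200
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:       # failing -> Pod is removed from Service endpoints
  httpGet:
    path: /           # stand-in; a dedicated readiness API does not exist yet
    port: 9200
  periodSeconds: 5
```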

In this context, I don't think we need to think about

> do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

rather, we should have dedicated checks for each use case.

In the context of a Docker HEALTHCHECK, it seems that, using the above language, a liveness check is more appropriate (to restart unhealthy containers). In the case of a docker-compose healthcheck, readiness seems the more appropriate check, as otherwise the check cannot be relied upon to indicate when the service is ready for dependent services to start up (e.g., Kibana).
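The docker-compose case above could be sketched as follows. This is a hedged illustration, not an official recipe: the image tags, curl command, and intervals are assumptions, and again GET / stands in for a future dedicated endpoint.

```yaml
# Hypothetical compose file; healthcheck command and timings are assumptions.
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.0
    healthcheck:
      # Treat HTTP 200 from GET / as "ready" (stand-in endpoint)
      test: ["CMD-SHELL", "curl -sf http://localhost:9200/ || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 12
  kibana:
    image: docker.elastic.co/kibana/kibana:7.5.0
    depends_on:
      elasticsearch:
        condition: service_healthy   # Kibana waits for the check to pass
```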

sebgl commented 4 years ago

A related ECK user question: https://discuss.elastic.co/t/does-elastic-have-a-healthcheck-endpoint-that-does-not-require-username-and-password/217090

The default AWS EKS LoadBalancer implementation has its own healthcheck logic. It can be configured with an HTTP port and path, but not with authentication details. I would have expected it to rely on Pod readiness instead, as is usually the case with Kubernetes Services.

This also holds true for GCP's default Ingress.

We may want to simplify how a user can set up unauthenticated access to the healthcheck endpoint?