Livez/Readyz - GitHub Issues

logicalhan commented 1 year ago

What would you like to be added?

We currently have a single health endpoint for etcd, /health, which Kubernetes distros use for both liveness and readiness checking. To be fully API-compliant, we should have both a liveness check (i.e. /livez), which checks that this individual etcd member is "alive" and does not need to be restarted, and a readiness check (i.e. /readyz), which signals that the etcd member is ready to accept traffic.

Why is this needed?

There is a difference between "please restart me I'm that unhealthy" vs "please send me all sorts of traffic, I'm ready for it".

chaochn47 commented 1 year ago

Please check this https://github.com/etcd-io/etcd/issues/13340#issuecomment-963524047

logicalhan commented 1 year ago

> Please check this #13340 (comment)

Yeah I don't buy it. No one is going to dig up an obscure github issue in order to properly configure their etcd configurations for Kubernetes.

chaochn47 commented 1 year ago

Yeah, that makes sense. We should rethink and document it properly, for example which etcd versions it applies to.

logicalhan commented 1 year ago

> Yeah, that makes sense. We should rethink and document it properly, for example which etcd versions it applies to.

As long as you don't touch the existing health endpoint, it's completely backwards compatible and therefore can even be backported.

logicalhan commented 1 year ago

for ref: https://github.com/etcd-io/etcd/pull/16008

ahrtr commented 1 year ago

Thanks @logicalhan for raising this request. I am supportive of it. /health?serializable=<true|false> isn't an explicit API, and it also requires people to understand what serializable means.

/livez and /readyz are more explicit and easier to understand and use.

/livez is similar to (or syntactic sugar for) /health?serializable=true. It just checks the local etcd instance's health status, and it should return true/healthy as long as the local etcd instance is running and healthy. We shouldn't restart the etcd instance when the cluster isn't healthy (e.g. when quorum isn't satisfied), because that will make the situation even worse.

/readyz, by contrast, should require quorum and actually check the health of the cluster. An etcd instance isn't ready to receive traffic until the cluster is healthy. It's similar to (or syntactic sugar for) /health?serializable=false or plain /health.
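The distinction described above can be sketched in Go roughly as follows. This is only an illustration of the proposed semantics, not etcd's implementation: the memberState type, its fields, and both functions are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"net/http"
)

// memberState is a hypothetical snapshot of what a probe would inspect.
type memberState struct {
	localHealthy bool // the local instance can answer a serializable (local) read
	hasQuorum    bool // the cluster can serve a quorum (linearizable) read
}

// livezStatus ignores quorum on purpose: restarting a healthy member
// because the cluster lost quorum would only make things worse.
func livezStatus(s memberState) int {
	if s.localHealthy {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// readyzStatus additionally requires quorum before traffic is accepted.
func readyzStatus(s memberState) int {
	if s.localHealthy && s.hasQuorum {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

func main() {
	// A healthy member in a cluster that has lost quorum:
	// alive (do not restart), but not ready (do not send traffic).
	noQuorum := memberState{localHealthy: true, hasQuorum: false}
	fmt.Println("livez:", livezStatus(noQuorum), "readyz:", readyzStatus(noQuorum))
}
```

With a healthy local member and no quorum, the sketch reports the member as live but not ready, which is the case where the two probes are meant to differ.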

ahrtr commented 1 year ago

cc @neolit123

neolit123 commented 1 year ago

+1

serathius commented 1 year ago

I don't want to rush into adding livez/readyz probes. The main problem with the existing health probe is that we added it just to have one, without proper consideration.

I want livez to properly reflect the fact that etcd needs a restart, for example when etcd is stuck on stalled storage: https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit?usp=sharing.

Readyz should properly reflect the fact that etcd is ready to serve traffic. I don't think alarms matter here. An alarm is a degradation, but it doesn't mean we shouldn't serve reads.

TL;DR: I would like to have a design written that does a proper analysis of etcd failure modes and proposes matching probes to detect them. Example: https://github.com/kubernetes-sigs/metrics-server/issues/542

wenjiaswe commented 1 year ago

Thanks for bringing this up @logicalhan.

I will continue work on this.

ahrtr commented 1 year ago

Link to https://github.com/etcd-io/etcd/pull/15440

chaochn47 commented 11 months ago

Reached out to @wenjiaswe to collaborate on the latest version of the design doc, "etcd livez and readyz". The updates resolve the comments and feedback mentioned in this issue and in the PoC https://github.com/etcd-io/etcd/pull/16008.

/cc @dims

wenjiaswe commented 11 months ago

Thank you @chaochn47, could you please use a google doc so we could comment? Thanks!

wenjiaswe commented 11 months ago

cc @marukozh who is also working on this.

chaochn47 commented 11 months ago

> Thank you @chaochn47, could you please use a google doc so we could comment? Thanks!

Done. Anyone in etcd-dev@googlegroups.com should have access to the doc, "etcd livez and readyz", and can comment.

ahrtr commented 11 months ago

The discussion is scattered across several places, so I am raising my comments under this ticket.

liveness probe

A node is live when both of the following are satisfied:

1. It can serve serializable requests. Note that while a node is defragmenting it can't serve any requests; in that case we should NOT consider the node not live because of this rule. See https://github.com/etcd-io/etcd/pull/16278
2. The raft loop isn't blocked. There is one exception: in a one-node cluster, the raft loop will indeed be blocked if there are no client requests. One possible way to resolve this is to intentionally trigger events periodically for the liveness check.
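The two liveness rules above, including the defragmentation exception, could be expressed roughly like this. It is a minimal sketch under stated assumptions: livenessInput and isLive are hypothetical names, the inputs are assumed to be collected elsewhere, and the one-node periodic-event workaround is not modeled.

```go
package main

import "fmt"

// livenessInput is a hypothetical summary of what the liveness probe observed.
type livenessInput struct {
	serializableReadOK bool // a local (serializable) read succeeded
	defragInProgress   bool // the node is currently defragmenting
	raftLoopBlocked    bool // the raft loop made no progress
}

// isLive applies both rules from the comment above.
func isLive(in livenessInput) bool {
	// Rule 1: a failed serializable read counts against liveness
	// only when the node is not in the middle of defragmentation.
	readOK := in.serializableReadOK || in.defragInProgress
	// Rule 2: the raft loop must not be blocked.
	return readOK && !in.raftLoopBlocked
}

func main() {
	defragging := livenessInput{serializableReadOK: false, defragInProgress: true}
	stuck := livenessInput{serializableReadOK: true, raftLoopBlocked: true}
	fmt.Println("defragmenting:", isLive(defragging))
	fmt.Println("raft loop blocked:", isLive(stuck))
}
```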

readiness probe

Basically it shares the same logic as the existing health check (see the link below), with two points to settle:

- A node should be considered ready even if the NOSPACE alarm is activated.
- Should we differentiate between the local member being ready and the cluster being ready?

https://github.com/etcd-io/etcd/blob/0a3dc1a8a8b6d06368ad13f9a8f13c038e7ca362/server/etcdserver/api/etcdhttp/health.go#L47-L57
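One way to read the NOSPACE point is sketched below. Everything here is an assumption for illustration: readinessInput and isReady are hypothetical names, and the inputs (leader presence, a quorum read, a list of active alarm names) are a guess at what a readiness check would consult rather than a copy of the linked code. The comment above only addresses NOSPACE, so treating every other alarm as not ready is also an assumption.

```go
package main

import "fmt"

// readinessInput is a hypothetical summary of what the readiness probe observed.
type readinessInput struct {
	hasLeader          bool     // the member currently knows a leader
	linearizableReadOK bool     // a quorum read succeeded
	activeAlarms       []string // names of active alarms, e.g. "NOSPACE"
}

// isReady tolerates the NOSPACE alarm but, in this sketch, no other alarm.
func isReady(in readinessInput) bool {
	for _, alarm := range in.activeAlarms {
		if alarm != "NOSPACE" {
			return false
		}
	}
	return in.hasLeader && in.linearizableReadOK
}

func main() {
	full := readinessInput{hasLeader: true, linearizableReadOK: true, activeAlarms: []string{"NOSPACE"}}
	fmt.Println("ready with NOSPACE:", isReady(full))
}
```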

Compatibility

Do not break the existing /health endpoint!

serathius commented 11 months ago

Please leave your comments on the document https://docs.google.com/document/d/1PaUAp76j1X92h3jZF47m32oVlR8Y-p-arB5XOB7Nb6U/edit?usp=sharing

siyuanfoundation commented 11 months ago

Created a k/k issue to track this https://github.com/kubernetes/kubernetes/issues/120970

siyuanfoundation commented 11 months ago

Tracking work

[x] add livez/readyz endpoint with basic structure for checkers
[ ] add checker for defrag
[ ] add checker for readIndex
[ ] add checker for local file read
[ ] raise an issue and fix the existing health probe not checking defrag
[x] raise an issue and fix the existing health probe not respecting the context from the HTTP request
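For readers wondering what a "basic structure for checkers" might look like, here is a minimal sketch. It is not the structure that was merged: the check type, runChecks, errFailed, and the check names (borrowed from the list above) are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// check is a hypothetical named health check; a nil error means it passed.
type check func() error

// errFailed is a placeholder error used by the demo checks below.
var errFailed = errors.New("check failed")

// runChecks runs every registered check and returns the sorted names
// of the ones that failed. An empty result means the probe passes.
func runChecks(checks map[string]check) []string {
	var failed []string
	for name, c := range checks {
		if err := c(); err != nil {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return failed
}

func main() {
	// Each endpoint would own its own set of checks.
	livez := map[string]check{
		"localFileRead": func() error { return nil },
	}
	readyz := map[string]check{
		"localFileRead": func() error { return nil },
		"readIndex":     func() error { return errFailed },
	}
	fmt.Println("livez failed:", runChecks(livez))
	fmt.Println("readyz failed:", runChecks(readyz))
}
```

Keeping the checks in a per-endpoint map means a new checker, such as the defrag one, can be added to one endpoint without touching the other.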

scuzhanglei commented 4 months ago

Is there any plan to add an endpoint live command to etcdctl? My situation: I run etcd in a Docker Compose container, and Docker Compose's healthcheck command runs inside the container. etcd's base image doesn't contain curl, so I can't use curl localhost:2379/livez directly to check it. An etcdctl endpoint live command would be useful.
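Until such a command exists, one possible workaround is a tiny static Go binary copied into the image and used as the healthcheck command. The sketch below is an assumption-laden illustration, not an official tool: probe and codeFor are hypothetical names, it assumes the endpoint is reachable over plain HTTP at the URL from the comment above (a TLS-enabled listener would need certificates configured), and it treats any status other than 200 as unhealthy.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// probe sends one GET request and returns the HTTP status code.
func probe(url string, timeout time.Duration) (int, error) {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// codeFor maps a probe result to a process exit code:
// 0 means healthy, 1 means unhealthy or unreachable.
func codeFor(status int, err error) int {
	if err != nil || status != http.StatusOK {
		return 1
	}
	return 0
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe <url>, e.g. probe http://localhost:2379/livez")
		return
	}
	os.Exit(codeFor(probe(os.Args[1], 2*time.Second)))
}
```

The healthcheck command would then call the binary with the URL as its only argument, and the container runtime would read the exit code.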

serathius commented 4 months ago

etcdctl uses gRPC only, so we would need to add equivalents of /livez and /readyz in gRPC.

tjungblu commented 4 months ago

@scuzhanglei I think we have that filed under #16276, just needs a cmdline in etcdctl

since we're bumping that thread again already, is there anything left to pick up? I've just "saved" one PR #16959 from @siyuanfoundation from being stale reaped, I think #16858 is also going to fall prey to the evil bot soon.

siyuanfoundation commented 4 months ago

> Is there any plan to add an endpoint live command to etcdctl? My situation: I run etcd in a Docker Compose container, and Docker Compose's healthcheck command runs inside the container. etcd's base image doesn't contain curl, so I can't use curl localhost:2379/livez directly to check it. An etcdctl endpoint live command would be useful.

I tried to add the commands before: https://github.com/etcd-io/etcd/commit/293f087fb987421348dd9a4173087bb5f08cb850#diff-ab6fb0684315e16355f6ebe0f4b3cf860b9b2ff5a0fe1b4e4308a680b19f1b0c. Currently I don't have time to rebase it onto the most recent implementation of livez/readyz.

Hope someone can pick it up.

henrybear327 commented 4 months ago

> > Is there any plan to add an endpoint live command to etcdctl? My situation: I run etcd in a Docker Compose container, and Docker Compose's healthcheck command runs inside the container. etcd's base image doesn't contain curl, so I can't use curl localhost:2379/livez directly to check it. An etcdctl endpoint live command would be useful.
>
> I tried to add the commands before: 293f087#diff-ab6fb0684315e16355f6ebe0f4b3cf860b9b2ff5a0fe1b4e4308a680b19f1b0c. Currently I don't have time to rebase it onto the most recent implementation of livez/readyz.
>
> Hope someone can pick it up.

Hey @siyuanfoundation, I will pick this issue up!

etcd-io / etcd

Livez/Readyz #16007
