Open logicalhan opened 1 year ago
Please check this https://github.com/etcd-io/etcd/issues/13340#issuecomment-963524047
Yeah I don't buy it. No one is going to dig up an obscure github issue in order to properly configure their etcd configurations for Kubernetes.
Yeah, that makes sense. We should rethink this and document it properly, for example which etcd versions it applies to.
As long as you don't touch the existing health endpoint, it's completely backwards compatible and therefore can even be backported.
Thanks @logicalhan for raising this request. I am supportive of it. /health/serializable=<true|false> isn't an explicit API, and it also requires people to understand what serializable means.
/livez and /readyz are more explicit and easier to understand and use.
/livez is similar to (or syntactic sugar for) /health/serializable=true; it just checks the local etcd instance's health status, and it should return true/healthy as long as the local etcd instance is running and healthy. We shouldn't restart the etcd instance when the cluster isn't healthy (e.g. the quorum isn't satisfied), because that will make the situation even worse.
/readyz, by contrast, should require quorum and actually check the health of the cluster. An etcd instance isn't ready to receive traffic until the cluster is healthy. It's similar to (or syntactic sugar for) /health/serializable=false or /health.
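The split described above can be sketched in Go roughly as follows. This is a minimal illustration, not etcd's implementation: the checker type and its localHealthy and hasQuorum fields are hypothetical stand-ins for whatever internal state the server would consult, and the 503 status for a failed check is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// checker is a hypothetical stand-in for the server's internal state.
// Neither field corresponds to a real etcd API.
type checker struct {
	localHealthy func() bool // local member is running and healthy
	hasQuorum    func() bool // cluster currently has quorum
}

// mux wires the two probes to the two kinds of checks.
func (c checker) mux() *http.ServeMux {
	m := http.NewServeMux()
	// /livez consults local state only, so a lost quorum never asks for a restart.
	m.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		respond(w, c.localHealthy())
	})
	// /readyz additionally requires quorum before the member takes traffic.
	m.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		respond(w, c.localHealthy() && c.hasQuorum())
	})
	return m
}

// respond writes 200 "ok" on success and 503 otherwise (assumed status codes).
func respond(w http.ResponseWriter, ok bool) {
	if !ok {
		http.Error(w, "not ok", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// probe issues an in-process GET and returns the HTTP status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	// A healthy member in a cluster that has lost quorum.
	c := checker{
		localHealthy: func() bool { return true },
		hasQuorum:    func() bool { return false },
	}
	h := c.mux()
	fmt.Println("/livez:", probe(h, "/livez"))
	fmt.Println("/readyz:", probe(h, "/readyz"))
}
```

With a healthy member and no quorum, the sketch keeps /livez passing while /readyz fails, which is the behaviour argued for above: stop routing traffic, but do not restart the process.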
cc @neolit123
+1
I don't want to rush into adding livez/readyz probes. The main problem with the existing health probe is that we added it just to have one, without proper consideration.
I want livez to properly reflect the fact that etcd needs a restart, for example when etcd is stuck on stalled storage: https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit?usp=sharing.
Readyz should properly reflect the fact that etcd is ready to serve traffic. I don't think alarms matter here. An alarm is a degradation, but it doesn't mean we shouldn't serve reads.
TL;DR: I would like to have a design written that does a proper analysis of etcd failure modes and proposes matching probes to detect them. Example: https://github.com/kubernetes-sigs/metrics-server/issues/542
Thanks for bringing this up @logicalhan.
I will continue work on this.
Reached out to @wenjiaswe to collaborate on the latest version of the design doc "etcd livez and readyz". The updates address the comments and feedback mentioned in this issue and in the PoC https://github.com/etcd-io/etcd/pull/16008.
/cc @dims
Thank you @chaochn47, could you please use a google doc so we could comment? Thanks!
cc @marukozh who is also working on this.
Done. Anyone in etcd-dev@googlegroups.com should have access to the doc "etcd livez and readyz" and can comment.
The discussions are scattered across various places, so I am raising my comments under this ticket.
A node is live when both below are satisfied:
Basically it shares the same logic as the existing health check (see below), and
Should we differentiate between "local member ready" and "cluster ready"?
Do not break the existing /health endpoint!
Please leave your comments on the document https://docs.google.com/document/d/1PaUAp76j1X92h3jZF47m32oVlR8Y-p-arB5XOB7Nb6U/edit?usp=sharing
Created a k/k issue to track this https://github.com/kubernetes/kubernetes/issues/120970
Is there any plan to add an endpoint live command to etcdctl?
My situation: I run etcd in a Docker Compose container, and Docker Compose's healthcheck command runs inside the container, but etcd's base image doesn't contain curl, so I can't use curl localhost:2379/livez directly to check it. An etcdctl endpoint live command would be useful.
etcdctl uses gRPC only, so we would need to add an equivalent of /livez and /readyz in gRPC.
@scuzhanglei I think we have that filed under #16276, just needs a cmdline in etcdctl
since we're bumping that thread again already, is there anything left to pick up? I've just "saved" one PR #16959 from @siyuanfoundation from being stale reaped, I think #16858 is also going to fall prey to the evil bot soon.
I have tried to add the commands before https://github.com/etcd-io/etcd/commit/293f087fb987421348dd9a4173087bb5f08cb850#diff-ab6fb0684315e16355f6ebe0f4b3cf860b9b2ff5a0fe1b4e4308a680b19f1b0c. Currently I don't have time to rebase it to the most recent implementation of livez/readyz.
Hope someone can pick it up.
Hey @siyuanfoundation, I will pick this issue up!
What would you like to be added?
We currently have a single health endpoint for etcd, /health, which is used in Kubernetes distros for both liveness and readiness checking. In order to be fully api-compliant, we should have both a liveness check (i.e. /livez), which checks that this individual etcd member is "alive" and does not need to be restarted, and a readiness check (i.e. /readyz), which signals that the etcd member is ready to accept traffic.
Why is this needed?
There is a difference between "please restart me I'm that unhealthy" vs "please send me all sorts of traffic, I'm ready for it".