serathius closed this issue 1 year ago.
There are two issues we want to address related to etcd endpoint setup:

- Watch starvation, caused by running the grpc server under the http server, which is affected by https://github.com/golang/go/issues/58804.
- Serving the grpc server under the http server is an experimental API with different performance and features. It should not be used in production.

For the watch starvation we can:

- [v3.4+] Change the default write scheduler from Priority to Random, which in my testing improves resilience to starvation (not a full immunity).
- [v3.4+] Add flag --listen-client-http-urls that exposes the http server on a separate endpoint and disables non-grpc handlers on --listen-client-urls. This way we expose a pure grpc server, fully mitigating the issue, but it requires user action.

To avoid running grpc under the http server we can:

- Add the --listen-client-http-urls flag. Complicates configuration, but no worries about production use.

Implement changes and execute backports in the order below:

1. [v3.4+] https://github.com/etcd-io/etcd/pull/15452.
2. [v3.4+] https://github.com/etcd-io/etcd/pull/15446 by passing --listen-client-http-urls.
3. [v3.6+] https://github.com/etcd-io/etcd/pull/15510 by @aojea.
4. [v3.4+] When ready for production use, change to the new frame scheduler round-robin algorithm.

List of etcd http paths:
- /debug/vars - Access to etcd config. Available only via --listen-client-urls.
- /config/local/log - Remote configuration of global log level. Removed in v3.5. Available only via --listen-client-urls.
- /version - Version endpoint. Available only via --listen-client-urls.
- /metrics - Metrics endpoint. Available via both --listen-client-urls and --listen-metrics-urls.
- /health - Health endpoint (same as etcdctl endpoint health). Available only via --listen-client-urls.
- /v2/* - etcd V2 API. Available only via --listen-client-urls (removed in v3.6).
- /v3/* - grpc gateway for the V3 API. Allows calling the V3 API using http and JSON. Available only via --listen-client-urls.
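For illustration, here is a minimal Go probe of a few of the plain-http paths above; the base URL and scheme are assumptions, and it should be pointed at whichever listener serves the http handlers in a given deployment (--listen-client-urls today, or --listen-client-http-urls once the split is in place):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Base URL is an assumption; adjust scheme/port to your setup.
	base := "http://127.0.0.1:2379"
	for _, path := range []string{"/version", "/health", "/metrics"} {
		resp, err := http.Get(base + path)
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		// Read only the beginning of the body, enough to see what answered.
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 200))
		resp.Body.Close()
		fmt.Printf("%s -> %d %q\n", path, resp.StatusCode, body)
	}
}
```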
EDIT 1: based on @ptabor feedback, I removed the --client-no-http flag and proposed backporting --listen-client-http-urls to avoid deprecation.
EDIT 2: Updated to use @aojea's custom connection multiplexer as the v3.6+ long-term solution.
cc @ptabor @ahrtr @spzala @mitake
cc @wojtek-t @liggitt
> [v3.4+] Change the default write scheduler from Priority to Random, which in my testing improves resilience to starvation (not a full immunity).

Hi @serathius, is it possible to maintain a round-robin write scheduler? I think it would be good for metrics, since the latency is more predictable given the cap on max concurrent streams. The write scheduler runs without a lock, so maybe it could be easier~.
> Hi @serathius, is it possible to maintain a round-robin write scheduler?

Prefer to avoid backporting untested code, especially if it would be written by me :P. I would wait for grpc experts to implement and test it.
LGTM. I would also consider backporting --listen-client-http-urls; this will proactively drive users to prepare for the v3.6 version.

Discussed @fuweid's idea with @mborsz. Implementing it based on the Random scheduler should be much simpler than fixing https://github.com/golang/go/issues/58804, which would require modifying the priority scheduler. I will try to implement round-robin scheduling. Thanks for the suggestion @fuweid.
> [v3.4+] Add flag --listen-client-http-urls that exposes the http server on a separate endpoint and disables non-grpc handlers on --listen-client-urls. This way we expose a pure grpc server, fully mitigating the issue, but it requires user action.

@serathius thanks!!!
A quick look into the implementation of the random write scheduler shows that it isn't so random :P. It prioritizes writing control frames, which is also done by the priority scheduler. Unfortunately the method to check whether a frame is a control frame is not public: https://go.googlesource.com/net/+/master/http2/writesched_random.go#48 (thanks Go).

That means a custom scheduler implementation will already be subpar. I can implement a scheduler that treats all frames the same, however I'm not an expert in http2, so it's hard to guess what it could break.

EDIT: Or we can guess what it breaks: https://go.googlesource.com/net/+/04296fa82e83b85317bd93ad50dd00460d6d7940%5E%21/http2/writesched_random.go. Nice to see familiar people.
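For context, a rough sketch (not etcd's actual wiring; the address and certificate paths are placeholders) of what switching an x/net/http2 server from the default Priority scheduler to the Random one looks like:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	srv := &http.Server{Addr: ":2379"} // handler and TLS config elided

	// Swap the frame write scheduler used by the x/net/http2 server.
	// NewRandomWriteScheduler spreads writes across streams instead of
	// strictly following the priority tree, which reduces (but does not
	// eliminate) watch starvation under heavy range traffic.
	h2 := &http2.Server{
		NewWriteScheduler: func() http2.WriteScheduler {
			return http2.NewRandomWriteScheduler()
		},
	}
	if err := http2.ConfigureServer(srv, h2); err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key"))
}
```

The same NewWriteScheduler hook is where a custom scheduler would plug in.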
@aojea would love to get your thoughts on this issue.
IIUIC the problem here is with DATA frames; those are the ones that can cause starvation.

Control frames do not carry much data; they are used for signaling, and per the RFC they MUST NOT be considered for flow control (https://www.rfc-editor.org/rfc/rfc7540):

> The frame type determines whether flow control applies to a frame. Of the frames specified in this document, only DATA frames are subject to flow control; all other frame types do not consume space in the advertised flow-control window. This ensures that important control frames are not blocked by flow control.

One interesting thing: if you were using the http2 implementation from golang.org/x/net instead of the one in the standard library, you probably were using the Random scheduler until last year: https://go-review.googlesource.com/c/net/+/382015

Based on this comment from Brad Fitzpatrick, https://go-review.googlesource.com/c/net/+/382015/comments/16b761d4_4c27bf6f, and given that the problem here is DATA frames, I'm +1 on the Random scheduler if it demonstrably solves the problem.
> It prioritizes writing control frames.

The most frequently enqueued control frame is the connection-level window update. It is filed on each request body read, but x/net/http2 only files a connection-level window update once the pending size is >= 4k. Since the etcd server uses x/net/http2, the number of control frames won't be too high.

> Unfortunately method to check if frame is a control frame is not public

Yeah, it is painful. If isControl could be exported, we could implement a FIFO scheduler.

In #15446 etcd is going to use grpc-go's underlying http2, and it seems grpc-go has some enhancements for window updates, like BDP estimation: https://github.com/grpc/grpc-go/issues/2400.

Just my two cents: if a FIFO scheduler (just like grpc-go's http2) turns out better than Random (we need test results), it's worth making isControl exported.
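To make the isControl discussion concrete, here is a hypothetical, bare-bones FIFO scheduler written against the exported x/net/http2 WriteScheduler interface; it compiles, but it cannot special-case control frames (the gap discussed above) and it ignores the flow-control splitting that the built-in schedulers perform via FrameWriteRequest.Consume:

```go
package main

import "golang.org/x/net/http2"

// fifoScheduler is a hypothetical, minimal write scheduler: it pops frames
// strictly in the order they were pushed and treats control and data frames
// alike. Without an exported isControl helper, a third-party scheduler
// cannot reproduce the control-frame prioritization of the built-in ones.
type fifoScheduler struct {
	queue []http2.FrameWriteRequest
}

func newFIFOScheduler() http2.WriteScheduler { return &fifoScheduler{} }

func (s *fifoScheduler) OpenStream(streamID uint32, options http2.OpenStreamOptions) {}
func (s *fifoScheduler) CloseStream(streamID uint32)                                 {}
func (s *fifoScheduler) AdjustStream(streamID uint32, priority http2.PriorityParam)  {}

// Push enqueues the frame at the tail of the single FIFO queue.
func (s *fifoScheduler) Push(wr http2.FrameWriteRequest) {
	s.queue = append(s.queue, wr)
}

// Pop returns the oldest queued frame, regardless of stream or frame type.
func (s *fifoScheduler) Pop() (http2.FrameWriteRequest, bool) {
	if len(s.queue) == 0 {
		return http2.FrameWriteRequest{}, false
	}
	wr := s.queue[0]
	s.queue = s.queue[1:]
	return wr, true
}
```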
I'll defer to others with more context about whether splitting http and grpc listeners is necessary, and will just comment on the proposed mechanism for splitting.

Making --listen-client-urls mean different things for different versions is very confusing. If I understand the proposal:

- --listen-client-urls means http+grpc prior to 3.6 if no --listen-client-http-urls is specified
- --listen-client-urls means grpc-only prior to 3.6 if --listen-client-http-urls is specified, or unconditionally in 3.6+

Do all existing clients currently expect to speak grpc over http/2? Will there be any client-visible change or impact of splitting grpc and http? Will existing clients need to change what they specify in their config to target both the http and grpc endpoints?
> Do all existing clients currently expect to speak grpc over http/2?

The whole V3 etcd API is available over grpc. From a pure API standpoint there is no impact.

> Will there be any client-visible change or impact of splitting grpc and http?

As mentioned above: /debug/vars, /config/local/log, /version, /metrics, /health, the v2 API (removed in v3.6) and the grpc gateway for the v3 API (grpc over websocket & JSON).
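As an illustration of the grpc gateway path listed above, a plain http+JSON Range call looks roughly like this (the endpoint is an assumption; bytes fields such as key are base64-encoded in the gateway's JSON mapping):

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// The gateway's JSON mapping base64-encodes bytes fields such as key.
	key := base64.StdEncoding.EncodeToString([]byte("foo"))
	body := bytes.NewBufferString(fmt.Sprintf(`{"key": %q}`, key))

	// Endpoint URL is an assumption; /v3/kv/range is the grpc-gateway
	// path for the KV Range RPC in v3.4+.
	resp, err := http.Post("http://127.0.0.1:2379/v3/kv/range", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```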
> Will existing clients need to change what they specify in their config to target both the http and grpc endpoints?

Depends on the client:

- Metrics scraping can already use --listen-metrics-urls.
- Health checks that hit the /health endpoint. However, as mentioned, we expect most users (including K8s) to be using etcdctl endpoint health; a grpc-based probe in that spirit is sketched below.
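For reference, a sketch of a grpc-only health probe using the maintenance Status RPC; the endpoint and the v3.5 client import path are assumptions, and this approximates rather than reproduces etcdctl endpoint health, which issues a small range request:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is an assumption; TLS configuration is omitted for brevity.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// A grpc-only liveness signal via the maintenance Status RPC, staying
	// entirely on the grpc listener (no /health http path involved).
	status, err := cli.Status(ctx, cli.Endpoints()[0])
	if err != nil {
		log.Fatalf("endpoint unhealthy: %v", err)
	}
	fmt.Printf("healthy, leader=%x, raftTerm=%d\n", status.Leader, status.RaftTerm)
}
```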
Working with @aojea on replacing cmux; will propose an update to the proposal soon. Our main concern is backporting new untested code.
Updated the proposal in https://github.com/etcd-io/etcd/issues/15402#issuecomment-1460061395 PTAL @ptabor @ahrtr @liggitt @aojea
tl;dr: We will still backport running http and grpc on separate ports, but instead of making it the default, we will use @aojea's customized connection multiplexer for etcd, thus avoiding a breaking change.
If I understand it correctly, the following proposal
Implement changes and execute backports in the order below:
1. [v3.4+] https://github.com/etcd-io/etcd/pull/15452.
2. [v3.4+] https://github.com/etcd-io/etcd/pull/15446 by passing --listen-client-http-urls.
3. [v3.6+] https://github.com/etcd-io/etcd/pull/15510 by @aojea.
4. [v3.4+] When ready for production use, change to new [frame scheduler round-robin algorithm](https://go-review.git.corp.google.com/c/net/+/478735).
Should be updated as below to avoid confusion & misunderstanding,
Implement changes and execute backports in the order below:
1. [v3.4+] https://github.com/etcd-io/etcd/pull/15452.
2. [Only v3.4 & v3.5] https://github.com/etcd-io/etcd/pull/15446 by passing --listen-client-http-urls.
3. [v3.6+] https://github.com/etcd-io/etcd/pull/15510 by @aojea.
4. [v3.6+] When ready for production use, change to new [frame scheduler round-robin algorithm](https://go-review.git.corp.google.com/c/net/+/478735).
It seems that the link frame scheduler round-robin algorithm isn't accessible outside google?
Please clearly state the impact on K8s. My understanding is that there is no impact if K8s uses etcdctl instead of /health to perform the health check in the Probes. cc @neolit123
I am not sure whether K8s uses other http endpoints. cc @liggitt
> [v3.4+] When ready for production use, change to new frame scheduler round-robin algorithm.

Why would we update the stable releases 3.4 & 3.5 again in the future? No matter which way we follow, if there is no issue, let's keep them as they are.
Clarification

> If I understand it correctly, the following proposal
> Implement changes and execute backports in the order below: 1. [v3.4+] https://github.com/etcd-io/etcd/pull/15452. 2. [v3.4+] https://github.com/etcd-io/etcd/pull/15446 by passing --listen-client-http-urls. 3. [v3.6+] https://github.com/etcd-io/etcd/pull/15510 by @aojea. 4. [v3.4+] When ready for production use, change to new [frame scheduler round-robin algorithm](https://go-review.git.corp.google.com/c/net/+/478735).
> Should be updated as below to avoid confusion & misunderstanding,
> Implement changes and execute backports in the order below: 1. [v3.4+] https://github.com/etcd-io/etcd/pull/15452. 2. [Only v3.4 & v3.5] https://github.com/etcd-io/etcd/pull/15446 by passing --listen-client-http-urls. 3. [v3.6+] https://github.com/etcd-io/etcd/pull/15510 by @aojea. 4. [v3.6+] When ready for production use, change to new [frame scheduler round-robin algorithm](https://go-review.git.corp.google.com/c/net/+/478735).

No, the proposal above is consistent with the plan.

> It seems that the link frame scheduler round-robin algorithm isn't accessible outside google?

Sorry, I assumed it was public since the bot linked it to the issue. https://github.com/golang/go/issues/58804#issuecomment-1480303386
Comment
> Please clearly state the impact on K8s. My understanding is that there is no impact if K8s uses etcdctl instead of /health to perform the health check in the Probes. cc @neolit123

With the current proposal, there is no impact.

> [v3.4+] When ready for production use, change to new frame scheduler round-robin algorithm.
> Why would we update the stable releases 3.4 & 3.5 again in the future? No matter which way we follow, if there is no issue, let's keep them as they are.

The Random scheduler is not enough to fully address the issue, so clusters that do not use --listen-client-http-urls will still be affected.
Are you planning to roll back https://github.com/etcd-io/etcd/pull/15446 on the main branch when https://github.com/etcd-io/etcd/pull/15510 finishes? So I assume eventually https://github.com/etcd-io/etcd/pull/15446 will only be applied to 3.4 and 3.5.

Since it impacts all existing etcd users, it's unfortunate that there has been no feedback yet. The good news is K8s itself isn't impacted, but it does affect metrics collection.
> Are you planning to roll back https://github.com/etcd-io/etcd/pull/15446 on the main branch when https://github.com/etcd-io/etcd/pull/15510 finishes? So I assume eventually https://github.com/etcd-io/etcd/pull/15446 will only be applied to 3.4 and 3.5.

No plan to roll back. As discussed on https://github.com/etcd-io/etcd/pull/15510#discussion_r1149084678, there is no 100% guarantee that multiplexing grpc and http2 will always work for all clients. The whole idea is incorrect, and for total safety we should recommend that users separate grpc and http.

> Since it impacts all existing etcd users, it's unfortunate that there has been no feedback yet. The good news is K8s itself isn't impacted, but it does affect metrics collection.

There is no breaking change for K8s. For etcd users we will use https://github.com/etcd-io/etcd/pull/15479 to confirm that there is no breaking change. It would be good to also cover jetcd, however I'm not sure that will be easy to automate.
> No plan to roll back. As discussed on #15510 (comment), there is no 100% guarantee that multiplexing grpc and http2 will always work for all clients. The whole idea is incorrect, and for total safety we should recommend that users separate grpc and http.

+1 to this. Since grpc uses http2 as the transport and also builds some custom transport behaviors on top of it, I think the best solution is to completely separate them.

cmux, or the PR I sent replacing cmux, is not completely safe and can't guarantee future compatibility... and in the best case it will always be suboptimal since it adds another layer of buffering.
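A minimal sketch of that fully separated setup (ports and the handler body are illustrative only), with grpc and plain http each on their own listener instead of multiplexed on one port:

```go
package main

import (
	"log"
	"net"
	"net/http"

	"google.golang.org/grpc"
)

func main() {
	// grpc on one port...
	grpcLis, err := net.Listen("tcp", ":2379")
	if err != nil {
		log.Fatal(err)
	}
	grpcSrv := grpc.NewServer()
	// ...register the KV/Watch/Lease/etc. services on grpcSrv here...
	go func() { log.Fatal(grpcSrv.Serve(grpcLis)) }()

	// ...and the plain http handlers (health, metrics, ...) on another.
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok")) // placeholder handler
	})
	log.Fatal(http.ListenAndServe(":2382", mux))
}
```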
> Please clearly state the impact on K8s. My understanding is that there is no impact if K8s uses etcdctl instead of /health to perform the health check in the Probes. cc @neolit123

kubeadm does not use etcdctl for probes. Yet, I have not seen kubeadm user reports about such a problem, in our channels at least.

3.x I'm hesitant to propose any changes to kubeadm static pod manifests for etcd with respect to flags and probes. Users that face the issue can opt in with their own flags / patches: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#patches. If the issue is more impactful than my impression, perhaps we should make changes in all affected k8s releases?

4.x If the separation becomes on by default, I guess it will break kubeadm CI right away, so we will be reminded of this. We could still log an issue to track this, though.
> kubeadm does not use etcdctl for probes. Yet, I have not seen kubeadm user reports about such a problem, in our channels at least.

There is no behavior change in 3.5 and 3.4 by default.

> I'm hesitant to propose any changes to kubeadm static pod manifests for etcd with respect to flags and probes. Users that face the issue can opt in with their own flags / patches: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#patches. If the issue is more impactful than my impression, perhaps we should make changes in all affected k8s releases?

It's recommended to proactively make the change on the K8s side, so as to completely resolve the watch stream starvation issue.
Note: the fix on the golang side was merged: https://github.com/golang/net/commit/120fc906b30bade8c220769da77801566d7f4ec8
Returning to the issue as the fix in golang was made available.
golang.org/x/net is currently at v0.7.0; we would need to bump it to at least v0.11.0 and then remove the override. I think the last question is: should/can we backport the golang.org/x/net bump and change the scheduler on old releases?
Closing as the fix is now available by default on all supported minor versions.
For people who come to this thread: it can also happen when there are a lot of huge PUT requests (for instance, when there is a lot of resource creation in k8s).

Below is the "Client Traffic In" panel: rate(etcd_network_client_grpc_received_bytes_total{...}[3m])

From the k8s perspective: sum(rate(apiserver_watch_cache_events_received_total{...}[2m])) by (pod)

On the left side, the red line has a significant lag compared with the other two apiserver pods (meaning its watch cache has an accumulated delay in receiving new watch events) until I stopped the test, while on the right side things were fine after we upgraded etcd to 3.5.10.
What happened?
When an etcd client generates high read response load, the watch response stream on the same connection can be starved. For example, a client with an open watch that runs 10 concurrent Range requests, each returning around 10 MB, might see the watch starved and not receive any response for tens of seconds.

The problem does not occur when TLS is not enabled, nor when the watch is created on a separate client/connection.

This also affects K8s (any version), as Kubernetes has one client per resource. The problem triggers when a single K8s resource has a lot of data and 10+ concurrent LIST requests for the same resource are sent to the apiserver, for example 10k pods. This can cause serious correctness issues in K8s, like controllers doing no work because they depend on watch for updates, for example the scheduler not scheduling any pods.

We tested and confirmed that all v3.4+ versions are affected. The issue affects any watch response type.
What did you expect to happen?
The watch stream should never get blocked, no matter the load.
How can we reproduce it (as minimally and precisely as possible)?
Repro for progress notify:

1. Start etcd with TLS and progress notify enabled.
2. Insert a lot of data into etcd, for example by running the insertion command a couple of times.
3. Run a watch program against the same endpoint (an illustrative sketch follows these steps).

Instead of the expected periodic progress-notify responses, the watch output stalls and the next notify response never comes.
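The original repro program is not reproduced here; as a rough, illustrative sketch (not the original code; endpoint, options, and client version are assumptions), a watcher that exhibits the pattern of one watch plus many large concurrent ranges on the same client could look like:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is illustrative. The starvation only reproduces with TLS
	// enabled; the TLS client configuration is omitted here for brevity.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// One watch with progress notifications, sharing the connection...
	wch := cli.Watch(context.Background(), "", clientv3.WithPrefix(), clientv3.WithProgressNotify())

	// ...with many large concurrent range requests on the same client.
	for i := 0; i < 10; i++ {
		go func() {
			for {
				if _, err := cli.Get(context.Background(), "", clientv3.WithPrefix()); err != nil {
					log.Println("range error:", err)
				}
			}
		}()
	}

	// Log the gap between watch responses; under starvation it grows to
	// tens of seconds instead of the configured notify interval.
	last := time.Now()
	for resp := range wch {
		log.Printf("watch response after %v (progress notify: %v)", time.Since(last), resp.IsProgressNotify())
		last = time.Now()
	}
}
```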
Alternatively, run the ready e2e test: https://github.com/etcd-io/etcd/commit/5c26deda3b07fbc0d433c15a05e0f8a7232d0d04
Anything else we need to know?
With the help of @mborsz I managed to find the root cause. I have a fix in progress in https://github.com/etcd-io/etcd/compare/release-3.4...serathius:etcd:fix-watch-starving and will send a PR when ready.
Etcd version: all etcd versions are affected.

Etcd configuration (command line flags or environment variables): (not provided)

Etcd debug information: N/A

Relevant log output: no response