Starblade42 opened this issue 4 years ago
Are these clusters changing on a regular basis? CDS updates are known to be somewhat expensive (though I wouldn't expect them to block the main thread for as long as you're describing). If most of the clusters are the same then each update should just be a hash of the proto + compare with existing, which still isn't free but >100s update latency sounds like a lot.
I'm not sure if we've been looking much into optimizing CDS cluster load time (maybe @jmarantz knows?), but the biggest improvement you'll get is most likely from using Delta CDS, where only the updated clusters are sent instead of the entire state of the world.
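For readers who want to picture that hash-and-compare path: below is a minimal, self-contained sketch of the idea only (hypothetical names, not Envoy's actual implementation). Note that even on the "skip" path the hash still has to be computed for every cluster in a state-of-the-world update, which is why the per-update cost scales with the total cluster count:
// Minimal sketch (not Envoy's actual code) of the "hash the proto and skip
// unchanged clusters" idea described above. SerializedCluster stands in for a
// serialized Cluster proto; all names here are hypothetical.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

struct SerializedCluster {
  std::string name;
  std::string serialized_config;  // stand-in for the Cluster proto bytes
};

class ClusterCache {
public:
  // Returns true if the cluster is new or changed and must be (re)applied;
  // returns false if the hash matches and the expensive rebuild can be skipped.
  bool needsUpdate(const SerializedCluster& c) {
    const uint64_t h = std::hash<std::string>{}(c.serialized_config);
    auto it = hashes_.find(c.name);
    if (it != hashes_.end() && it->second == h) {
      return false;  // unchanged: cheap path, but hashing itself is not free
    }
    hashes_[c.name] = h;
    return true;  // new or changed: full cluster rebuild
  }

private:
  std::unordered_map<std::string, uint64_t> hashes_;
};

int main() {
  ClusterCache cache;
  SerializedCluster a{"service-a", "config-v1"};
  std::cout << cache.needsUpdate(a) << "\n";  // 1: first time, apply
  std::cout << cache.needsUpdate(a) << "\n";  // 0: unchanged, skip
  a.serialized_config = "config-v2";
  std::cout << cache.needsUpdate(a) << "\n";  // 1: changed, apply
}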
Yeah, my understanding is that Project Contour doesn’t use this API yet and that significant work would be required to make use of the delta API.
I’m also a bit unclear on the underlying details, but however the xDS APIs are being used in Contour, it seems like some typical/normal changes to the Kubernetes cluster objects prompt a CDS evaluation even without any real update: a cluster that’s in steady state as far as objects go (but not necessarily pods/endpoints) can cause Envoy to get deadlocked as I described with 7000+ clusters.
I believe that the expensiveness of CDS, combined with something in the gRPC sessions, is causing the issue.
CDS cluster initialization is definitely taking time: for 800 clusters we have seen 20s to initialize. While Delta CDS helps when updating clusters, this slowness also affects Envoy readiness during startup.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
A similar behaviour happened with some of our nodes. The node I investigated received 280 clusters and removed 1, as shown in this log:
[2020-10-12 18:00:14.160][54554][info][upstream] [external/envoy/source/common/upstream/cds_api_impl.cc:77] cds: add 280 cluster(s), remove 1 cluster(s)
And, as you can see, the last CDS update time roughly matches that timestamp:
# TYPE envoy_cluster_manager_cds_update_time gauge
envoy_cluster_manager_cds_update_time{envoy_dc="someplace"} 1602518414207
After that, the pending EDS responses were processed and Envoy got stuck. This issue started when we switched to gRPC to communicate with our control plane; we used REST before that.
The issue is only resolved when we restart Envoy. We are using version 1.14.5.
EDIT: This behaviour was detected around 13h on October 13th.
EDIT 2: The issue was caused by the connection between Envoy and the control plane being closed. We had not enabled TCP keepalive on connections to the control plane, so when the firewall dropped the packets on this TCP connection, Envoy could not detect it and waited indefinitely for the xDS response from the control plane.
Thanks
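Regarding the TCP keepalive point in EDIT 2: here is a generic POSIX-level sketch (not Envoy's configuration mechanism, which exposes keepalive through its bootstrap/cluster settings) of what enabling keepalive on a socket does, so a silently dropped connection is eventually detected instead of blocking forever. The option names are the Linux ones and the timing values are purely illustrative:
// Generic POSIX sketch, not Envoy configuration: enable TCP keepalive on a
// socket so a dead peer is detected after a bounded number of probes.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int enable_keepalive(int fd) {
  int on = 1;
  int idle = 60;      // seconds of idle time before the first probe
  int interval = 10;  // seconds between probes
  int count = 6;      // unanswered probes before the connection is declared dead
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0) return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) != 0) return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0) return -1;
  return 0;
}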
@htuch @mattklein123 can we please reopen this? This is still an issue
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
We have a similar understanding around cluster scalability. I think it would be helpful to have some flamegraphs from folks where possible so we can validate whether these are the same root cause. @pgenera is working on landing https://github.com/envoyproxy/envoy/pull/14167 to benchmark. We have known overheads in stats processing that @jmarantz has landed some recent PRs to address; see https://github.com/envoyproxy/envoy/pull/14028 and https://github.com/envoyproxy/envoy/pull/14312.
See also #14439. Moreover, for any issues with CDS update performance, flamegraphs would be most helpful!
I experimented a bit with a fake xDS server and tried to update 10000 routes containing RE2-based matching rules. It took 1-2 secs to compile them all. Then I tried to load a STRICT_DNS cluster with 5000 endpoints and it took more than 20 secs on my machine.
The flamegraph points to the Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget constructor as the main offender:
It turned out that every ResolveTarget also creates a full copy of envoy::config::endpoint::v3::LocalityLbEndpoints.
The fix seems to be trivial, but I wonder if the reporters use STRICT_DNS as well. @Starblade42, @ramaraochavali can you confirm? If not, could you post a sample config you load with CDS?
Nice catch! Any downside to just making that fix?
The fix I tried is
--- a/source/common/upstream/strict_dns_cluster.h
+++ b/source/common/upstream/strict_dns_cluster.h
@@ -38,7 +38,7 @@ private:
uint32_t port_;
Event::TimerPtr resolve_timer_;
HostVector hosts_;
- const envoy::config::endpoint::v3::LocalityLbEndpoints locality_lb_endpoint_;
+ const envoy::config::endpoint::v3::LocalityLbEndpoints& locality_lb_endpoint_;
const envoy::config::endpoint::v3::LbEndpoint lb_endpoint_;
HostMap all_hosts_;
};
With it applied Envoy seems to load the update instantly, but //test/common/upstream:upstream_impl_test and //test/integration:hotrestart_test start to fail. So, it's probably not that trivial :) I'll take a look tomorrow.
Yeah I was wondering about lifetime issues. In a quick scan it looks like there isn't a ton of data pulled from the endpoints structure in this class; maybe it could be pulled out individually in the ctor rather than copying the whole thing?
A deeper fix might be to move away from using the proto representation for endpoints inside DNS clusters. For convenience, I think we've inherited this view of endpoints that mostly makes sense for static or EDS endpoints, but the merge costs are non-trivial at scale in this case.
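To make the "pull out individual fields in the ctor" idea concrete, here is a hypothetical sketch, assuming for illustration that only the locality, priority and load-balancing weight are needed (the real ResolveTarget may need more, so this is not a drop-in change):
// Hypothetical sketch of copying only the fields that are used, instead of the
// whole LocalityLbEndpoints proto. Class name and the set of retained fields
// are assumptions for illustration.
#include <cstdint>

#include "envoy/config/endpoint/v3/endpoint_components.pb.h"  // generated proto header (v3 layout)

class ResolveTargetSketch {
public:
  explicit ResolveTargetSketch(const envoy::config::endpoint::v3::LocalityLbEndpoints& lle)
      : locality_(lle.locality()),  // small sub-message copy instead of the full proto
        priority_(lle.priority()),
        load_balancing_weight_(
            lle.has_load_balancing_weight() ? lle.load_balancing_weight().value() : 1) {}

  uint32_t priority() const { return priority_; }

private:
  const envoy::config::core::v3::Locality locality_;
  const uint32_t priority_;
  const uint32_t load_balancing_weight_;
};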
Submitted https://github.com/envoyproxy/envoy/pull/15013 as an attempt to fix it.
@rojkov Thanks for that fix. But this also happens when the majority of the clusters are EDS, so your PR will not fully fix this issue.
@ramaraochavali Alright, I'll drop the close tag from the PR's description to keep this issue open for now.
I've played a bit more with CDS and EDS: loaded 10000 clusters with EDS load assignments. For Envoy compiled with -c opt it took about 15 secs; with -c dbg, about 150 secs. So, compilation options can be very impactful.
But anyway, the process of a CDS update seems to have three phases.
The CPU profiles for these phases are very different. Here's the first phase.
By the end of this phase Envoy consumes about 500M.
Then the second phase starts
Along this phase Envoy starts to consume memory faster until it allocates about 3800M. Here's the heap profile just before the third phase.
Then Envoy sends the request to my fake ADS server. At this point I see a burst in the network traffic: about 450M is sent from Envoy to ADS. After that Envoy deallocates memory, its consumption drops back to 500M.
So, perhaps in congested cluster networks the third phase may take longer.
Adding a single cluster to these 10000 in a new CDS update doesn't seem to put any significant load on CPU (at least comparing with the initial loading of 10000).
Thanks for collecting these graphs. I assume from the numbers above that these were collected with "-c dbg", right? Are you reluctant to collect them with "-c opt" because the names in the flame-graphs are less helpful, or do you need line info somewhere?
One thing I use sometimes is "OptDebug": --compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3. That might be worth a shot.
Oh, thank you! That's helpful. I used "-c dbg" to get more meaningful function names indeed. With the OptDebug options the call stacks are less deep, but still meaningful:
There's some pretty strange stuff in the pprof heap profile. E.g. Node MergeFrom is allocating 2GB? What's going on there? :-D
Maybe the image is rendering "alloc_space" (+X bytes for malloc(x), +0 for free()). @rojkov can you confirm?
No, that's "inuse".
This is what the profiler log looked like:
Dumping heap profile to /tmp/orig.heapprof.0001.heap (100 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0002.heap (200 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0003.heap (300 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0004.heap (400 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0005.heap (500 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0006.heap (600 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0007.heap (700 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0008.heap (800 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0009.heap (900 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0010.heap (1000 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0011.heap (1100 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0012.heap (1200 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0013.heap (1300 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0014.heap (1400 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0015.heap (1500 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0016.heap (1600 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0017.heap (1700 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0018.heap (1800 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0019.heap (1900 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0020.heap (2000 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0021.heap (2100 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0022.heap (2200 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0023.heap (2300 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0024.heap (2400 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0025.heap (2500 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0026.heap (2600 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0027.heap (2700 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0028.heap (2800 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0029.heap (2900 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0030.heap (3000 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0031.heap (3100 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0032.heap (3200 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0033.heap (3300 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0034.heap (3400 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0035.heap (3500 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0036.heap (8610 MB allocated cumulatively, 3538 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0037.heap (3638 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0038.heap (3738 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0039.heap (3838 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0040.heap (3938 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0041.heap (4038 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0042.heap (10176 MB allocated cumulatively, 3739 MB currently in use) <- Here one can see a spike in network traffic from Envoy to xDS
Dumping heap profile to /tmp/orig.heapprof.0043.heap (11207 MB allocated cumulatively, 543 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0044.heap (12231 MB allocated cumulatively, 546 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0045.heap (13263 MB allocated cumulatively, 543 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0046.heap (14287 MB allocated cumulatively, 546 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0047.heap (15319 MB allocated cumulatively, 543 MB currently in use)
New findings:
When Envoy restarts, it triggers this code path upon CDS initialization:
// Remove the previous cluster before the cluster object is destroyed.
secondary_init_clusters_.remove_if(
[name_to_remove = cluster.info()->name()](ClusterManagerCluster* cluster_iter) {
return cluster_iter->cluster().info()->name() == name_to_remove;
});
secondary_init_clusters_.push_back(&cm_cluster);
Here secondary_init_clusters_ is a std::list. When Envoy loads 10k clusters, this secondary_init_clusters_.remove_if() line takes 4 sec in total; with 30k clusters it takes about 70 sec. Since remove_if() walks the entire list and is called once per added cluster, the cost grows roughly quadratically with the number of clusters. A hash map keyed by cluster name would probably be a better choice here (see the sketch after these findings).
Hashes for every Cluster message are calculated twice: the first time is here and the second time is here. The performance impact is not that huge, though: the initial hashing of 30k messages takes 400 msec, and when an update arrives it takes about 90 msec for the same 30k messages.
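As referenced in the first finding, here is a rough sketch (not a drop-in patch) of the data-structure change: keying the pending secondary-init clusters by name turns the remove-before-reinsert step into a hash lookup instead of a full list scan. The real std::list also preserves insertion order, which a production fix would need to keep (e.g. a list plus a name-to-iterator index); this sketch ignores that detail:
// Rough sketch of replacing the linear remove_if() over std::list with a
// name-keyed container. Types are stand-ins, not Envoy's real classes.
#include <string>
#include <unordered_map>

struct ClusterManagerClusterSketch {  // stand-in for ClusterManagerCluster
  std::string name;
};

class SecondaryInitClusters {
public:
  void addOrReplace(ClusterManagerClusterSketch& cluster) {
    // Equivalent of remove_if(name == ...) followed by push_back(&cluster),
    // but without walking the whole container for every added cluster.
    by_name_[cluster.name] = &cluster;
  }

private:
  std::unordered_map<std::string, ClusterManagerClusterSketch*> by_name_;
};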
That’s excellent work! It would be awesome if a data structure change could bring down the operational time by a lot.
You can use the --ignore-unknown-dynamic-fields command line flag to disable detection of deprecated fields in the config. This will remove the cost of VersionUtil::scrubHiddenEnvoyDeprecated.
It also looks like your control plane uses the v2 API. If so, converting to v3 should eliminate the VersionConverter::... costs.
I was curious why an update with 10k EDS clusters turns into >100M of traffic from Envoy to the server, so I sniffed it with Wireshark. Every EDS discovery request for a cluster includes a copy of Node, which adds roughly 11k per request: 10k * 11k ~= 110M.
AFAIU the Node message is needed to identify the client. I tried to limit the data included in a request to the Node's id and cluster fields. It reduced the traffic to 6M:
@@ -63,13 +64,20 @@ void GrpcMuxImpl::sendDiscoveryRequest(const std::string& type_url) {
if (api_state.must_send_node_ || !skip_subsequent_node_ || first_stream_request_) {
// Node may have been cleared during a previous request.
- request.mutable_node()->CopyFrom(local_info_.node());
+ envoy::config::core::v3::Node n;
+ n.set_id(local_info_.node().id());
+ n.set_cluster(local_info_.node().cluster());
+ request.mutable_node()->CopyFrom(n);
api_state.must_send_node_ = false;
} else {
@@ -93,7 +103,11 @@ GrpcMuxWatchPtr GrpcMuxImpl::addWatch(const std::string& type_url,
// TODO(gsagula): move TokenBucketImpl params to a config.
if (!apiStateFor(type_url).subscribed_) {
apiStateFor(type_url).request_.set_type_url(type_url);
- apiStateFor(type_url).request_.mutable_node()->MergeFrom(local_info_.node());
+ envoy::config::core::v3::Node n;
+ n.set_id(local_info_.node().id());
+ n.set_cluster(local_info_.node().cluster());
+ apiStateFor(type_url).request_.mutable_node()->MergeFrom(n);
apiStateFor(type_url).subscribed_ = true;
subscriptions_.emplace_back(type_url);
if (enable_type_url_downgrade_and_upgrade_) {
It also helped to slash a couple of seconds off GrpcMuxImpl::sendDiscoveryRequest():
With full Node:
Duration(us) # Calls Mean(ns) StdDev(ns) Min(ns) Max(ns) Category Description
2332069 40020 58272 13650.1 19552 382876 done GrpcMuxImpl::sendDiscoveryRequest
291161 10006 29098 5528.3 16645 184649 done GrpcMuxImpl::addWatch
With limited Node:
Duration(us) # Calls Mean(ns) StdDev(ns) Min(ns) Max(ns) Category Description
225534 40020 5635 3212.79 2367 124869 done GrpcMuxImpl::sendDiscoveryRequest
22091 10006 2207 1740.46 974 81893 done GrpcMuxImpl::addWatch
Looks like Envoy is bound to perform slightly worse with every new release as the number of extensions enumerated in Node increases. Can we announce node capabilities only once per gRPC session?
@rojkov this is already supported for SotW (but not delta) xDS, see https://github.com/envoyproxy/envoy/blob/07c4c17be61c77d87d2c108b0775f2e606a7ae12/api/envoy/config/core/v3/config_source.proto#L107.
I'm thinking this is another thing we would prefer to default to true (but we'd need to do a deprecation dance to move it to BoolValue).
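For completeness, here is a sketch of opting into the SotW behavior referenced above by setting the ApiConfigSource field set_node_on_first_message_only. In practice this is normally set in the bootstrap YAML; the snippet builds the proto through the generated C++ API purely for illustration (header path assumed from Envoy's v3 layout):
// Illustration only: build an ApiConfigSource that asks Envoy to send Node
// only on the first request of each SotW gRPC stream.
#include "envoy/config/core/v3/config_source.pb.h"

envoy::config::core::v3::ApiConfigSource makeAdsConfig() {
  envoy::config::core::v3::ApiConfigSource api;
  api.set_api_type(envoy::config::core::v3::ApiConfigSource::GRPC);
  api.set_set_node_on_first_message_only(true);  // Node included only in the first stream request
  return api;
}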
@adisuissa this might be another potential cause of buffer bloat in the issue you are looking at.
https://github.com/envoyproxy/envoy/blob/2966597391b9c7743dab1186f214229ca95e0243/source/common/upstream/cds_api_impl.cc#L52-L101
This is a performance issue, not a bug per se.
When doing CDS updates with many clusters, Envoy will often get "stuck" evaluating the CDS update. This manifests as EDS failing, and in more extreme cases Envoy ceases to receive any xDS updates. When this happens, Envoy needs to be restarted to get it updating again.
In our case we're seeing issues with the current implementation of void CdsApiImpl::onConfigUpdate with a number of clusters in the 3000-7000 range. If Envoy could speedily evaluate a CDS update with 10000 clusters, this would represent a HUGE improvement in Envoy's behavior for us. Right now, only around 2500 clusters in a CDS update seems to evaluate in a reasonable amount of time.
Because void CdsApiImpl::onConfigUpdate pauses EDS while doing CDS evaluation, Envoy's config will drift. With many clusters in CDS, this can mean Envoy is hundreds of seconds behind what is current, which results in 503s.
Some context:
Contour currently doesn't use the incremental xDS APIs, so when K8s Services change in the cluster it sends ALL of the current config to Envoy again. This means a small change like adding or removing a K8s Service that maps to an Envoy cluster results in Envoy having to re-evaluate ALL clusters.
With enough K8s Services behind an Ingress (7000+), Envoy can spontaneously cease to receive any new updates indefinitely, and will fail to do EDS because it gets stuck in the CDS evaluation.
Given a high enough number of clusters this would be a Very Hard Problem, but given that we're in the low thousands, I'm hoping there are some things that could be done to improve performance without resorting to exotic methods.
Please let me know if there's any information I can provide that could help!