envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

CDS Updates with many clusters often fail #12138

Open Starblade42 opened 4 years ago

Starblade42 commented 4 years ago

https://github.com/envoyproxy/envoy/blob/2966597391b9c7743dab1186f214229ca95e0243/source/common/upstream/cds_api_impl.cc#L52-L101

This is a performance issue, not a bug per se.

When doing CDS updates with many clusters, Envoy will often get "stuck" evaluating the CDS update. This manifests as EDS failing, and in more extreme cases Envoy ceases to receive any xDS updates at all. When this happens, Envoy needs to be restarted to get it updating again.

In our case we're seeing issues with the current implementation of void CdsApiImpl::onConfigUpdate with a number of clusters in the 3000-7000 range. If Envoy could speedily evaluate a CDS update with 10000 clusters, this would represent a HUGE improvement in Envoy's behavior for us. Right now, only around 2500 clusters in a CDS update seems to evaluate in a reasonable amount of time.

Because void CdsApiImpl::onConfigUpdate pauses EDS while doing CDS evaluation, Envoy's config will drift. With many clusters in CDS, this can mean Envoy is hundreds of seconds behind the current state, which results in 503s.
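
Roughly, the shape of the problem is the following. This is just an illustrative sketch: EdsMux, ScopedEdsPause and onConfigUpdateSketch are made-up names, not Envoy's actual internals.

#include <functional>

// Illustrative only: a made-up mux with pause/resume, standing in for however
// Envoy actually gates EDS while a CDS update is applied.
struct EdsMux {
  void pause() { /* stop sending/processing EDS discovery requests */ }
  void resume() { /* resume EDS once the CDS pass is done */ }
};

// RAII guard so EDS is resumed even if cluster evaluation throws.
class ScopedEdsPause {
public:
  explicit ScopedEdsPause(EdsMux& mux) : mux_(mux) { mux_.pause(); }
  ~ScopedEdsPause() { mux_.resume(); }

private:
  EdsMux& mux_;
};

void onConfigUpdateSketch(EdsMux& eds, const std::function<void()>& evaluate_all_clusters) {
  ScopedEdsPause pause(eds);
  // With thousands of clusters this call runs for a long time on the main
  // thread, and every EDS update queued behind it goes stale in the meantime.
  evaluate_all_clusters();
}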

Some context:

Contour currently doesn't use the incremental APIs of Envoy, so when K8s Services change in the cluster it sends ALL of the current config to Envoy again. This means a small change, like adding or removing a K8s Service that maps to an Envoy cluster, results in Envoy having to re-evaluate ALL clusters.

With enough K8s Services behind an Ingress (7000+), Envoy can spontaneously cease to receive any new updates indefinitely and will fail to do EDS because it gets stuck in the CDS evaluation.

At a high enough number of clusters this would be a Very Hard Problem, but given that we're in the low thousands, I'm hoping there are some things that could be done to improve performance without resorting to exotic methods.

Please let me know if there's any information I can provide that could help!

snowp commented 4 years ago

Are these clusters changing on a regular basis? CDS updates are known to be somewhat expensive (though I wouldn't expect them to block the main thread for as long as you're describing). If most of the clusters are the same then each update should just be a hash of the proto + compare with existing, which still isn't free but >100s update latency sounds like a lot.

I'm not sure if we've been looking much into optimizing CDS cluster load time (maybe @jmarantz knows?), but the biggest improvement you'll get is most likely from using Delta CDS, where only the updated clusters are sent instead of the entire state of the world.
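
For context, the hash-and-compare fast path looks roughly like this. This is a simplified sketch rather than the actual CdsApiImpl code; hashProto, clusterChanged and previous_hashes are illustrative names, and the real code hashes the proto more carefully than serialize-then-std::hash.

#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

#include <google/protobuf/message.h>

// Simplified stand-in for Envoy's proto hashing, just to show the idea.
uint64_t hashProto(const google::protobuf::Message& m) {
  std::string bytes;
  m.SerializeToString(&bytes);
  return std::hash<std::string>{}(bytes);
}

// Returns true only when the cluster actually changed, so unchanged clusters
// can skip the expensive add/update path.
bool clusterChanged(const std::string& name, const google::protobuf::Message& cluster,
                    std::unordered_map<std::string, uint64_t>& previous_hashes) {
  const uint64_t hash = hashProto(cluster);
  auto it = previous_hashes.find(name);
  if (it != previous_hashes.end() && it->second == hash) {
    return false;
  }
  previous_hashes[name] = hash;
  return true;
}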

Starblade42 commented 4 years ago

Yeah, my understanding is that Project Contour doesn't use this API yet and that there would be significant work required to make use of the delta API.

I'm also a bit unclear on the underlying details, but however the xDS APIs are being used in Contour, it seems like some typical/normal changes to Kubernetes objects trigger a CDS evaluation even when there is no real cluster update: a cluster that's in steady state as far as objects go (though not necessarily pods/endpoints) can cause Envoy to get deadlocked as I described with 7000+ clusters.

I believe the expense of the CDS evaluation, and possibly something in the gRPC sessions, combine to cause the issue.

ramaraochavali commented 4 years ago

CDS cluster initialization is definitely taking time: for 800 clusters we have seen 20s to initialize. While Delta CDS helps when updating clusters, this slowness also affects Envoy's readiness during startup.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

alandiegosantos commented 4 years ago

A similar behaviour happened on some of our nodes. The node I investigated received 280 clusters and removed 1, as shown in this log:

[2020-10-12 18:00:14.160][54554][info][upstream] [external/envoy/source/common/upstream/cds_api_impl.cc:77] cds: add 280 cluster(s), remove 1 cluster(s)

And, as you can see, the last CDS update time roughly matches that timestamp:

# TYPE envoy_cluster_manager_cds_update_time gauge
envoy_cluster_manager_cds_update_time{envoy_dc="someplace"} 1602518414207

After that, the pending EDS responses were processed and Envoy got stuck. This started when we switched from REST to gRPC for communicating with our control plane.

The issue is only resolved by restarting Envoy. We are using version 1.14.5.

EDIT: This behaviour was detected around 13:00 on October 13th.

EDIT 2: The issue was caused by the connection between Envoy and the control plane being closed. We had not enabled TCP keepalive on connections to the control plane, so when the firewall dropped packets on that TCP connection, Envoy was unable to detect it and waited indefinitely for the xDS response.
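
For anyone hitting the same thing, this is a sketch of the mitigation via the proto API. The field names come from the cluster/address protos, but enableControlPlaneKeepalive is a made-up helper and the values and bootstrap wiring are illustrative, not our exact setup.

#include "envoy/config/cluster/v3/cluster.pb.h"

// Enable TCP keepalive on the cluster pointing at the control plane, so a
// silently dropped connection is noticed instead of waiting forever for xDS.
void enableControlPlaneKeepalive(envoy::config::cluster::v3::Cluster& xds_cluster) {
  auto* keepalive =
      xds_cluster.mutable_upstream_connection_options()->mutable_tcp_keepalive();
  keepalive->mutable_keepalive_time()->set_value(60);     // idle seconds before probing
  keepalive->mutable_keepalive_interval()->set_value(10); // seconds between probes
  keepalive->mutable_keepalive_probes()->set_value(3);    // failed probes before closing
}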

Thanks

ramaraochavali commented 4 years ago

@htuch @mattklein123 can we please reopen this? This is still an issue

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

htuch commented 3 years ago

We have a similar understanding of cluster scalability. I think it would be helpful to have some flamegraphs from folks where possible so we can validate whether these are the same root cause. @pgenera is working on landing https://github.com/envoyproxy/envoy/pull/14167 to benchmark. We have known overheads in stats processing that @jmarantz has landed some recent PRs to address, see https://github.com/envoyproxy/envoy/pull/14028 and https://github.com/envoyproxy/envoy/pull/14312.

jmarantz commented 3 years ago

See also #14439. Moreover, for any issues with CDS update performance, flame graphs would be most helpful!

rojkov commented 3 years ago

I experimented a bit with a fake xDS server: updating 10000 routes containing RE2-based matching rules took 1-2 secs to compile them all. Then I tried to load a STRICT_DNS cluster with 5000 endpoints, and that took more than 20 secs on my machine.

The flamegraph points to the Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget constructor as the main offender:

[flamegraph: cds-update]

It turned out that every ResolveTarget also creates a full copy of envoy::config::endpoint::v3::LocalityLbEndpoints.

The fix seems to be trivial, but I wonder if the reporters use STRICT_DNS as well. @Starblade42, @ramaraochavali, can you confirm? If not, could you post a sample config that you load via CDS?

jmarantz commented 3 years ago

Nice catch! Any downside to just making that fix?

rojkov commented 3 years ago

The fix I tried is

--- a/source/common/upstream/strict_dns_cluster.h
+++ b/source/common/upstream/strict_dns_cluster.h
@@ -38,7 +38,7 @@ private:
     uint32_t port_;
     Event::TimerPtr resolve_timer_;
     HostVector hosts_;
-    const envoy::config::endpoint::v3::LocalityLbEndpoints locality_lb_endpoint_;
+    const envoy::config::endpoint::v3::LocalityLbEndpoints& locality_lb_endpoint_;
     const envoy::config::endpoint::v3::LbEndpoint lb_endpoint_;
     HostMap all_hosts_;
   };

With it applied, Envoy seems to load the update instantly, but //test/common/upstream:upstream_impl_test and //test/integration:hotrestart_test start to fail. So it's probably not that trivial :) I'll take a look tomorrow.

jmarantz commented 3 years ago

Yeah, I was wondering about lifetime issues. From a quick scan it looks like there isn't a ton of data pulled from the endpoints structure in this class; maybe the needed fields could be pulled out individually in the constructor rather than copying the whole thing?
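
Something along these lines, as a sketch only; which fields ResolveTarget actually needs is an assumption here, ResolveTargetLocalityInfo is a made-up name, and the header paths assume the generated v3 protos.

#include <cstdint>

#include "envoy/config/core/v3/base.pb.h"
#include "envoy/config/endpoint/v3/endpoint_components.pb.h"

// Copy out only the fields that are needed instead of the whole
// LocalityLbEndpoints proto.
struct ResolveTargetLocalityInfo {
  explicit ResolveTargetLocalityInfo(
      const envoy::config::endpoint::v3::LocalityLbEndpoints& lle)
      : locality_(lle.locality()), priority_(lle.priority()),
        load_balancing_weight_(lle.load_balancing_weight().value()) {}

  const envoy::config::core::v3::Locality locality_;
  const uint32_t priority_;
  const uint32_t load_balancing_weight_;
};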

htuch commented 3 years ago

A deeper fix might be to move away from using the proto representation for endpoints inside DNS clusters. For convenience, I think we've inherited this view of endpoints that mostly makes sense for static or EDS endpoints, but the merge costs are non-trivial at scale in this case.

rojkov commented 3 years ago

Submitted https://github.com/envoyproxy/envoy/pull/15013 as an attempt to fix it.

ramaraochavali commented 3 years ago

@rojkov Thanks for that fix. But this also happens when the majority of the clusters are EDS, so your PR will not fully fix this issue.

rojkov commented 3 years ago

@ramaraochavali Alright, I'll drop the close tag from the PR's description to keep this issue open for now.

rojkov commented 3 years ago

I've played a bit more with CDS and EDS: I loaded 10000 clusters with EDS load assignments. For Envoy compiled with -c opt it took about 15 secs; with -c dbg, about 150 secs. So compilation options can be very impactful.

In any case, a CDS update seems to go through three phases:

  1. stats generation (~80 of 150 secs),
  2. preparing the gRPC discovery requests for EDS (~60 of 150 secs),
  3. sending the prepared requests to xDS and opening 10000 gRPC streams for EDS (~10 of 150 secs).

The CPU profiles for these phases are very different. Here's the first phase.

[flamegraph: tag-extraction]

By the end of this phase Envoy consumes about 500M.

Then the second phase starts:

[flamegraph: after-tag-extraction]

During this phase Envoy consumes memory at a higher rate until it has allocated about 3800M. Here's the heap profile just before the third phase:

[heap profile: mem-state-before-grpc-streams]

Then Envoy sends the requests to my fake ADS server. At this point I see a burst in network traffic: about 450M is sent from Envoy to the ADS server. After that Envoy deallocates memory and its consumption drops back to 500M.

So, perhaps in congested cluster networks the third phase may take longer.

Adding a single cluster to these 10000 in a new CDS update doesn't seem to put any significant load on the CPU (at least compared with the initial load of 10000).

jmarantz commented 3 years ago

Thanks for collecting these graphs. I assume from the numbers above that these were collected with "-c dbg", right? Are you reluctant to collect them with "-c opt" because the names in the flame graphs are less helpful, or do you need line info somewhere?

One thing I use sometimes is "OptDebug": --compilation_mode=opt --cxxopt=-g --cxxopt=-ggdb3. That might be worth a shot.

rojkov commented 3 years ago

Oh, thank you! That's helpful. Indeed, I used "-c dbg" to get more meaningful function names. With the OptDebug options the call stacks are less deep, but still meaningful:

[flamegraph: tag-extraction-c-opt]

htuch commented 3 years ago

There's some pretty strange stuff in the pprof heap profile. E.g. Node MergeFrom is allocating 2GB? What's going on there? :-D

lambdai commented 3 years ago

There's some pretty strange stuff in the pprof heap profile. E.g. Node MergeFrom is allocating 2GB? What's going on there? :-D

Maybe the image is rendering "alloc_space" (+X bytes for malloc(x), +0 for free()). @rojkov can you confirm?

rojkov commented 3 years ago

Maybe the image is rendering "alloc_space" (+X bytes for malloc(x), +0 for free()). @rojkov can you confirm?

No, that's "inuse".

This is what the profiler log looked like:

Dumping heap profile to /tmp/orig.heapprof.0001.heap (100 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0002.heap (200 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0003.heap (300 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0004.heap (400 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0005.heap (500 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0006.heap (600 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0007.heap (700 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0008.heap (800 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0009.heap (900 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0010.heap (1000 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0011.heap (1100 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0012.heap (1200 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0013.heap (1300 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0014.heap (1400 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0015.heap (1500 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0016.heap (1600 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0017.heap (1700 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0018.heap (1800 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0019.heap (1900 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0020.heap (2000 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0021.heap (2100 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0022.heap (2200 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0023.heap (2300 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0024.heap (2400 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0025.heap (2500 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0026.heap (2600 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0027.heap (2700 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0028.heap (2800 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0029.heap (2900 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0030.heap (3000 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0031.heap (3100 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0032.heap (3200 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0033.heap (3300 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0034.heap (3400 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0035.heap (3500 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0036.heap (8610 MB allocated cumulatively, 3538 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0037.heap (3638 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0038.heap (3738 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0039.heap (3838 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0040.heap (3938 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0041.heap (4038 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0042.heap (10176 MB allocated cumulatively, 3739 MB currently in use) <- Here one can see a spike in network traffic from Envoy to xDS
Dumping heap profile to /tmp/orig.heapprof.0043.heap (11207 MB allocated cumulatively, 543 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0044.heap (12231 MB allocated cumulatively, 546 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0045.heap (13263 MB allocated cumulatively, 543 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0046.heap (14287 MB allocated cumulatively, 546 MB currently in use)
Dumping heap profile to /tmp/orig.heapprof.0047.heap (15319 MB allocated cumulatively, 543 MB currently in use)

rojkov commented 3 years ago

New findings:

  1. When Envoy restarts, it triggers this code path upon CDS initialization:

    // Remove the previous cluster before the cluster object is destroyed.
    secondary_init_clusters_.remove_if(
        [name_to_remove = cluster.info()->name()](ClusterManagerCluster* cluster_iter) {
          return cluster_iter->cluster().info()->name() == name_to_remove;
        });
    secondary_init_clusters_.push_back(&cm_cluster);

    Here secondary_init_clusters_ is a std::list. When Envoy loads 10k clusters, this secondary_init_clusters_.remove_if() line takes 4 secs; with 30k it takes about 70 secs. A hash map would probably be a better choice here (see the sketch after this list).

  2. Hashes for every Cluster message are calculated twice: the first time is here and the second time is here. The performance impact is not that big, though. The initial hashing of 30k messages takes 400 msec; when an update arrives, it takes about 90 msec for the same 30k messages.
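
Here is a sketch of what I mean by the hash map for point 1. The names are illustrative (ClusterManagerCluster is just a stand-in), and if the insertion order of the current std::list matters it would have to be preserved some other way.

#include <string>
#include <unordered_map>

struct ClusterManagerCluster; // stand-in for the real type

class SecondaryInitClusters {
public:
  // Replaces the O(n) list walk with an average O(1) erase + insert keyed on
  // the cluster name.
  void upsert(const std::string& name, ClusterManagerCluster* cluster) {
    clusters_.erase(name);
    clusters_.emplace(name, cluster);
  }

  void remove(const std::string& name) { clusters_.erase(name); }

private:
  std::unordered_map<std::string, ClusterManagerCluster*> clusters_;
};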

Starblade42 commented 3 years ago

That's excellent work! It would be awesome if a data structure change could bring the operational time down by a lot.

yanavlasov commented 3 years ago

You can use the --ignore-unknown-dynamic-fields command line flag to disable detection of deprecated fields in the config; this will remove the cost of VersionUtil::scrubHiddenEnvoyDeprecated. It also looks like your control plane uses the v2 API. If so, converting to v3 should eliminate the VersionConverter::... costs.

rojkov commented 3 years ago

I was curious why an update with 10k EDS clusters turns into >100M of traffic from Envoy to the server, so I sniffed it with Wireshark. It turns out every EDS discovery request for a cluster includes a copy of Node, which adds ~11k per request: 10k * 11k ~= 110M.

[Wireshark capture: wireshark-cds]

AFAIU the Node message is needed to identify the client. I tried to limit the data included in a request to the Node's id and cluster fields, which reduced the traffic to 6M:

@@ -63,13 +64,20 @@ void GrpcMuxImpl::sendDiscoveryRequest(const std::string& type_url) {

   if (api_state.must_send_node_ || !skip_subsequent_node_ || first_stream_request_) {
     // Node may have been cleared during a previous request.
-    request.mutable_node()->CopyFrom(local_info_.node());
+    envoy::config::core::v3::Node n;
+    n.set_id(local_info_.node().id());
+    n.set_cluster(local_info_.node().cluster());
+    request.mutable_node()->CopyFrom(n);
     api_state.must_send_node_ = false;
   } else {
@@ -93,7 +103,11 @@ GrpcMuxWatchPtr GrpcMuxImpl::addWatch(const std::string& type_url,
   // TODO(gsagula): move TokenBucketImpl params to a config.
   if (!apiStateFor(type_url).subscribed_) {
     apiStateFor(type_url).request_.set_type_url(type_url);
-    apiStateFor(type_url).request_.mutable_node()->MergeFrom(local_info_.node());
+    envoy::config::core::v3::Node n;
+    n.set_id(local_info_.node().id());
+    n.set_cluster(local_info_.node().cluster());
+    apiStateFor(type_url).request_.mutable_node()->MergeFrom(n);
     apiStateFor(type_url).subscribed_ = true;
     subscriptions_.emplace_back(type_url);
     if (enable_type_url_downgrade_and_upgrade_) {

It also shaved a couple of seconds off GrpcMuxImpl::sendDiscoveryRequest():

With full Node:

Duration(us)  # Calls  Mean(ns)  StdDev(ns)  Min(ns)  Max(ns)  Category             Description
     2332069    40020     58272     13650.1    19552   382876  done                 GrpcMuxImpl::sendDiscoveryRequest
      291161    10006     29098      5528.3    16645   184649  done                 GrpcMuxImpl::addWatch

With limited Node:

Duration(us)  # Calls  Mean(ns)  StdDev(ns)  Min(ns)  Max(ns)  Category             Description
      225534    40020      5635     3212.79     2367   124869  done                 GrpcMuxImpl::sendDiscoveryRequest
       22091    10006      2207     1740.46      974    81893  done                 GrpcMuxImpl::addWatch

Looks like Envoy is bound to perform slightly worse with every new release as the number of extensions enumerated in Node increases. Can we announce node capabilities only once per gRPC session?

htuch commented 3 years ago

@rojkov this is already supported for SotW (but not delta) xDS, see https://github.com/envoyproxy/envoy/blob/07c4c17be61c77d87d2c108b0775f2e606a7ae12/api/envoy/config/core/v3/config_source.proto#L107.

I'm thinking this is something else we would prefer to default to true (but we'd need to do a deprecation dance to move to BoolValue).
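
For reference, opting in today via the proto API looks roughly like this; the field is the one linked above, sendNodeOnFirstMessageOnly is a made-up helper, and how the ApiConfigSource reaches the bootstrap is omitted.

#include "envoy/config/core/v3/config_source.pb.h"

// Ask the SotW mux to send the full Node only on the first request of a stream.
void sendNodeOnFirstMessageOnly(envoy::config::core::v3::ApiConfigSource& ads_config) {
  ads_config.set_set_node_on_first_message_only(true);
}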

@adisuissa this might be another potential cause of buffer bloat in the issue you are looking at.