envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Envoy and control-plane xDS over ADS protocol deadlock when filtering stats #9856

Open dfjones opened 4 years ago

dfjones commented 4 years ago

Description: When a stats_matcher inclusion_list is configured, an Envoy using the ADS variant of the xDS protocol fails to receive cluster updates from the control plane. After receiving an initial cluster list, Envoy no longer receives cluster updates. When paired with envoyproxy/go-control-plane in ADS mode, this can cause EDS updates to stall if the cluster list is later expanded.
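For context, the stats_matcher inclusion_list in question lives in the bootstrap stats_config. The sketch below is only illustrative; the prefixes are placeholders rather than the exact list from the repro, and an inclusion list that happens to drop cluster_manager.warming_clusters appears to be enough to hit the behaviour discussed later in this thread.

stats_config:
  stats_matcher:
    inclusion_list:
      patterns:
      - prefix: "cluster."    # keep per-cluster stats for the stats sink
      - prefix: "http."       # keep HTTP connection manager stats
      - prefix: "listener."   # note: nothing here matches cluster_manager.*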

Repro steps: A docker-compose reproduction of the issue with Envoy 1.12.2 and 1.13.0

Relevant Links:

mattklein123 commented 4 years ago

From a drive by, this falls into the general category of bugs in which we are using stats for control flow, which we need to stop doing.

dfjones commented 4 years ago

@mattklein123 Do you think a warning in the docs is warranted until these bugs are resolved? This bit us pretty hard in production, and it took a lot of investigation before we traced it down to the stats filtering.

mattklein123 commented 4 years ago

> @mattklein123 Do you think a warning in the docs is warranted until these bugs are resolved? This bit us pretty hard in production, and it took a lot of investigation before we traced it down to the stats filtering.

Yes, if you want to do a PR that would be good. Do you know which stat causes the issue here? That would cut down on investigation time for whoever fixes this.

dfjones commented 4 years ago

Sure thing, I'll see if I can narrow down what stat caused the issue.

dfjones commented 4 years ago

I've traced it down to something in the cluster.* stats. However, I haven't yet figured out which particular stat under cluster is necessary. cluster.*.version seemed like an obvious guess, but the bug persists even when it is the only one present. Perhaps it's a combo?

Edit: here's a branch in my repro case where I'm trying to narrow down the stats that cause the issue: https://github.com/dfjones/envoy-stats-filter-repro/tree/min-stats-repro

mattklein123 commented 4 years ago

This might be a dup of https://github.com/envoyproxy/envoy/issues/8771

dfjones commented 4 years ago

I took a look and this does appear to be a dup of #8771. In fact, this comment tipped me off and I found that I can reproduce the issue by just excluding:

suffix: "warming_clusters"

I've updated my repo to show this: https://github.com/dfjones/envoy-stats-filter-repro/tree/min-stats-repro
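For clarity, the exclusion that reproduces the problem corresponds to a bootstrap stats_config roughly like the following; only the suffix matcher is taken from the repro, and the surrounding structure is just the standard stats_matcher shape.

stats_config:
  stats_matcher:
    exclusion_list:
      patterns:
      - suffix: "warming_clusters"   # drops cluster_manager.warming_clusters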

mpuncel commented 4 years ago

I assume that since this issue is still open, there hasn't been a fix? We've encountered a similar-sounding deadlock, and realized we are not including the cluster_manager stats in our inclusion list.

This code also seems suspicious since it's relying on stats and it looks like it will never resume CDS if cm_stats_.warming_clusters_.value() is always zero.

void ClusterManagerImpl::updateClusterCounts() {
  // This if/else block implements a control flow mechanism that can be used by an ADS
  // implementation to properly sequence CDS and RDS updates. It is not enforced by ADS; ADS can
  // use it to detect when a previously sent cluster becomes warm before sending routes that depend
  // on it. This can reduce the incidence of HTTP 503 responses from Envoy when a route is used
  // before its supporting cluster is ready.
  //
  // We achieve that by leaving CDS in the paused state as long as there is at least
  // one cluster in the warming state. This prevents CDS ACK from being sent to ADS.
  // Once cluster is warmed up, CDS is resumed, and ACK is sent to ADS, providing a
  // signal to ADS to proceed with RDS updates.
  // If we're in the middle of shutting down (ads_mux_ already gone) then this is irrelevant.
  if (ads_mux_) {
    const auto type_urls = Config::getAllVersionTypeUrls<envoy::config::cluster::v3::Cluster>();
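    // Note: when cluster_manager.warming_clusters is rejected by a stats_matcher, this gauge
    // read always returns 0, so the "previous_warming > 0" branch below never fires,
    // resume_cds_ is never reset, and CDS stays paused.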
    const uint64_t previous_warming = cm_stats_.warming_clusters_.value();
    if (previous_warming == 0 && !warming_clusters_.empty()) {
      resume_cds_ = ads_mux_->pause(type_urls);
    } else if (previous_warming > 0 && warming_clusters_.empty()) {
      ASSERT(resume_cds_ != nullptr);
      resume_cds_.reset();
    }
  }
  cm_stats_.active_clusters_.set(active_clusters_.size());
  cm_stats_.warming_clusters_.set(warming_clusters_.size());
}

That said, I'm not sure if this explains what we've seen or not, because we definitely get CDS updates after the initial load, but I'm not sure how it's working.

mattklein123 commented 4 years ago

Yeah someone needs to look into this. Anywhere we are using stats for control flow is broken.

dfjones commented 4 years ago

Our reason for stats filtering was to reduce the amount of data we emit to our stats collection service (Prometheus). So if the filtering affected just the output of the stats endpoints, that would be sufficient for our needs. Is there a compelling reason to disable the stats values even for use within Envoy? Is it for performance?

mattklein123 commented 4 years ago

It's just the way the implementation works; it also saves memory within Envoy.

dastbe commented 3 years ago

Just sanity-checking for myself: given that the cluster manager exists for the lifetime of Envoy, there's no reason that previous_warming couldn't just be tracked on the cluster manager directly, right? I'd be happy to go and fix this if so.
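A rough sketch of what that could look like, assuming a new uint64_t previous_warming_clusters_{0} member on ClusterManagerImpl (the member name is made up here) and leaving the rest of updateClusterCounts() as it is today; this is only an illustration of the idea, not the actual fix:

void ClusterManagerImpl::updateClusterCounts() {
  if (ads_mux_) {
    const auto type_urls = Config::getAllVersionTypeUrls<envoy::config::cluster::v3::Cluster>();
    // Pause/resume CDS based on state owned by the cluster manager itself, which a
    // stats_matcher cannot zero out, instead of reading the warming_clusters gauge.
    if (previous_warming_clusters_ == 0 && !warming_clusters_.empty()) {
      resume_cds_ = ads_mux_->pause(type_urls);
    } else if (previous_warming_clusters_ > 0 && warming_clusters_.empty()) {
      ASSERT(resume_cds_ != nullptr);
      resume_cds_.reset();
    }
    previous_warming_clusters_ = warming_clusters_.size();
  }
  // The gauges become purely informational and safe to filter out.
  cm_stats_.active_clusters_.set(active_clusters_.size());
  cm_stats_.warming_clusters_.set(warming_clusters_.size());
}

Another option along the same lines would be to key the decision off whether resume_cds_ is currently held (resume_cds_ == nullptr) rather than off a counter, which avoids the extra member entirely.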