Open dfjones opened 4 years ago
From a drive by, this falls into the general category of bugs in which we are using stats for control flow, which we need to stop doing.
@mattklein123 Do you think until these bugs are resolved a warning in the docs is warranted? This bit us pretty hard in production and then took a lot of investigation before we traced it down to the stats filtering.
@mattklein123 Do you think until these bugs are resolved a warning in the docs is warranted? This bit us pretty hard in production and then took a lot of investigation before we traced it down to the stats filtering.
Yes, if you want to do a PR that would be good. Do you know which stat causes the issue here? That would cut down on investigation time for whoever fixes this.
Sure thing, I'll see if I can narrow down what stat caused the issue.
I've traced it down to something in the cluster.*
stats. However, I haven't yet figured out what particular stat under cluster
is necessary. cluster.*.version
seemed like an obvious guess, but if this is the only one present the bug persists. Perhaps its a combo?
Edit: here's a branch in my repro case where I'm trying to narrow down the stats that cause the issue: https://github.com/dfjones/envoy-stats-filter-repro/tree/min-stats-repro
This might be a dup of https://github.com/envoyproxy/envoy/issues/8771
I took a look and this does appear to be a dup of #8771. In fact, this comment tipped me off and I found that I can reproduce the issue by just excluding:
suffix: "warming_clusters"
I've updated my repo to show this: https://github.com/dfjones/envoy-stats-filter-repro/tree/min-stats-repro
I assume since this issue is open, there hasn't been a fix? We've encountered a similar sounding deadlock, and realized we are not including cluster_manager
stats in our inclusion list.
This code also seems suspicious since it's relying on stats and it looks like it will never resume CDS if cm_stats_.warming_clusters_.value()
is always zero.
void ClusterManagerImpl::updateClusterCounts() {
// This if/else block implements a control flow mechanism that can be used by an ADS
// implementation to properly sequence CDS and RDS updates. It is not enforcing on ADS. ADS can
// use it to detect when a previously sent cluster becomes warm before sending routes that depend
// on it. This can improve incidence of HTTP 503 responses from Envoy when a route is used before
// it's supporting cluster is ready.
//
// We achieve that by leaving CDS in the paused state as long as there is at least
// one cluster in the warming state. This prevents CDS ACK from being sent to ADS.
// Once cluster is warmed up, CDS is resumed, and ACK is sent to ADS, providing a
// signal to ADS to proceed with RDS updates.
// If we're in the middle of shutting down (ads_mux_ already gone) then this is irrelevant.
if (ads_mux_) {
const auto type_urls = Config::getAllVersionTypeUrls<envoy::config::cluster::v3::Cluster>();
const uint64_t previous_warming = cm_stats_.warming_clusters_.value();
if (previous_warming == 0 && !warming_clusters_.empty()) {
resume_cds_ = ads_mux_->pause(type_urls);
} else if (previous_warming > 0 && warming_clusters_.empty()) {
ASSERT(resume_cds_ != nullptr);
resume_cds_.reset();
}
}
cm_stats_.active_clusters_.set(active_clusters_.size());
cm_stats_.warming_clusters_.set(warming_clusters_.size());
}
That said, I'm not sure if this explains what we've seen or not, because we definitely get CDS updates after the initial load, but I'm not sure how it's working.
Yeah someone needs to look into this. Anywhere we are using stats for control flow is broken.
Our needs for stats filtering was to reduce the amount of data we emit to our stats collection service (prometheus). So, if the filtering was to affect just the output of the stats endpoints, it would be sufficient for our needs. Is there a compelling reason to disable the stats values even for use within Envoy? Is it for performance?
It's just the way the implementation works, also memory saving within Envoy.
just sanity checking for myself, but for the above given that cluster manager exists for the lifetime of envoy there's no reason that previous_warming couldn't just be tracked on the cluster manager directly, right? i'd be happy to go and fix this if so.
Description: When using a
stats_matcher
inclusion_list
, Envoy using the ADS variant of the xDS protocol will fail to receive cluster updates from the control-plane. After receiving an initial Cluster list, Envoy will no longer receive cluster updates. When paired with envoyproxy/go-control-plane in ADS mode, this can cause EDS updates to stall if the cluster list is later expanded.Repro steps: A docker-compose reproduction of the issue with Envoy 1.12.2 and 1.13.0
Relevant Links: