envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Reduce used cluster memory by initialising http_async_client_ lazily #19332

Open ppadevski opened 2 years ago

ppadevski commented 2 years ago

Title: Reduce used cluster memory by initialising http_async_client_ lazily

Description: I've been trying to put my envoy on a diet and noticed that most (if not all) of my clusters do not seem to use httpAsyncClient(), yet I am paying a price for it in terms of used memory. I have 2500 TLS clusters, and by default I end up with about 249.7m of used memory in idle state. If I make http_async_client_ lazy with the patch below, I end up with about 209.7m of used memory, which is a huge difference. Below is the patch I used for measuring, though I haven't yet tested whether it works for clusters that actually need the httpAsyncClient().

I'll look around and see what else can be initialised lazily.

diff --git a/source/common/upstream/cluster_manager_impl.cc b/source/common/upstream/cluster_manager_impl.cc
index 703c1b7..791b917 100644
--- a/source/common/upstream/cluster_manager_impl.cc
+++ b/source/common/upstream/cluster_manager_impl.cc
@@ -1048,7 +1048,16 @@ Host::CreateConnectionData ClusterManagerImpl::ThreadLocalClusterManagerImpl::Cl

 Http::AsyncClient&
 ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::httpAsyncClient() {
-  return http_async_client_;
+  if (!http_async_client_) {
+      http_async_client_ =
+          std::make_unique<Http::AsyncClientImpl>(
+              cluster_info_, parent_.parent_.stats_, parent_.thread_local_dispatcher_,
+              parent_.parent_.local_info_, parent_.parent_, parent_.parent_.runtime_,
+              parent_.parent_.random_,
+              Router::ShadowWriterPtr{new Router::ShadowWriterImpl(parent_.parent_)},
+              parent_.parent_.http_context_, parent_.parent_.router_context_);
+  }
+  return *http_async_client_;
 }

 ClusterUpdateCallbacksHandlePtr
@@ -1346,12 +1355,7 @@ ClusterManagerImpl::ThreadLocalClusterManagerImpl::getHttpConnPoolsContainer(
 ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::ClusterEntry(
     ThreadLocalClusterManagerImpl& parent, ClusterInfoConstSharedPtr cluster,
     const LoadBalancerFactorySharedPtr& lb_factory)
-    : parent_(parent), lb_factory_(lb_factory), cluster_info_(cluster),
-      http_async_client_(cluster, parent.parent_.stats_, parent.thread_local_dispatcher_,
-                         parent.parent_.local_info_, parent.parent_, parent.parent_.runtime_,
-                         parent.parent_.random_,
-                         Router::ShadowWriterPtr{new Router::ShadowWriterImpl(parent.parent_)},
-                         parent_.parent_.http_context_, parent_.parent_.router_context_) {
+    : parent_(parent), lb_factory_(lb_factory), cluster_info_(cluster) {
   priority_set_.getOrCreateHostSet(0);

   // TODO(mattklein123): Consider converting other LBs over to thread local. All of them could
diff --git a/source/common/upstream/cluster_manager_impl.h b/source/common/upstream/cluster_manager_impl.h
index 2107ec2..0ff4fdd 100644
--- a/source/common/upstream/cluster_manager_impl.h
+++ b/source/common/upstream/cluster_manager_impl.h
@@ -431,7 +431,7 @@ private:
       // Current active LB.
       LoadBalancerPtr lb_;
       ClusterInfoConstSharedPtr cluster_info_;
-      Http::AsyncClientImpl http_async_client_;
+      Http::AsyncClientPtr http_async_client_;
     };

     using ClusterEntryPtr = std::unique_ptr<ClusterEntry>;
ppadevski commented 2 years ago

Unfortunately I will not be able to submit this patch myself, at least not from this account, as I'd have to go through some bureaucracy to submit any code to this repository. So for the time being please do not assign this issue to me; instead, ask whether someone else is interested in making these changes.

In the meantime I took another stab and noticed other things that can be initialised lazily. In particular, I noticed that we had a ton of std::_Sp_counted_ptr_inplace<bool, std::allocator<bool>, ...> allocations, even though my clusters have neither priority update callbacks nor member update callbacks registered. The allocations in question actually come from CallbackManager's const std::shared_ptr<bool> still_alive_{std::make_shared<bool>(true)}; and in particular from the PrioritySetImpl members host_sets_priority_update_cbs_, member_update_cb_helper_ and priority_update_cb_helper_.

If I apply some lazy initialisation (the patch is attached at the end of this comment), I end up with:

Unsigned chunks have 872063 instances taking 0x3fbbe28(66,829,864) bytes.
Signature 1cf5b80 (std::_Sp_counted_ptr_inplace<bool, std::allocator<bool>, (__gnu_cxx::_Lock_policy)2>) has 90081 instances taking 0x2c1040(2,887,744) bytes.
Signature 1d06d50 (Envoy::Upstream::HostSetImpl) has 85034 instances taking 0x1611a90(23,141,008) bytes.
Signature 1cfb040 (Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry) has 82533 instances taking 0xa13680(10,565,248) bytes.
Signature 1cea770 (Envoy::Extensions::Clusters::XXX::LoadBalancer) has 82500 instances taking 0x284880(2,640,000) bytes.

The remaining std::_Sp_counted_ptr_inplace<bool, std::allocator<bool>, ...> instances are probably callbacks for cluster updates, which we can live with for the time being. The HostSetImpl(s) are still fat and I haven't tried lazily initialising anything in them yet. ClusterEntry(s) have been significantly reduced, from 39,615,840 to 10,565,248 bytes, and the unsigned chunks total has dropped a lot with both patches. The XXX::LoadBalancer allocations bother me a bit, as they are all instances of a custom stateless LoadBalancer for which I end up with a lot of allocations but only need one.

The second patch:

diff --git a/source/common/upstream/upstream_impl.cc b/source/common/upstream/upstream_impl.cc
index 4f2beb1..a9f096e 100644
--- a/source/common/upstream/upstream_impl.cc
+++ b/source/common/upstream/upstream_impl.cc
@@ -548,11 +548,13 @@ PrioritySetImpl::getOrCreateHostSet(uint32_t priority,
   if (host_sets_.size() < priority + 1) {
     for (size_t i = host_sets_.size(); i <= priority; ++i) {
       HostSetImplPtr host_set = createHostSet(i, overprovisioning_factor);
-      host_sets_priority_update_cbs_.push_back(
-          host_set->addPriorityUpdateCb([this](uint32_t priority, const HostVector& hosts_added,
-                                               const HostVector& hosts_removed) {
-            runReferenceUpdateCallbacks(priority, hosts_added, hosts_removed);
-          }));
+      if (lazy_) {
+        lazy_->host_sets_priority_update_cbs_.push_back(
+            host_set->addPriorityUpdateCb([this](uint32_t priority, const HostVector& hosts_added,
+                                                 const HostVector& hosts_removed) {
+              runReferenceUpdateCallbacks(priority, hosts_added, hosts_removed);
+            }));
+      }
       host_sets_.push_back(std::move(host_set));
     }
   }
diff --git a/source/common/upstream/upstream_impl.h b/source/common/upstream/upstream_impl.h
index dfe0090..4a9c853 100644
--- a/source/common/upstream/upstream_impl.h
+++ b/source/common/upstream/upstream_impl.h
@@ -463,11 +463,25 @@ public:
   // From PrioritySet
   ABSL_MUST_USE_RESULT Common::CallbackHandlePtr
   addMemberUpdateCb(MemberUpdateCb callback) const override {
-    return member_update_cb_helper_.add(callback);
+    if (!member_update_cb_helper_) {
+      member_update_cb_helper_ =
+          std::make_unique<Common::CallbackManager<const HostVector&, const HostVector&>>();
+    }
+    return member_update_cb_helper_->add(callback);
   }
   ABSL_MUST_USE_RESULT Common::CallbackHandlePtr
   addPriorityUpdateCb(PriorityUpdateCb callback) const override {
-    return priority_update_cb_helper_.add(callback);
+    if (!lazy_) {
+      lazy_ = std::make_unique<Lazy>();
+      for (auto& host_set: host_sets_) {
+        lazy_->host_sets_priority_update_cbs_.push_back(
+            static_cast<HostSetImpl*>(host_set.get())->addPriorityUpdateCb([this](
+                uint32_t priority, const HostVector& hosts_added, const HostVector& hosts_removed) {
+              runReferenceUpdateCallbacks(priority, hosts_added, hosts_removed);
+            }));
+      }
+    }
+    return lazy_->priority_update_cb_helper_.add(callback);
   }
   const std::vector<std::unique_ptr<HostSet>>& hostSetsPerPriority() const override {
     return host_sets_;
@@ -493,11 +507,15 @@ protected:

 protected:
   virtual void runUpdateCallbacks(const HostVector& hosts_added, const HostVector& hosts_removed) {
-    member_update_cb_helper_.runCallbacks(hosts_added, hosts_removed);
+    if (member_update_cb_helper_) {
+      member_update_cb_helper_->runCallbacks(hosts_added, hosts_removed);
+    }
   }
   virtual void runReferenceUpdateCallbacks(uint32_t priority, const HostVector& hosts_added,
-                                           const HostVector& hosts_removed) {
-    priority_update_cb_helper_.runCallbacks(priority, hosts_added, hosts_removed);
+                                           const HostVector& hosts_removed) const {
+    if (lazy_) {
+      lazy_->priority_update_cb_helper_.runCallbacks(priority, hosts_added, hosts_removed);
+    }
   }
   // This vector will generally have at least one member, for priority level 0.
   // It will expand as host sets are added but currently does not shrink to
@@ -505,13 +523,18 @@ protected:
   std::vector<std::unique_ptr<HostSet>> host_sets_;

 private:
-  // This is a matching vector to store the callback handles for host_sets_. It is kept separately
-  // because host_sets_ is directly returned so we avoid translation.
-  std::vector<Common::CallbackHandlePtr> host_sets_priority_update_cbs_;
-  // TODO(mattklein123): Remove mutable.
-  mutable Common::CallbackManager<const HostVector&, const HostVector&> member_update_cb_helper_;
-  mutable Common::CallbackManager<uint32_t, const HostVector&, const HostVector&>
-      priority_update_cb_helper_;
+  mutable std::unique_ptr<Common::CallbackManager<const HostVector&, const HostVector&>> member_update_cb_helper_;
+
+  struct Lazy {
+    // This is a matching vector to store the callback handles for host_sets_. It is kept separately
+    // because host_sets_ is directly returned so we avoid translation.
+    std::vector<Common::CallbackHandlePtr> host_sets_priority_update_cbs_;
+    // TODO(mattklein123): Remove mutable.
+    Common::CallbackManager<uint32_t, const HostVector&, const HostVector&> priority_update_cb_helper_;
+  };
+
+  mutable std::unique_ptr<Lazy> lazy_;
+
   bool batch_update_ : 1;

   // Helper class to maintain state as we perform multiple host updates. Keeps track of all hosts
ppadevski commented 2 years ago

With a rather tiny patch in HostSetImpl we can end up with:

Unsigned chunks have 874430 instances taking 0x3fd5e18(66,936,344) bytes.
Signature 1d07d10 (Envoy::Upstream::HostSetImpl) has 85034 instances taking 0xf96c30(16,346,160) bytes.
Signature 1cfc140 (Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry) has 82533 instances taking 0xa13480(10,564,736) bytes.
Signature 1ceb870 (Envoy::Extensions::Clusters::XXX::LoadBalancer) has 82500 instances taking 0x285ea0(2,645,664) bytes.
Signature 1cfcef8 (std::_Sp_counted_ptr_inplace<Envoy::Upstream::HostsPerLocalityImpl, std::allocator<Envoy::Upstream::HostsPerLocalityImpl>, (__gnu_cxx::_Lock_policy)2>) has 10005 instances taking 0x9d070(643,184) bytes.
Signature 1cfced8 (Envoy::Common::CallbackManager<unsigned int, std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > > const&, std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > > const&>::CallbackHolder) has 7635 instances taking 0x961a0(614,816) bytes.
Signature 1cf6c80 (std::_Sp_counted_ptr_inplace<bool, std::allocator<bool>, (__gnu_cxx::_Lock_policy)2>) has 7580 instances taking 0x3d960(252,256) bytes.

The decrease is due to making the following changes in HostSetImpl:

  1. make member_update_cb_helper_ lazy
  2. wrap healthy_locality_entries_, healthy_locality_scheduler_, degraded_locality_entries_ and degraded_locality_scheduler_ in a Lazy struct, as not all clusters use locality weights and std::vectors are ... expensive.

Although I don't have a patch for this, it seems that simply having lots of shared_ptr members in HostSetImpl increases memory significantly, as we have one HostSetImpl per cluster per thread/worker. I don't know if it would be possible to group the members that are passed via PrioritySet::UpdateHostsParams and share them across threads/workers, e.g. going

From

  HostVectorConstSharedPtr hosts_;
  HealthyHostVectorConstSharedPtr healthy_hosts_;
  DegradedHostVectorConstSharedPtr degraded_hosts_;
  ExcludedHostVectorConstSharedPtr excluded_hosts_;
  HostsPerLocalityConstSharedPtr hosts_per_locality_{HostsPerLocalityImpl::empty()};
  HostsPerLocalityConstSharedPtr healthy_hosts_per_locality_{HostsPerLocalityImpl::empty()};
  HostsPerLocalityConstSharedPtr degraded_hosts_per_locality_{HostsPerLocalityImpl::empty()};
  HostsPerLocalityConstSharedPtr excluded_hosts_per_locality_{HostsPerLocalityImpl::empty()};

To

  struct Hosts {
    HostVectorConstSharedPtr all_;            // was hosts_
    HealthyHostVectorConstSharedPtr healthy_; // was healthy_hosts_
    // ... etc.
  };

  std::shared_ptr<Hosts> hosts_; // => hosts_->all_
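
For a rough sense of scale (illustrative arithmetic, assuming a 16-byte shared_ptr on a 64-bit build): eight shared_ptr members are 128 bytes per HostSetImpl versus 16 bytes for a single std::shared_ptr<Hosts>, and with the ~85,000 HostSetImpl instances shown above that is (128 - 16) * 85,000 ≈ 9.5 MB of pointer members alone, before counting the pointed-to data.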

The third patch:

diff --git a/source/common/upstream/upstream_impl.cc b/source/common/upstream/upstream_impl.cc
index 4f2beb1..f18280b 100644
--- a/source/common/upstream/upstream_impl.cc
+++ b/source/common/upstream/upstream_impl.cc
@@ -401,14 +401,20 @@ void HostSetImpl::updateHosts(PrioritySet::UpdateHostsParams&& update_hosts_para
   excluded_hosts_per_locality_ = std::move(update_hosts_params.excluded_hosts_per_locality);
   locality_weights_ = std::move(locality_weights);

-  rebuildLocalityScheduler(healthy_locality_scheduler_, healthy_locality_entries_,
-                           *healthy_hosts_per_locality_, healthy_hosts_->get(), hosts_per_locality_,
-                           excluded_hosts_per_locality_, locality_weights_,
-                           overprovisioning_factor_);
-  rebuildLocalityScheduler(degraded_locality_scheduler_, degraded_locality_entries_,
-                           *degraded_hosts_per_locality_, degraded_hosts_->get(),
-                           hosts_per_locality_, excluded_hosts_per_locality_, locality_weights_,
-                           overprovisioning_factor_);
+  if (locality_weights_ != nullptr && !locality_weights_->empty()) {
+    if (!lazy_) {
+      lazy_ = std::make_unique<Lazy>();
+    }
+
+    rebuildLocalityScheduler(lazy_->healthy_locality_scheduler_, lazy_->healthy_locality_entries_,
+                             *healthy_hosts_per_locality_, healthy_hosts_->get(), hosts_per_locality_,
+                             excluded_hosts_per_locality_, locality_weights_,
+                             overprovisioning_factor_);
+    rebuildLocalityScheduler(lazy_->degraded_locality_scheduler_, lazy_->degraded_locality_entries_,
+                             *degraded_hosts_per_locality_, degraded_hosts_->get(),
+                             hosts_per_locality_, excluded_hosts_per_locality_, locality_weights_,
+                             overprovisioning_factor_);
+  }

   runUpdateCallbacks(hosts_added, hosts_removed);
 }
@@ -457,11 +463,11 @@ void HostSetImpl::rebuildLocalityScheduler(
 }

 absl::optional<uint32_t> HostSetImpl::chooseHealthyLocality() {
-  return chooseLocality(healthy_locality_scheduler_.get());
+  return chooseLocality(lazy_ ? lazy_->healthy_locality_scheduler_.get() : nullptr);
 }

 absl::optional<uint32_t> HostSetImpl::chooseDegradedLocality() {
-  return chooseLocality(degraded_locality_scheduler_.get());
+  return chooseLocality(lazy_ ? lazy_->degraded_locality_scheduler_.get() : nullptr);
 }

 absl::optional<uint32_t>

diff --git a/source/common/upstream/upstream_impl.h b/source/common/upstream/upstream_impl.h
index dfe0090..5dc8ede 100644
--- a/source/common/upstream/upstream_impl.h
+++ b/source/common/upstream/upstream_impl.h
@@ -317,8 +317,10 @@ public:
       : priority_(priority), overprovisioning_factor_(overprovisioning_factor.has_value()
                                                           ? overprovisioning_factor.value()
                                                           : kDefaultOverProvisioningFactor),
-        hosts_(new HostVector()), healthy_hosts_(new HealthyHostVector()),
-        degraded_hosts_(new DegradedHostVector()), excluded_hosts_(new ExcludedHostVector()) {}
+        hosts_(std::make_shared<HostVector>()),
+        healthy_hosts_(std::make_shared<HealthyHostVector>()),
+        degraded_hosts_(std::make_shared<DegradedHostVector>()),
+        excluded_hosts_(std::make_shared<ExcludedHostVector>()) {}

   /**
    * Install a callback that will be invoked when the host set membership changes.
@@ -327,7 +329,11 @@ public:
    */
   ABSL_MUST_USE_RESULT Common::CallbackHandlePtr
   addPriorityUpdateCb(PrioritySet::PriorityUpdateCb callback) const {
-    return member_update_cb_helper_.add(callback);
+    if (!member_update_cb_helper_) {
+      member_update_cb_helper_ =
+          std::make_unique<Common::CallbackManager<uint32_t, const HostVector&, const HostVector&>>();
+    }
+    return member_update_cb_helper_->add(callback);
   }

   // Upstream::HostSet
@@ -387,7 +393,9 @@ public:

 protected:
   virtual void runUpdateCallbacks(const HostVector& hosts_added, const HostVector& hosts_removed) {
-    member_update_cb_helper_.runCallbacks(priority_, hosts_added, hosts_removed);
+    if (member_update_cb_helper_) {
+      member_update_cb_helper_->runCallbacks(priority_, hosts_added, hosts_removed);
+    }
   }

 private:
@@ -411,7 +419,7 @@ private:
   HostsPerLocalityConstSharedPtr degraded_hosts_per_locality_{HostsPerLocalityImpl::empty()};
   HostsPerLocalityConstSharedPtr excluded_hosts_per_locality_{HostsPerLocalityImpl::empty()};
   // TODO(mattklein123): Remove mutable.
-  mutable Common::CallbackManager<uint32_t, const HostVector&, const HostVector&>
+  mutable std::unique_ptr<Common::CallbackManager<uint32_t, const HostVector&, const HostVector&>>
       member_update_cb_helper_;
   // Locality weights (used to build WRR locality_scheduler_);
   LocalityWeightsConstSharedPtr locality_weights_;
@@ -446,10 +454,14 @@ private:

   static absl::optional<uint32_t> chooseLocality(EdfScheduler<LocalityEntry>* locality_scheduler);

-  std::vector<std::shared_ptr<LocalityEntry>> healthy_locality_entries_;
-  std::unique_ptr<EdfScheduler<LocalityEntry>> healthy_locality_scheduler_;
-  std::vector<std::shared_ptr<LocalityEntry>> degraded_locality_entries_;
-  std::unique_ptr<EdfScheduler<LocalityEntry>> degraded_locality_scheduler_;
+  struct Lazy {
+    std::vector<std::shared_ptr<LocalityEntry>> healthy_locality_entries_;
+    std::unique_ptr<EdfScheduler<LocalityEntry>> healthy_locality_scheduler_;
+    std::vector<std::shared_ptr<LocalityEntry>> degraded_locality_entries_;
+    std::unique_ptr<EdfScheduler<LocalityEntry>> degraded_locality_scheduler_;
+  };
+
+  std::unique_ptr<Lazy> lazy_;
 };

 using HostSetImplPtr = std::unique_ptr<HostSetImpl>;
mattklein123 commented 2 years ago

I agree this is a great thing to explore. There may be some very easy wins here for many configurations. cc @rojkov @jmarantz

ppadevski commented 2 years ago

I think that I have a working solution to get rid of almost all stateless LoadBalancers. For the record, they are stateless because I use the filter state in chooseHost. The sample code is below in case anyone else has this problem. The placement new in operator delete is mandatory, as otherwise we seem to reset the vptr and crash.

#include <iostream>
#include <memory>

struct Base { // base LoadBalancer
   virtual ~Base() { std::cout << "~Base" << std::endl; }
};

struct Derived : public Base { // custom LoadBalancer
   ~Derived() { std::cout << "~Derived" << std::endl; }

   // unique_ptr runs ~Derived() and then calls this; re-constructing in place
   // keeps the thread-local instance valid for the next Factory::create().
   void operator delete(void *p) { new (p) Derived(); }
};

static thread_local Derived lb; // one instance per worker thread, handed out repeatedly

struct Factory { // LoadBalancer factory
   static std::unique_ptr<Base> create() { // virtual or not, what matters most is the return type
      return std::unique_ptr<Derived>(&lb);
   }
};

int
main()
{
   {
      auto one = Factory::create();
   }
   {
      auto two = Factory::create();
   }
   {
      auto one = Factory::create();
      auto two = Factory::create();
   }
   {
      auto base = std::make_unique<Base>();
   }
}
jmarantz commented 2 years ago

At a high level this makes sense to me -- we should avoid allocating a ton of memory for features we are not using. What's the usage scenario, though, for not using an http async client? Would we not need those for getting origin data from upstreams? Seems plausible but I just wanted to understand why that would be common.

The tradeoff is complexity/risk/cost due to indirection for those who are using these features. Also note that the lazy pointers may need mutex and/or atomic protection if they are lazily allocated and not thread local, which might create bottlenecks. See AtomicPtr and AtomicPtrArray in source/common/common/thread.h for a way around that issue.
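
For illustration only, here is one generic way to guard a lazily created member that is not thread local; this is a minimal sketch using std::call_once rather than Envoy's AtomicPtr, and all names in it are made up:

#include <memory>
#include <mutex>

struct AsyncClientStub {}; // stand-in for the real, expensive-to-build client

class LazyEntry {
public:
  // The first caller constructs the client; later callers only pay the
  // std::call_once fast path, so concurrent first use is safe.
  AsyncClientStub& client() {
    std::call_once(once_, [this] { client_ = std::make_unique<AsyncClientStub>(); });
    return *client_;
  }

private:
  std::once_flag once_;
  std::unique_ptr<AsyncClientStub> client_;
};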

It does seem like you are particularly motivated to get this done. What are the roadblocks to you actually driving this yourself? Otherwise you are kind of waiting for interested parties to queue up for this kind of risky work.

Last point: I did not understand the placement-new-in-operator-delete hack above, but I didn't look at it too long. Seems like a weird solution to a problem I hope we can avoid :)

wbpcode commented 2 years ago

Here are some things we could maybe try. But the most important thing is what @jmarantz said: complexity/risk/cost.

In fact, the current cm/lb/eds code is already very complicated, and it is not a good thing to keep adding new complexity.

It seems that aggregating shared members may reduce some memory overhead while also simplifying the code. 🤔

rojkov commented 2 years ago

If we guarantee that all accesses to a lazy member happen through a getter which controls the state (checks for existence and instantiates the member safely), like in the first patch, then the complexity can be kept under control.

ppadevski commented 2 years ago

> What's the usage scenario, though, for not using an http async client?

This is the most common usage scenario. In most cases, when you only have the router http filter, this http client is not used by it at all. This particular http async client is used when you have e.g. a filter that does StopIterationAndWatermark, uses the http async client to make an RPC, and then continues decoding on success.

I believe that none of the patches that were shared require synchronization. If they do, I'd perhaps have to rethink them. So far I haven't seen any crashes or issues with my 2500 clusters, but I haven't run the most thorough tests. I also haven't run any tests with these patches on my other envoy instances, so something may come up.

> It does seem like you are particularly motivated to get this done. What are the roadblocks to you actually driving this yourself? Otherwise you are kind of waiting for interested parties to queue up for this kind of risky work.

I am very motivated to get this done and I am definitely going to get it done in my mirror. I cannot afford to upstream this officially at the moment, but I would rather have an official patch than have to rebase my local one over and over again (which I am already doing for lots of other patches).

I think it would be very good to static_assert the size of ClusterEntry, HostSetImpl, PrioritySetImpl and possibly other thread-local object instances, so that if someone happens to add extra members to them they'd at least be warned and would have to bump the number and provide some reasoning.
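
A hypothetical guard along these lines (the bound is illustrative, not the real sizeof, and the assert would have to live somewhere with access to the private nested type):

static_assert(sizeof(ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry) <= 128,
              "ClusterEntry is allocated per cluster per worker; make the new member "
              "lazy or justify the growth before raising this limit.");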

Side note: I think that envoy with less memory thanks to the above patches actually behaves better, as now the per-cpu tcmalloc caches (3MB) aren't full of mostly-const data that isn't allocated/deallocated during proxying, which is the most common usage. Not sure if there is a way to move these objects out of the per-cpu caches.

jmarantz commented 2 years ago

RE "hasn't crashed yet" implies "no threading issues": I would like a more principled reasoning :)

RE "can't afford to upstream this": I feel like the cost of upstreaming is much less than the cost of regularly patching updates and possibly dealing with merge conflicts.

Having said this I'm definitely in favor of memory improvements, but (again) I want to make sure we are optimizing for the common case. Don't we typically need an async http fetch from upstream?

jmarantz commented 2 years ago

Oh sorry I missed "This is the most common usage scenario. In most cases when you only have the router http filter this http client is not used" as it was formatted like it was part of the question rather than the response.

ppadevski commented 2 years ago

> Oh sorry I missed "This is the most common usage scenario. In most cases when you only have the router http filter this http client is not used" as it was formatted like it was part of the question rather than the response.

My bad.

tpetkov-VMW commented 2 years ago

There are still a few other proposed patches here; should we open separate issues, @alyssawilk?

alyssawilk commented 2 years ago

Sorry about that, reopened