Not an expert here, but from https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/dynamic_forward_proxy_filter#statistics
I think envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts{} 6752 is saying we have 6752 hosts in the cache. We have 10k set as the limit, so the cache shouldn't be forcing eviction due to lack of space and still has capacity to grow, as you see above -- so I don't think this is a "memory leak".
@yanjunxiang-google @alyssawilk may know more.
The default max_hosts is 1k. We need some calculation to estimate how much memory is needed for the DnsCacheImpl and DFP cluster data structures for 6752 hosts. These hosts also do DNS refreshing periodically, so it's probably a good idea to tweak dns_refresh_rate and min_dns_refresh_rate in dns_cache.proto to slow that down.
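For reference, a minimal sketch of what tuning those knobs could look like inside the filter's dns_cache_config. Values are illustrative only, not recommendations; the comment above calls the second knob min_dns_refresh_rate, while the field in dns_cache.proto is dns_min_refresh_rate if I recall correctly, so double-check against your Envoy version:

dns_cache_config:
  name: dynamic_forward_proxy_cache_config
  max_hosts: 10000            # the cache size cap discussed above (proto default is 1024)
  dns_refresh_rate: 120s      # slow down periodic re-resolution of cached hosts
  dns_min_refresh_rate: 30s   # floor on how often resolved hosts are re-resolved
  host_ttl: 300s              # evict hosts that receive no traffic for this long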
@yanjunxiang-google @KBaichoo in a steady state (the inbound load is more-or-less constant) it consumes less than 1GiB of RAM. Then something drastically changes and memory starts growing linearly at 10MiB/s until OOM.
Here is how it usually looks when a host switches from "good" to "bad" and eventually OOMs:
Also, consuming 10GiB of memory for 10k hosts would imply 1MiB per entry, which does not sound too realistic. For example, a "good" host has ~6000 in envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts with an RSS of about 400 MiB. That is about 64 KiB per entry, which is way more believable.
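Spelling out the arithmetic behind those estimates: 10 GiB across 10,000 hosts would be about 1 MiB per host, whereas 400 MiB across ~6,000 hosts works out to a few tens of KiB per host.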
@rbtz-openai Could you please share the Envoy trace logs when this is happening?
@yanjunxiang-google This Envoy is under relatively high load, so sharing full logs would be complicated; maybe we can start with something simpler? E.g., based on the diff heap profile, the memory leak happens somewhere around the DNS/cluster interaction (heap profile diff attached as profile001).
Here is grep -E 'cluster|dns' envoy.admin.log for around 10s of traffic: envoy.public.log
If that is not enough, I can try to bleed traffic off the host and see what it does w/o any inbound load.
Seems it's leaking in the mutable_cross_priority_host_map_, according to the heap profile.
@rbtz-openai What is the time difference between the two profile files? It's leaking ~100 MB between the two profiles; does that match leaking 10GB over 20 minutes?
@doujiang24 yep, the time difference was about 10s between profiles (leak rate was about 10MiB/s)
Okay, I see. Also, it seems DNS refreshes frequently, according to the trace log. I'll try to reproduce it on my side.
Okay, I reproduced a leak on my side. Here are the key logs, with comments:
# ip changed, host_info_ is also changed
[2023-12-08 22:36:00.952][83629][debug][forward_proxy] [source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc:404] host 'test.service:80' address has changed from 210.209.160.193:80 to 166.137.92.127:80
# expired, removing the new host_info_, the old address is leaking
[2023-12-08 22:36:05.952][83629][debug][forward_proxy] [source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc:262] host='test.service:80' TTL expired, removing
This leak happens in MainPrioritySetImpl::updateCrossPriorityHostMap, which matches the info in the profile.
For now, you could try the new sub_cluster_config:
https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/dynamic_forward_proxy/v3/dynamic_forward_proxy.proto#envoy-v3-api-field-extensions-filters-http-dynamic-forward-proxy-v3-filterconfig-sub-cluster-config
It will create strict_dns sub-clusters for each host, with TTL enabled, instead of using logical_dns.
Here is a simple example that works on my side.
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 7010
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: dynamic_forward_proxy_cluster
          http_filters:
          - name: envoy.filters.http.dynamic_forward_proxy
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
              sub_cluster_config:
                cluster_init_timeout: 10s
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: dynamic_forward_proxy_cluster
    connect_timeout: 1s
    lb_policy: CLUSTER_PROVIDED
    dns_resolvers:
    - socket_address:
        address: 127.0.0.1
        port_value: 5300
    cluster_type:
      name: envoy.clusters.dynamic_forward_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
        sub_clusters_config:
          lb_policy: round_robin
          max_sub_clusters: 100
          sub_cluster_ttl: 10s
Also, I'll try to create a PR to fix the leak in the logical_dns implementation.
Small update:
We had a seemingly unrelated issue with the brotli decompressor in our config that occasionally got into an infinite loop inside a wrk:worker_X thread. Removing the brotli decompression block [1] also somehow fixed the runaway memory usage spikes. There is still some gradual memory usage increase, but it is likely due to our high number of hosts.
[1]
- name: envoy.filters.http.decompressor
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.decompressor.v3.Decompressor
    decompressor_library:
      name: default
      typed_config:
        "@type": "type.googleapis.com/envoy.extensions.compression.brotli.decompressor.v3.Brotli"
    request_direction_config:
      common_config:
        enabled:
          default_value: false
          runtime_key: request_decompressor_enabled
Wow, thanks for your update; it seems the brotli decompression is the real cause in your case. I'll recheck the memory leak that I found in the DFP.
Update: https://github.com/envoyproxy/envoy/pull/31433 fixes a memory leak that matches the profile.
@rbtz-openai Could you please reconfirm that the DFP leak can be ignored in your production case? We are thinking of reverting the bugfix in https://github.com/envoyproxy/envoy/pull/31433.
Sorry for the late reply. Yeah, after removing brotli, the DFP leak, if it exists, is not measurable at our scale.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Description: We have an Envoy in dynamic forward proxy mode. It usually behaves well, but sometimes it gets into a state where it tries to allocate tens of MiB/s and eventually gets OOM'ed:
Full heap profiles show that most of the memory is inside DnsCacheImpl (though it should be limited to 10000 entries in theory).
Heap dumps
Diff heap profile:
Raw profiles: https://gist.github.com/rbtz-openai/13d35aea14013f12273c6aa7478184cb
Admin and Stats Output:
There are a lot of clusters due to the dynamic forward proxy mode:
Config:
The interesting part of the config is the dynamic forward proxy with DNS cache (same configuration in http_filters).
cc: @alyssawilk @euroelessar
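The reporter's actual config is not inlined above, but for context, a generic sketch of a DFP setup with a shared DNS cache looks roughly like this (illustrative only, not the reporter's config; the 10k max_hosts cap and the dynamic_forward_proxy_cache_config name are taken from the discussion above):

http_filters:
- name: envoy.filters.http.dynamic_forward_proxy
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
    dns_cache_config:
      name: dynamic_forward_proxy_cache_config
      dns_lookup_family: V4_ONLY
      max_hosts: 10000
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
clusters:
- name: dynamic_forward_proxy_cluster
  connect_timeout: 1s
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        # must match the filter's cache config so both share one DnsCacheImpl
        name: dynamic_forward_proxy_cache_config
        dns_lookup_family: V4_ONLY
        max_hosts: 10000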