envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Memory leak in dynamic forward proxy mode w/ DNS Cache #30999

Closed rbtz-openai closed 6 months ago

rbtz-openai commented 11 months ago

Description: We have an Envoy instance in dynamic forward proxy mode. It usually behaves well, but sometimes it gets into a state where it tries to allocate tens of MiB/s and eventually gets OOM-killed: (screenshot: memory usage, 2023-11-21)

Full heap profiles show that most of the memory is inside DnsCacheImpl (though in theory it should be limited to 10000 entries).

As a side note, I've also noticed that envoy tries to resolve internal search paths (*.svc.cluster.local.), even though no_default_search_domain is set to true.

Heap dumps: diff heap profile (image: profile001)

Raw profiles: https://gist.github.com/rbtz-openai/13d35aea14013f12273c6aa7478184cb

Admin and Stats Output:

 "version": "b5ca88acee3453c9459474b8f22215796eff4dde/1.28.0/Clean/RELEASE/BoringSSL",

There are a lot of clusters due to dynamic forward proxy mode:

# curl -s localhost:XXX/clusters | fgrep -c hostname
6752
$ curl -s localhost:XXX/stats/prometheus | fgrep dns
# TYPE envoy_dns_cares_get_addr_failure counter
envoy_dns_cares_get_addr_failure{} 310
# TYPE envoy_dns_cares_not_found counter
envoy_dns_cares_not_found{} 178
# TYPE envoy_dns_cares_resolve_total counter
envoy_dns_cares_resolve_total{} 186188
# TYPE envoy_dns_cares_timeouts counter
envoy_dns_cares_timeouts{} 32
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_cache_load counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_cache_load{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_attempt counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_attempt{} 184881
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_failure counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_failure{} 667
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_success counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_success{} 184213
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_timeout counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_timeout{} 365
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_rq_pending_overflow counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_rq_pending_overflow{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_added counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_added{} 33253
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_address_changed counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_address_changed{} 90554
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_overflow counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_overflow{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_removed counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_removed{} 26501
# TYPE envoy_dns_cares_pending_resolutions gauge
envoy_dns_cares_pending_resolutions{} 1
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_open gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_open{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_remaining gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_remaining{} 1024
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts{} 6752

Config:

The interesting part of the config is the dynamic forward proxy cluster with its DNS cache (the same dns_cache_config is used in http_filters).

    cluster_type:
      name: envoy.clusters.dynamic_forward_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
        dns_cache_config:
          name: dynamic_forward_proxy_cache_config
          max_hosts: 10000
          dns_lookup_family: V4_ONLY
          typed_dns_resolver_config:
            name: envoy.network.dns_resolver.cares
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
              resolvers:
                - socket_address:
                    address: 8.8.8.8
                    port_value: 53
                - socket_address:
                    address: 1.1.1.1
                    port_value: 53
                - socket_address:
                    address: 8.8.4.4
                    port_value: 53
                - socket_address:
                    address: 1.0.0.1
                    port_value: 53
              dns_resolver_options:
                use_tcp_for_dns_lookups: true
                # There is no need to use the default search domain when resolving external requests
                no_default_search_domain: true
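
For reference, since the same dns_cache_config is reused on the filter side, the http_filters entry looks roughly like the sketch below (a minimal illustration rather than the exact production config; the shared cache name is what ties the filter and the cluster to a single DnsCacheImpl):

    http_filters:
    - name: envoy.filters.http.dynamic_forward_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
        dns_cache_config:
          name: dynamic_forward_proxy_cache_config
          max_hosts: 10000
          dns_lookup_family: V4_ONLY
          # typed_dns_resolver_config omitted; it mirrors the cluster-side config above
    - name: envoy.filters.http.router
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router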

cc: @alyssawilk @euroelessar

KBaichoo commented 11 months ago

Not an expert here, but from https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/dynamic_forward_proxy_filter#statistics

I think envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts{} 6752 is saying we have 6752 hosts in the cache. We have 10k set as the limit, so the cache will force eviction once it runs out of space, and right now it still has capacity to grow, as you see above -- so I don't think this is a "memory leak".

@yanjunxiang-google @alyssawilk may know more.

yanjunxiang-google commented 11 months ago

The default max_hosts is 1k. We need some calculation to estimate how much memory the DnsCacheImpl and DFP cluster data structures need for 6752 hosts. These hosts also refresh their DNS entries periodically, so it's probably a good idea to tweak dns_refresh_rate and min_dns_refresh_rate in dns_cache.proto to slow that down.
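
For illustration only, a hedged sketch of what slowing down refreshes could look like in the existing dns_cache_config (the values are arbitrary examples, not recommendations, and the effective refresh behavior also depends on the TTLs returned by the resolver):

    dns_cache_config:
      name: dynamic_forward_proxy_cache_config
      max_hosts: 10000
      dns_lookup_family: V4_ONLY
      # Example values only: refresh cached hosts less often and evict
      # hosts that have not been used for 5 minutes.
      dns_refresh_rate: 120s
      host_ttl: 300s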

rbtz-openai commented 11 months ago

@yanjunxiang-google @KBaichoo in a steady state (the inbound load is more-or-less constant) it consumes less than 1 GiB of RAM. Then something drastically changes and memory starts growing linearly at 10 MiB/s until OOM.

Here is how it usually looks when a host switches from "good" to "bad", and eventually OOMs:

(screenshot: memory usage, 2023-11-22)

Also, consuming 10 GiB of memory for 10k hosts would imply 1 MiB per entry, which does not sound realistic. For example, a "good" host reports ~6000 envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts with an RSS of about 400 MiB. That is about 64 KiB per entry, which is far more believable.

yanjunxiang-google commented 11 months ago

@rbtz-openai Could you please share the Envoy trace logs when this is happening?

rbtz-openai commented 11 months ago

@yanjunxiang-google This Envoy is under relatively high load, so sharing full logs would be complicated; maybe we can start with something simpler? E.g., based on the diff heap profile, the memory leak happens somewhere around the DNS/cluster interaction (diff heap profile: profile001).

Here is grep -E 'cluster|dns' envoy.admin.log for around 10s of traffic: envoy.public.log

If that is not enough, I can try to bleed traffic off the host and see what it does without any inbound load.

doujiang24 commented 11 months ago

seems it's leaking in the mutable_cross_priority_host_map_, according to the heap profile.

@rbtz-openai What is the time difference between the two profile files? It leaked about 100 MB between the two profiles; does that match leaking 10 GB over 20 minutes?

rbtz-openai commented 11 months ago

@doujiang24 yep, the time difference was about 10s between profiles (leak rate was about 10MiB/s)

doujiang24 commented 11 months ago

Okay, I see. Also, it seems DNS refreshes happen frequently, according to the trace log. I'll try to reproduce it on my side.

doujiang24 commented 11 months ago

Okay, I reproduced a leak on my side. Here are the key logs, with comments:

# ip changed, host_info_ is also changed
[2023-12-08 22:36:00.952][83629][debug][forward_proxy] [source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc:404] host 'test.service:80' address has changed from 210.209.160.193:80 to 166.137.92.127:80

# expired, removing the new host_info_, the old address is leaking
[2023-12-08 22:36:05.952][83629][debug][forward_proxy] [source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc:262] host='test.service:80' TTL expired, removing

This leak happens in MainPrioritySetImpl::updateCrossPriorityHostMap, which matches the info in the profile.

For now, you could try the new sub_cluster_config: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/dynamic_forward_proxy/v3/dynamic_forward_proxy.proto#envoy-v3-api-field-extensions-filters-http-dynamic-forward-proxy-v3-filterconfig-sub-cluster-config It creates a strict_dns sub-cluster for each host, with a TTL, instead of using logical_dns.

Here is a simple example that works on my side.

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 7010
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: dynamic_forward_proxy_cluster
          http_filters:
          - name: envoy.filters.http.dynamic_forward_proxy
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
              sub_cluster_config:
                cluster_init_timeout: 10s
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: dynamic_forward_proxy_cluster
    connect_timeout: 1s
    lb_policy: CLUSTER_PROVIDED
    dns_resolvers:
    - socket_address:
        address: 127.0.0.1
        port_value: 5300
    cluster_type:
      name: envoy.clusters.dynamic_forward_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
        sub_clusters_config:
          lb_policy: round_robin
          max_sub_clusters: 100
          sub_cluster_ttl: 10s

Also, I'll try to create a PR to fix the leak in the logical_dns implementation.

rbtz-openai commented 11 months ago

Small update:

We had a seemingly unrelated issue with the brotli decompressor in our config that occasionally got into an infinite loop inside a wrk:worker_X thread. Removing the brotli decompression block [1] also somehow fixed the runaway memory usage spikes. There is still some gradual memory usage increase, but it is likely due to our high number of hosts.

(screenshot: memory usage after the change)

[1]

- name: envoy.filters.http.decompressor
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.decompressor.v3.Decompressor
    decompressor_library:
      name: default
      typed_config:
        "@type": "type.googleapis.com/envoy.extensions.compression.brotli.decompressor.v3.Brotli"
    request_direction_config:
      common_config:
        enabled:
          default_value: false
          runtime_key: request_decompressor_enabled

doujiang24 commented 11 months ago

Wow, thanks for your update; it seems the brotli decompression is the real cause in your case. I'll recheck the memory leak that I found in the DFP.

doujiang24 commented 10 months ago

Update: https://github.com/envoyproxy/envoy/pull/31433 fixes a memory leak that matches the profile.

doujiang24 commented 8 months ago

@rbtz-openai Could you please reconfirm that the DFP leak can be ignored in your production case? We are thinking of reverting the bugfix in https://github.com/envoyproxy/envoy/pull/31433.

rbtz-openai commented 8 months ago

Sorry for the late reply. Yeah, after removing brotli, the DFP leak, if it exists, is not measurable at our scale.

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 6 months ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.