envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
25k stars 4.81k forks source link

Envoy memory usage rises indefinitely when failing to fetch configuration via the xDS protocol #31561

Open radwaretaltr opened 10 months ago

radwaretaltr commented 10 months ago

Description: We use the Basic xDS protocol to deliver dynamic configuration to Envoy from a management server. In addition, we have a static configuration with one listener and one cluster. It seems that if Envoy fails to fetch the configuration from the management server, for example, when the management server is down, Envoy's memory usage rises indefinitely (approximately ~1Mb per minute). It only stops when Envoy is able to fetch the configuration.

Some observations we were able to make: 1) This only happens when Envoy is unable to fetch the configuration since startup. 2) Removing the static listener and cluster from the config seems to fix the issue (memory doesn't rise indefinitely).

Repro steps: 1) Deploy envoy with the static configuration specified in the configuration section. There is no need to setup an actual management server. 2) Monitor the memory usage, for example, via the /memory endpoint in the admin interface.

Config

admin:
  access_log:
  - typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /dev/null
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 19000

node:
  cluster: protected_app
  id: aaa

dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      envoy_grpc:
        cluster_name: cluster1
        authority: "auth"
  cds_config:
    ads: {}
    resource_api_version: V3
    initial_fetch_timeout: 0s
  lds_config:
    ads: {}
    resource_api_version: V3
    initial_fetch_timeout: 0s

static_resources:
  clusters:
  - name: cluster1
    connect_timeout: 10s
    type: "STRICT_DNS"
    dns_refresh_rate: 180s
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: cluster1
      endpoints:
      - lb_endpoints:
          endpoint:
            address:
              socket_address:
                address: "control-plane"
                port_value: 8080
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options:
            connection_keepalive:
              interval: 30s
              timeout: 5s
    upstream_connection_options:
      tcp_keepalive:
        keepalive_time: 300

  - name: a_metrics
    connect_timeout: 1s
    type: "static"
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: a_metrics
      endpoints:
      - lb_endpoints:
          endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 19000
  listeners:
    - name: lst-metrics
      address:
        socket_address:
          address: "0.0.0.0"
          port_value: 19010
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager
              typedConfig:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: auto
                stat_prefix: a_metrics
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend
                      domains:
                        - "*"
                      routes:
                        - match:
                            prefix: "/stats/prometheus"
                          route:
                            cluster: a_metrics
                        - match:
                            prefix: "/metrics"
                          route:
                            regex_rewrite:
                              pattern:
                                regex: "metrics"
                              substitution: "stats/prometheus?text_readouts"
                            cluster: a_metrics
                      response_headers_to_remove: ["Server"]
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                server_header_transformation: PASS_THROUGH
Logs ``` ':status', '200' 'content-type', 'application/grpc' 'grpc-status', '14' 'grpc-message', 'no healthy upstream' [2023-12-31 15:54:26.116][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] StreamAggregatedResources gRPC config stream to cluster1 closed since 6654s ago: 14, no healthy upstream [2023-12-31 15:54:26.116][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:54:26.116][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:60] Unable to establish new stream to configuration server cluster1 [2023-12-31 15:54:26.116][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:54:26.573][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:54:26.573][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:54:31.573][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:54:31.573][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:54:31.786][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:365] dns resolution for control-plane started [2023-12-31 15:54:31.794][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for control-plane failed with c-ares status 11 [2023-12-31 15:54:31.799][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for control-plane failed with c-ares status 11 [2023-12-31 15:54:31.799][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:285] dns resolution for control-plane completed with status 1 [2023-12-31 15:54:31.799][1][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:184] DNS refresh rate reset for control-plane, (failure) refresh rate 180000 ms [2023-12-31 15:54:33.622][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:51] Establishing new gRPC bidi stream to cluster1 for rpc StreamAggregatedResources(stream .envoy.service.discovery.v3.DiscoveryRequest) returns (stream .envoy.service.discovery.v3.DiscoveryResponse); [2023-12-31 15:54:33.623][1][debug][router] [source/common/router/router.cc:520] [Tags: "ConnectionId":"0","StreamId":"16089286589495779457"] cluster 'cluster1' match for URL '/envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources' [2023-12-31 15:54:33.623][1][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1942] no healthy host for HTTP connection pool [2023-12-31 15:54:33.623][1][debug][http] [source/common/http/async_client_impl.cc:106] async http request response headers (end_stream=true): ':status', '200' 'content-type', 'application/grpc' 'grpc-status', '14' 'grpc-message', 'no healthy upstream' [2023-12-31 15:54:33.623][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] StreamAggregatedResources gRPC config stream to cluster1 closed since 6662s ago: 14, no healthy upstream [2023-12-31 15:54:33.623][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:54:33.623][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:60] Unable to establish new stream to configuration server cluster1 [2023-12-31 15:54:33.623][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:54:36.573][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:54:36.573][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:54:41.576][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:54:41.576][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:54:44.949][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:51] Establishing new gRPC bidi stream to cluster1 for rpc StreamAggregatedResources(stream .envoy.service.discovery.v3.DiscoveryRequest) returns (stream .envoy.service.discovery.v3.DiscoveryResponse); [2023-12-31 15:54:44.950][1][debug][router] [source/common/router/router.cc:520] [Tags: "ConnectionId":"0","StreamId":"2604332730402544397"] cluster 'cluster1' match for URL '/envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources' [2023-12-31 15:54:44.950][1][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1942] no healthy host for HTTP connection pool [2023-12-31 15:54:44.950][1][debug][http] [source/common/http/async_client_impl.cc:106] async http request response headers (end_stream=true): ':status', '200' 'content-type', 'application/grpc' 'grpc-status', '14' 'grpc-message', 'no healthy upstream' [2023-12-31 15:54:44.950][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] StreamAggregatedResources gRPC config stream to cluster1 closed since 6673s ago: 14, no healthy upstream [2023-12-31 15:54:44.950][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:54:44.950][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:60] Unable to establish new stream to configuration server cluster1 [2023-12-31 15:54:44.950][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:54:46.577][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:54:46.577][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:54:51.574][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:54:51.575][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:54:56.575][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:54:56.575][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:55:01.574][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:55:01.574][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:55:06.577][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:55:06.577][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:55:09.067][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:51] Establishing new gRPC bidi stream to cluster1 for rpc StreamAggregatedResources(stream .envoy.service.discovery.v3.DiscoveryRequest) returns (stream .envoy.service.discovery.v3.DiscoveryResponse); [2023-12-31 15:55:09.067][1][debug][router] [source/common/router/router.cc:520] [Tags: "ConnectionId":"0","StreamId":"15864330249036367919"] cluster 'cluster1' match for URL '/envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources' [2023-12-31 15:55:09.067][1][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1942] no healthy host for HTTP connection pool [2023-12-31 15:55:09.067][1][debug][http] [source/common/http/async_client_impl.cc:106] async http request response headers (end_stream=true): ':status', '200' 'content-type', 'application/grpc' 'grpc-status', '14' 'grpc-message', 'no healthy upstream' [2023-12-31 15:55:09.067][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] StreamAggregatedResources gRPC config stream to cluster1 closed since 6697s ago: 14, no healthy upstream [2023-12-31 15:55:09.067][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:55:09.067][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:60] Unable to establish new stream to configuration server cluster1 [2023-12-31 15:55:09.067][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:55:11.574][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:55:11.574][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:55:16.575][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:55:16.575][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:55:21.577][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:55:21.577][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:55:23.860][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:51] Establishing new gRPC bidi stream to cluster1 for rpc StreamAggregatedResources(stream .envoy.service.discovery.v3.DiscoveryRequest) returns (stream .envoy.service.discovery.v3.DiscoveryResponse); [2023-12-31 15:55:23.860][1][debug][router] [source/common/router/router.cc:520] [Tags: "ConnectionId":"0","StreamId":"11845650912692327212"] cluster 'cluster1' match for URL '/envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources' [2023-12-31 15:55:23.860][1][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1942] no healthy host for HTTP connection pool [2023-12-31 15:55:23.860][1][debug][http] [source/common/http/async_client_impl.cc:106] async http request response headers (end_stream=true): ':status', '200' 'content-type', 'application/grpc' 'grpc-status', '14' 'grpc-message', 'no healthy upstream' [2023-12-31 15:55:23.860][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] StreamAggregatedResources gRPC config stream to cluster1 closed since 6712s ago: 14, no healthy upstream [2023-12-31 15:55:23.860][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:55:23.860][1][debug][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:60] Unable to establish new stream to configuration server cluster1 [2023-12-31 15:55:23.860][1][debug][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:125] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed [2023-12-31 15:55:26.574][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:55:26.574][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:55:31.574][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:55:31.574][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats [2023-12-31 15:55:36.576][1][debug][main] [source/server/server.cc:263] flushing stats [2023-12-31 15:55:36.576][1][debug][main] [source/server/server.cc:273] Envoy is not fully initialized, skipping histogram merge and flushing stats ```
adisuissa commented 10 months ago

Thanks for reporting this! Do you happen to have flame graph showing memory usage after it is run for some time while failing to fetch a configuration?

radwaretaltr commented 10 months ago

Thanks for reporting this! Do you happen to have flame graph showing memory usage after it is run for some time while failing to fetch a configuration?

Hi, thanks for responding. No, we don't have one.

cgthayer commented 9 months ago

I let this run overnight and got this profile. I don't have experience reading these yet, but I've attached the overall shape and a zoom on the red parts. The last call leading into the red part is TimerImpl, so I wonder if some timers are created but never closed for each attempt to fetch the config --I'm not familiar with the code yet.

fwiw, I do something like this. You'll need to run pprof to have a good look at the data.

curl 127.0.0.1:19000/heap_dump -o memory-bug.heap
pprof -http localhost:9999 memory-bug.heap

mem-leak-profile.pb.gz mem-overview-shape mem-red-zoom memory-bug.heap.gz

memory-leak-test.yaml.txt --remove the .txt that's a github weirdness

adisuissa commented 9 months ago

Seems that The TlsCachingDateProviderImpl creates objects in the thread local storage (https://github.com/envoyproxy/envoy/blob/4265b99cfb42516296d95c63da9b7e02990264af/source/common/http/date_provider_impl.cc#L27) that are never deallocated when the xDS server is unavailable. I'm not sure why the objects are not freed, and further investigation is required.

cgthayer commented 9 months ago

Thanks! I haven't seen one of these with the dotted arrow going up, but if the TimerImpl is properly cleaned up (on xDS fail) that should clean up the TlsCachingDateProviderImpl too, correct?

weissne commented 8 months ago

@cgthayer & @adisuissa - Thanks for investigating this. Are you aware of any plans for fixing this?

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

radwaretaltr commented 7 months ago

Hi, thanks. Any plans on fixing this?

adisuissa commented 7 months ago

At the moment my bandwidth is limited. Setting it as no-stale.