envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Healthcheck connection draining causing CPU spike #33566

Open someshsumantwilio opened 3 months ago

someshsumantwilio commented 3 months ago


Title: Healthcheck connection draining causing CPU spike

Description: We are observing CPU spikes in our application that last for a certain duration. A screenshot is attached.

We also observed that during the CPU spike, health-check connections are drained and recreated. Because the health check uses TLS connections, the handshakes needed to recreate the drained connections might be what causes the CPU spike. We verified that most of the CPU usage comes from the envoy process.

We checked the documentation for draining and found that draining can happen in the following scenarios:

  1. The server is being hot restarted.

  2. The server begins the graceful drain sequence via the drain_listeners?graceful admin endpoint.

  3. The server has been manually health check failed via the healthcheck/fail admin endpoint. See the health check filter architecture overview for more information.

  4. Individual listeners are being modified or removed via LDS.

We analyzed all of these scenarios and found that none of them is happening to Envoy. We are not sure why the healthcheck connections are draining.

  1. We verified that Envoy is not being restarted.
  2. We verified that the server did not start the drain sequence via the admin port, because the other listeners are not draining; only the healthcheck connections are.
  3. We verified that the healthcheck was not manually failed on the server.
  4. We verified that individual listeners are not being modified or removed, by comparing the Envoy start time with the last-updated time of the listener.
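
The four checks above can also be cross-checked against Envoy's own counters. The following is a minimal sketch, not a definitive diagnostic: it parses the plain-text output of the admin `/stats` endpoint and flags stats that would point at one of the documented drain triggers. The stat names are the ones I understand the Envoy statistics documentation to list and should be verified against the running version; the function names `parse_stats` and `drain_triggers` are hypothetical, and the stat prefix is taken from the `stat_prefix` of the healthcheck listener configuration posted later in this thread.

```python
from typing import Dict, List


def parse_stats(text: str) -> Dict[str, int]:
    """Parse plain-text admin /stats output ("name: value" lines) into a dict."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        value = value.strip()
        # Histogram and text-valued entries are skipped; only integers are kept.
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def drain_triggers(stats: Dict[str, int], hcm_prefix: str) -> List[str]:
    """Return human-readable findings for stats that suggest a drain trigger."""
    findings: List[str] = []
    # Scenario 1: a non-zero epoch means this process replaced an older one.
    if stats.get("server.hot_restart_epoch", 0) > 0:
        findings.append("server.hot_restart_epoch > 0 (hot restart happened)")
    # Scenarios 2 and 3 (assumption): state 1 is documented as "draining".
    if stats.get("server.state", 0) == 1:
        findings.append("server.state == 1 (server reports draining)")
    if stats.get("listener_manager.total_listeners_draining", 0) > 0:
        findings.append("listener_manager.total_listeners_draining > 0")
    # Scenario 4: listeners modified or removed through LDS.
    changed = (stats.get("listener_manager.listener_modified", 0)
               + stats.get("listener_manager.listener_removed", 0))
    if changed > 0:
        findings.append("listeners modified/removed via LDS: %d" % changed)
    # Connections the HTTP connection manager closed because of draining.
    key = "http.%s.downstream_cx_drain_close" % hcm_prefix
    if stats.get(key, 0) > 0:
        findings.append("%s = %d" % (key, stats[key]))
    return findings
```

To use it, save the output of the admin `/stats` endpoint to a file (the admin address depends on the deployment and is not given in this thread), then call `drain_triggers(parse_stats(text), "healthcheck-listener-stats")`. If only the `downstream_cx_drain_close` finding appears, the four documented triggers are unlikely to be the cause, which matches the observations above.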

None of the above seems to be the cause, so we do not know why the healthcheck connections are being drained.

We need help finding the root cause of the draining.

CPU Spike.

image

Draining healthcheck connection.

image

Active Healthcheck

image

Total requests to healthcheck.

image

Envoy CPU usage in individual host

image

Relevant Links: Draining

adisuissa commented 3 months ago

Thanks for reporting this! Do you happen to have a CPU profile when this is happening?

someshsumantwilio commented 2 months ago

@adisuissa Sorry, we do not have a CPU profile available. Please find below a screenshot of the CPU spike showing that envoy is using most of the CPU. We have also added more details to the description. We would like to understand why the healthcheck connections are drained and recreated in this case.

image

adisuissa commented 2 months ago

Can you please send the health-check configuration you are using? Also, would it be possible to look at the Envoy trace log from around the time of the issue?

I wonder if there's a max-connection duration knob somewhere that is set to a default value, but at the moment it is challenging to debug without more info.
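
As a rough way to look for such a knob, the sketch below walks a listener configuration (the `listener` object found under `active_state` in a config dump entry) and reports the connection-lifetime settings of each HTTP connection manager. It is an illustration under stated assumptions, not an authoritative list: the field names `common_http_protocol_options.max_connection_duration`, `common_http_protocol_options.idle_timeout`, `common_http_protocol_options.max_requests_per_connection`, and `drain_timeout` are the ones I believe the v3 HttpConnectionManager API defines, and the helper name `connection_lifetime_settings` is hypothetical.

```python
from typing import Any, Dict, List, Tuple

HCM_TYPE = ("type.googleapis.com/envoy.extensions.filters.network."
            "http_connection_manager.v3.HttpConnectionManager")


def connection_lifetime_settings(
        listener: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (stat_prefix, settings) for every HTTP connection manager.

    A value of None means the field is absent from the config, so whatever
    default Envoy documents for that field applies.
    """
    results: List[Tuple[str, Dict[str, Any]]] = []
    for chain in listener.get("filter_chains", []):
        for network_filter in chain.get("filters", []):
            cfg = network_filter.get("typed_config", {})
            if cfg.get("@type") != HCM_TYPE:
                continue  # not an HTTP connection manager
            common = cfg.get("common_http_protocol_options", {})
            results.append((cfg.get("stat_prefix", "?"), {
                "max_connection_duration": common.get("max_connection_duration"),
                "idle_timeout": common.get("idle_timeout"),
                "max_requests_per_connection":
                    common.get("max_requests_per_connection"),
                "drain_timeout": cfg.get("drain_timeout"),
            }))
    return results
```

Every field that comes back as `None` is left at its default, so those defaults are the ones to look up in the API reference for the Envoy version in use.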

someshsumantwilio commented 1 month ago

We are using the healthcheck configuration below, and we have attached the trace log file captured while the issue was happening.

envoy_trace.tar.gz


        {
          "name": "healthcheck-listener",
          "active_state": {
            "version_info": "HO9f3188be99cbbb0ed9fb440400e5f1ba-20240220.221537",
            "listener": {
              "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
              "name": "healthcheck-listener",
              "address": {
                "socket_address": {
                  "address": "0.0.0.0",
                  "port_value": 17006
                }
              },
              "filter_chains": [
                {
                  "filters": [
                    {
                      "name": "envoy.filters.network.http_connection_manager",
                      "typed_config": {
                        "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
                        "stat_prefix": "healthcheck-listener-stats",
                        "route_config": {
                          "name": "healthcheck-routes",
                          "virtual_hosts": [
                            {
                              "name": "healthcheck",
                              "domains": [
                                "*"
                              ],
                              "routes": [
                                {
                                  "match": {
                                    "path": "/voice-callmetadata-processor-opa"
                                  },
                                  "route": {
                                    "cluster": "healthcheck-voice-callmetadata-processor-opa",
                                    "prefix_rewrite": "/healthcheck",
                                    "timeout": "1s"
                                  }
                                },
                                {
                                  "match": {
                                    "path": "/voice-callmetadata-processor"
                                  },
                                  "route": {
                                    "cluster": "healthcheck-voice-callmetadata-processor",
                                    "prefix_rewrite": "/healthcheck",
                                    "timeout": "1s"
                                  }
                                }
                              ]
                            }
                          ]
                        },
                        "http_filters": [
                          {
                            "name": "envoy.health_check",
                            "typed_config": {
                              "@type": "type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck",
                              "pass_through_mode": true,
                              "cache_time": "1s"
                            }
                          },
                          {
                            "name": "envoy.filters.http.router",
                            "typed_config": {
                              "@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
                            }
                          }
                        ],
                        "http_protocol_options": {},
                        "access_log": [
                          {
                            "name": "envoy.file_access_log",
                            "typed_config": {
                              "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
                              "path": "/var/log/twilio/envoy/healthcheck_listener_access.log",
                              "log_format": {
                                "text_format_source": {
                                  "inline_string": "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" \"%REQ(T-Request-Id?I-Twilio-Request-Id):34%\" \"%UPSTREAM_CLUSTER%\"\n"
                                }
                              }
                            }
                          }
                        ]
                      }
                    }
                  ],
                  "transport_socket": {
                    "name": "tls",
                    "typed_config": {
                      "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext",
                      "common_tls_context": {
                        "tls_certificate_sds_secret_configs": [
                          {
                            "name": "spiffe://prod.svc.twilio.com/realm/us1/role/voice-callmetadata-processor",
                            "sds_config": {
                              "api_config_source": {
                                "api_type": "GRPC",
                                "grpc_services": [
                                  {
                                    "envoy_grpc": {
                                      "cluster_name": "spire_agent"
                                    }
                                  }
                                ],
                                "rate_limit_settings": {
                                  "max_tokens": 5,
                                  "fill_rate": 4
                                },
                                "transport_api_version": "V3"
                              }
                            }
                          }
                        ]
                      },
                      "require_client_certificate": false
                    }
                  }
                }
              ],
              "listener_filters": [
                {
                  "name": "envoy.filters.listener.tls_inspector",
                  "typed_config": {
                    "@type": "type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector"
                  }
                }
              ]
            },
            "last_updated": "2024-02-20T22:15:37.656Z"
          }
        }
someshsumantwilio commented 1 month ago

Is there any update on this issue?

someshsumantwilio commented 3 weeks ago

Is there any update on this issue?