Open someshsumantwilio opened 3 months ago
Thanks for reporting this! Do you happen to have a CPU profile when this is happening?
@adisuissa Sorry, we do not have a CPU profile available. Please see the screenshot of the CPU spike, which shows Envoy consuming most of the CPU. We have also added more details to the description. We would like to understand why the health-check connections are drained and recreated in this case.
Can you please send the health-check configuration you are using? Also, would it be possible to look at the Envoy trace log from around the time of the issue?
I wonder if there's a max-connection duration knob somewhere that is set to a default value, but at the moment it is challenging to debug without more info.
We are using the health-check configuration below, and we have attached the trace log file captured while the issue was happening.
{
  "name": "healthcheck-listener",
  "active_state": {
    "version_info": "HO9f3188be99cbbb0ed9fb440400e5f1ba-20240220.221537",
    "listener": {
      "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
      "name": "healthcheck-listener",
      "address": {
        "socket_address": {
          "address": "0.0.0.0",
          "port_value": 17006
        }
      },
      "filter_chains": [
        {
          "filters": [
            {
              "name": "envoy.filters.network.http_connection_manager",
              "typed_config": {
                "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
                "stat_prefix": "healthcheck-listener-stats",
                "route_config": {
                  "name": "healthcheck-routes",
                  "virtual_hosts": [
                    {
                      "name": "healthcheck",
                      "domains": [
                        "*"
                      ],
                      "routes": [
                        {
                          "match": {
                            "path": "/voice-callmetadata-processor-opa"
                          },
                          "route": {
                            "cluster": "healthcheck-voice-callmetadata-processor-opa",
                            "prefix_rewrite": "/healthcheck",
                            "timeout": "1s"
                          }
                        },
                        {
                          "match": {
                            "path": "/voice-callmetadata-processor"
                          },
                          "route": {
                            "cluster": "healthcheck-voice-callmetadata-processor",
                            "prefix_rewrite": "/healthcheck",
                            "timeout": "1s"
                          }
                        }
                      ]
                    }
                  ]
                },
                "http_filters": [
                  {
                    "name": "envoy.health_check",
                    "typed_config": {
                      "@type": "type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck",
                      "pass_through_mode": true,
                      "cache_time": "1s"
                    }
                  },
                  {
                    "name": "envoy.filters.http.router",
                    "typed_config": {
                      "@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
                    }
                  }
                ],
                "http_protocol_options": {},
                "access_log": [
                  {
                    "name": "envoy.file_access_log",
                    "typed_config": {
                      "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
                      "path": "/var/log/twilio/envoy/healthcheck_listener_access.log",
                      "log_format": {
                        "text_format_source": {
                          "inline_string": "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" \"%REQ(T-Request-Id?I-Twilio-Request-Id):34%\" \"%UPSTREAM_CLUSTER%\"\n"
                        }
                      }
                    }
                  }
                ]
              }
            }
          ],
          "transport_socket": {
            "name": "tls",
            "typed_config": {
              "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext",
              "common_tls_context": {
                "tls_certificate_sds_secret_configs": [
                  {
                    "name": "spiffe://prod.svc.twilio.com/realm/us1/role/voice-callmetadata-processor",
                    "sds_config": {
                      "api_config_source": {
                        "api_type": "GRPC",
                        "grpc_services": [
                          {
                            "envoy_grpc": {
                              "cluster_name": "spire_agent"
                            }
                          }
                        ],
                        "rate_limit_settings": {
                          "max_tokens": 5,
                          "fill_rate": 4
                        },
                        "transport_api_version": "V3"
                      }
                    }
                  }
                ]
              },
              "require_client_certificate": false
            }
          }
        }
      ],
      "listener_filters": [
        {
          "name": "envoy.filters.listener.tls_inspector",
          "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector"
          }
        }
      ]
    },
    "last_updated": "2024-02-20T22:15:37.656Z"
  }
},
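Since a max-connection-duration knob was raised as a possibility, here is a minimal Python sketch for checking which connection-lifetime fields a listener dump like the one above sets explicitly. The field names (`common_http_protocol_options.idle_timeout`, `max_connection_duration`, `max_requests_per_connection`, `drain_timeout`, `delayed_close_timeout`) are taken from the HttpConnectionManager v3 API as I understand it; the helper names `find_hcm_configs`, `lookup`, and `lifetime_settings` are hypothetical and not part of Envoy.

```python
import json

HCM_TYPE = ("type.googleapis.com/envoy.extensions.filters.network."
            "http_connection_manager.v3.HttpConnectionManager")

# Connection-lifetime fields of HttpConnectionManager, as dotted paths.
# Assumption: this list is illustrative, not exhaustive.
LIFETIME_FIELDS = [
    "common_http_protocol_options.idle_timeout",
    "common_http_protocol_options.max_connection_duration",
    "common_http_protocol_options.max_requests_per_connection",
    "drain_timeout",
    "delayed_close_timeout",
]


def find_hcm_configs(node):
    """Yield every HttpConnectionManager typed_config in a parsed config."""
    if isinstance(node, dict):
        if node.get("@type") == HCM_TYPE:
            yield node
        for value in node.values():
            yield from find_hcm_configs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_hcm_configs(item)


def lookup(config, dotted_path):
    """Return the value at a dotted path, or None if any segment is absent."""
    current = config
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def lifetime_settings(listener_json):
    """Map stat_prefix -> {field: value or None} for each HCM found."""
    parsed = json.loads(listener_json)
    return {
        hcm.get("stat_prefix", "<no stat_prefix>"): {
            field: lookup(hcm, field) for field in LIFETIME_FIELDS
        }
        for hcm in find_hcm_configs(parsed)
    }
```

For the listener shown above, none of these fields is present, so every value would come back as `None` and Envoy's defaults would apply. As far as I know, the documented default for `idle_timeout` is one hour and `max_connection_duration` is unlimited when unset, but that is worth confirming against the documentation for the Envoy version in use.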
Is there any update on this issue?
Title: Healthcheck connection draining causing CPU spike
Description: We are observing a CPU spike in our application that lasts for a certain duration (screenshot attached).
We also observed that, during the CPU spike, health-check connections are drained and recreated. Because the health checks use TLS connections, the handshakes performed when connections are drained and recreated might be what causes the CPU spike. We verified that most of the CPU usage comes from the Envoy process.
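One way to test the TLS-handshake hypothesis without a CPU profile is to diff two snapshots of the admin `/stats` text output taken a known interval apart and look at which connection counters are growing. The sketch below is only an illustration: the helper names `parse_stats` and `counter_rates` are hypothetical, and the counter suffixes are the listener and HTTP connection manager stats I believe Envoy exposes (`ssl.handshake`, `downstream_cx_total`, `downstream_cx_drain_close`, `downstream_cx_idle_timeout`, `downstream_cx_max_duration_reached`, `downstream_cx_destroy_remote`, `downstream_cx_destroy_local`). Please check them against your version's stats output.

```python
# Counter-name suffixes to watch. Assumption: these match the stat names
# emitted by the Envoy version in use.
WATCHED_SUFFIXES = (
    "ssl.handshake",
    "downstream_cx_total",
    "downstream_cx_drain_close",
    "downstream_cx_idle_timeout",
    "downstream_cx_max_duration_reached",
    "downstream_cx_destroy_remote",
    "downstream_cx_destroy_local",
)


def parse_stats(text):
    """Parse admin /stats text output ("name: value" lines) into ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(": ")
        if not sep:
            continue  # not a "name: value" line
        try:
            stats[name] = int(value)
        except ValueError:
            continue  # histograms and text-valued stats are skipped
    return stats


def counter_rates(before, after, interval_s, suffixes=WATCHED_SUFFIXES):
    """Return per-second growth for watched counters that increased."""
    rates = {}
    for name, new_value in after.items():
        if name.endswith(suffixes) and name in before:
            delta = new_value - before[name]
            if delta > 0:
                rates[name] = delta / interval_s
    return rates
```

If `ssl.handshake` on the health-check listener grows at roughly the same rate as `downstream_cx_total` during the spike, that would be consistent with connections being re-established rather than reused, and the close-reason counters would hint at which side is closing them.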
We checked the draining documentation, which lists the following scenarios in which draining can happen:
The server is being hot restarted.
The server begins the graceful drain sequence via the drain_listeners?graceful admin endpoint.
The server has been manually health check failed via the healthcheck/fail admin endpoint. See the health check filter architecture overview for more information.
Individual listeners are being modified or removed via LDS.
We analyzed all of these scenarios and found that none of them is happening on our Envoy instances, so we are not sure why the health-check connections are being drained.
We need help finding the root cause of the draining.
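To double-check the scenarios listed above against what Envoy itself reports, two snapshots of `/stats?format=json` taken before and during a spike can be compared. This is a rough sketch under assumptions: the function names `load_stats` and `drain_signals` are hypothetical, the stat names (`server.hot_restart_epoch`, `server.state`, `listener_manager.listener_modified`, `listener_manager.listener_removed`, `listener_manager.total_listeners_draining`) are the ones I believe Envoy documents, and `server.state == 1` is assumed to mean DRAINING.

```python
import json


def load_stats(json_text):
    """Flatten /stats?format=json output into {name: value}."""
    entries = json.loads(json_text).get("stats", [])
    # Histogram entries carry no "name"/"value" pair and are skipped.
    return {e["name"]: e["value"] for e in entries
            if "name" in e and "value" in e}


def drain_signals(before, after):
    """Report evidence for each documented drain trigger between snapshots."""
    def grew(name):
        return after.get(name, 0) > before.get(name, 0)

    return {
        # A higher epoch suggests the process was hot restarted.
        "hot_restart": grew("server.hot_restart_epoch"),
        # Assumption: state 1 corresponds to DRAINING.
        "server_draining": after.get("server.state") == 1,
        # LDS updates that modified or removed listeners.
        "listener_modified_or_removed": (
            grew("listener_manager.listener_modified")
            or grew("listener_manager.listener_removed")
        ),
        # Listeners currently in the draining state.
        "listeners_draining_now":
            after.get("listener_manager.total_listeners_draining", 0) > 0,
    }
```

If every signal stays `False` while connections are still being closed, the closes are probably not coming from listener or server draining, and per-connection causes such as idle timeouts or the peer closing the connection would be the next thing to look at.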
Attached screenshots: CPU spike; draining health-check connections; active health checks; total requests to the health check; Envoy CPU usage on an individual host.
Relevant Links: Draining