envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Healthcheck draining causing CPU spike #36967

Open someshsumantwilio opened 2 weeks ago

someshsumantwilio commented 2 weeks ago

Title: Healthcheck draining causing CPU spike

Description:

We are seeing that health check draining causes CPU spikes in our application. We have observed this issue on a couple of our systems. We are not sure why health check draining is happening. We previously reported this issue (https://github.com/envoyproxy/envoy/issues/33566) but have not received a resolution.

We analyzed the Envoy trace log (envoy_trace.txt) and the Envoy access log (envoy_access.txt) collected from a host that was affected by health check draining, but we could not find the root cause of the draining.

We need help from the community to figure out why connection draining is happening.

Below are some dashboards that show a strong correlation between health check draining and the CPU spikes.

Picture 1 shows a CPU spike during the same time frame in which connection draining is happening. Picture 2 shows the connection draining that coincides with the CPU spike. Picture 3 shows active health check connections dropping due to draining.

We have

Picture 1

image

Picture 2

image

Picture 3

image

Below is the healthcheck configuration.

Listener Information

{
   "name":"healthcheck-listener",
   "active_state":{
      "version_info":"HO1580d0ef70e25f332ca1936074f656de-20241104.081745",
      "listener":{
         "@type":"type.googleapis.com/envoy.config.listener.v3.Listener",
         "name":"healthcheck-listener",
         "address":{
            "socket_address":{
               "address":"0.0.0.0",
               "port_value":17006
            }
         },
         "filter_chains":[
            {
               "filters":[
                  {
                     "name":"envoy.filters.network.http_connection_manager",
                     "typed_config":{
                        "@type":"type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
                        "stat_prefix":"healthcheck-listener-stats",
                        "route_config":{
                           "name":"healthcheck-routes",
                           "virtual_hosts":[
                              {
                                 "name":"healthcheck",
                                 "domains":[
                                    "*"
                                 ],
                                 "routes":[
                                    {
                                       "match":{
                                          "path":"/accounts-api-opa"
                                       },
                                       "route":{
                                          "cluster":"healthcheck-accounts-api-opa",
                                          "prefix_rewrite":"/healthcheck",
                                          "timeout":"3s"
                                       }
                                    },
                                    {
                                       "match":{
                                          "path":"/accounts-api"
                                       },
                                       "route":{
                                          "cluster":"healthcheck-accounts-api",
                                          "prefix_rewrite":"/healthcheck",
                                          "timeout":"1s"
                                       }
                                    },
                                    {
                                       "match":{
                                          "path":"/accounts-api.analytics"
                                       },
                                       "route":{
                                          "cluster":"healthcheck-accounts-api.analytics",
                                          "prefix_rewrite":"/",
                                          "timeout":"1s"
                                       }
                                    },
                                    {
                                       "match":{
                                          "path":"/accounts-api.xrp"
                                       },
                                       "route":{
                                          "cluster":"healthcheck-accounts-api.xrp",
                                          "prefix_rewrite":"/healthcheck",
                                          "timeout":"1s"
                                       }
                                    }
                                 ]
                              }
                           ]
                        },
                        "http_filters":[
                           {
                              "name":"envoy.health_check",
                              "typed_config":{
                                 "@type":"type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck",
                                 "pass_through_mode":true,
                                 "cache_time":"1s"
                              }
                           },
                           {
                              "name":"envoy.filters.http.router",
                              "typed_config":{
                                 "@type":"type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
                              }
                           }
                        ],
                        "http_protocol_options":{

                        },
                        "access_log":[
                           {
                              "name":"envoy.file_access_log",
                              "typed_config":{
                                 "@type":"type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
                                 "path":"/var/log/twilio/envoy/healthcheck_listener_access.log",
                                 "log_format":{
                                    "text_format_source":{
                                       "inline_string":"[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\" \"%REQ(T-Request-Id?I-Twilio-Request-Id):34%\" \"%UPSTREAM_CLUSTER%\"\n"
                                    }
                                 }
                              }
                           }
                        ]
                     }
                  }
               ],
               "transport_socket":{
                  "name":"tls",
                  "typed_config":{
                     "@type":"type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext",
                     "common_tls_context":{
                        "tls_certificate_sds_secret_configs":[
                           {
                              "name":"spiffe://prod.svc.twilio.com/realm/us1/role/accounts-api",
                              "sds_config":{
                                 "api_config_source":{
                                    "api_type":"GRPC",
                                    "grpc_services":[
                                       {
                                          "envoy_grpc":{
                                             "cluster_name":"spire_agent"
                                          }
                                       }
                                    ],
                                    "rate_limit_settings":{
                                       "max_tokens":5,
                                       "fill_rate":4
                                    },
                                    "transport_api_version":"V3"
                                 }
                              }
                           }
                        ]
                     },
                     "require_client_certificate":false
                  }
               }
            }
         ],
         "listener_filters":[
            {
               "name":"envoy.filters.listener.tls_inspector",
               "typed_config":{
                  "@type":"type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector"
               }
            }
         ]
      },
      "last_updated":"2024-11-04T08:18:01.840Z"
   }
},

Cluster Information:

{
   "version_info":"HO1580d0ef70e25f332ca1936074f656de-20241104.081745",
   "cluster":{
      "@type":"type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name":"healthcheck-accounts-api-opa",
      "type":"STATIC",
      "connect_timeout":"6s",
      "metadata":{
         "filter_metadata":{
            "com.twilio.logging":{

            }
         }
      },
      "alt_stat_name":"accounts-api-opa.service=accounts-api-opa,role=accounts-api,realm=us1,stack=default,type=healthcheck",
      "load_assignment":{
         "cluster_name":"healthcheck-accounts-api-opa",
         "endpoints":[
            {
               "lb_endpoints":[
                  {
                     "endpoint":{
                        "address":{
                           "socket_address":{
                              "address":"127.0.0.1",
                              "port_value":9651
                           }
                        }
                     }
                  }
               ]
            }
         ]
      },
      "typed_extension_protocol_options":{
         "envoy.extensions.upstreams.http.v3.HttpProtocolOptions":{
            "@type":"type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
            "explicit_http_config":{
               "http_protocol_options":{

               }
            }
         }
      }
   },
   "last_updated":"2024-11-04T08:17:45.733Z"
},
{
   "version_info":"HO1580d0ef70e25f332ca1936074f656de-20241104.081745",
   "cluster":{
      "@type":"type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name":"healthcheck-accounts-api",
      "type":"STATIC",
      "connect_timeout":"2s",
      "metadata":{
         "filter_metadata":{
            "com.twilio.logging":{

            }
         }
      },
      "alt_stat_name":"accounts-api.service=accounts-api,role=accounts-api,realm=us1,stack=default,type=healthcheck",
      "load_assignment":{
         "cluster_name":"healthcheck-accounts-api",
         "endpoints":[
            {
               "lb_endpoints":[
                  {
                     "endpoint":{
                        "address":{
                           "socket_address":{
                              "address":"127.0.0.1",
                              "port_value":9651
                           }
                        }
                     }
                  }
               ]
            }
         ]
      },
      "typed_extension_protocol_options":{
         "envoy.extensions.upstreams.http.v3.HttpProtocolOptions":{
            "@type":"type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
            "explicit_http_config":{
               "http_protocol_options":{

               }
            }
         }
      }
   },
   "last_updated":"2024-11-04T08:17:45.749Z"
},
{
   "version_info":"HO1580d0ef70e25f332ca1936074f656de-20241104.081745",
   "cluster":{
      "@type":"type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name":"healthcheck-accounts-api.analytics",
      "type":"STATIC",
      "connect_timeout":"2s",
      "metadata":{
         "filter_metadata":{
            "com.twilio.logging":{

            }
         }
      },
      "alt_stat_name":"accounts-api_analytics.service=accounts-api.analytics,role=accounts-api,realm=us1,stack=default,type=healthcheck",
      "load_assignment":{
         "cluster_name":"healthcheck-accounts-api.analytics",
         "endpoints":[
            {
               "lb_endpoints":[
                  {
                     "endpoint":{
                        "address":{
                           "socket_address":{
                              "address":"127.0.0.1",
                              "port_value":9651
                           }
                        }
                     }
                  }
               ]
            }
         ]
      },
      "typed_extension_protocol_options":{
         "envoy.extensions.upstreams.http.v3.HttpProtocolOptions":{
            "@type":"type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
            "common_http_protocol_options":{
               "idle_timeout":"301s"
            },
            "explicit_http_config":{
               "http_protocol_options":{

               }
            }
         }
      }
   },
   "last_updated":"2024-11-04T08:17:45.768Z"
},
{
   "version_info":"HO1580d0ef70e25f332ca1936074f656de-20241104.081745",
   "cluster":{
      "@type":"type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name":"healthcheck-accounts-api.xrp",
      "type":"STATIC",
      "connect_timeout":"5s",
      "metadata":{
         "filter_metadata":{
            "com.twilio.logging":{

            }
         }
      },
      "alt_stat_name":"accounts-api_xrp.service=accounts-api.xrp,role=accounts-api,realm=us1,stack=default,type=healthcheck",
      "load_assignment":{
         "cluster_name":"healthcheck-accounts-api.xrp",
         "endpoints":[
            {
               "lb_endpoints":[
                  {
                     "endpoint":{
                        "address":{
                           "socket_address":{
                              "address":"127.0.0.1",
                              "port_value":9651
                           }
                        }
                     }
                  }
               ]
            }
         ]
      },
      "typed_extension_protocol_options":{
         "envoy.extensions.upstreams.http.v3.HttpProtocolOptions":{
            "@type":"type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
            "explicit_http_config":{
               "http_protocol_options":{

               }
            }
         }
      }
   },
   "last_updated":"2024-11-04T08:17:45.788Z"
},


KBaichoo commented 2 weeks ago

It seems to me that the CPU spikes occur during the drain period. If you look at other connection stats, such as total connections and destroyed connections (possibly the listener stats, which are recorded earlier: https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/stats), what do you see?
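One way to gather those numbers is to take two snapshots of the admin `/stats` output around a spike and diff the connection counters. The following is only a minimal sketch: the admin address `http://127.0.0.1:9901` is an assumption (it is not shown in the config above), the helper names are hypothetical, and the exact stat prefix for the listener (for example `listener.0.0.0.0_17006.`) should be confirmed against the real `/stats` output.

```python
import re
import urllib.request

# Matches plain counter/gauge lines such as "listener.foo.downstream_cx_total: 42".
STAT_LINE = re.compile(r"^(?P<name>\S+): (?P<value>\d+)$")

# Connection stats of interest; names taken from the Envoy listener and
# HTTP connection manager stats documentation.
CX_SUFFIXES = (
    "downstream_cx_total",
    "downstream_cx_destroy",
    "downstream_cx_active",
    "downstream_cx_drain_close",
)


def parse_stats(text):
    """Parse plain-text `/stats` output into {name: int}.

    Lines whose value is not a plain integer (e.g. histograms) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match.group("name")] = int(match.group("value"))
    return stats


def connection_deltas(before, after, suffixes=CX_SUFFIXES):
    """Return the change in each connection stat between two snapshots."""
    deltas = {}
    for name, value in after.items():
        if name.endswith(suffixes):
            deltas[name] = value - before.get(name, 0)
    return deltas


def fetch_stats(admin_url="http://127.0.0.1:9901"):
    """Fetch and parse stats from the admin endpoint (address is an assumption)."""
    with urllib.request.urlopen(admin_url + "/stats", timeout=5) as resp:
        return parse_stats(resp.read().decode("utf-8"))
```

Calling `fetch_stats()` once before and once during a drain period, then passing both results to `connection_deltas`, would show whether connections are being created and destroyed at an unusually high rate while the CPU spikes.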

cc @adisuissa as codeowner for http health_check filter.

someshsumantwilio commented 2 weeks ago

Below are the metrics we see.

image image image image image