aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service-to-service communication
Apache License 2.0

Feature Request: Health Check Listener Filter / Pass-through Configuration #105

Open bcelenza opened 5 years ago

bcelenza commented 5 years ago

Tell us about your request: With aws/aws-app-mesh-roadmap#28, App Mesh added the ability to configure health checking parameters for a given Virtual Node's listener. When configured, downstream (client) Envoys perform active health checking using the timing and thresholds provided. These health checks are currently passed through the Envoy to the application to handle, which ensures that the health status clients observe reflects the service behind the Envoy.

For larger-scale deployments, however, the health checking traffic can put too much load on the application behind the Envoy, and a significant share of the traffic the application receives ends up being health-check related.

Envoy allows for a health check filter to be added to the ingress listener on the Envoy that's receiving the health check traffic, and can be configured to intercept and respond to the health check directly, pass the traffic through to the application, or some combination of the two.
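To make the three behaviors concrete, here is a minimal toy model of such a filter. It is a sketch only: the class `HealthCheckFilter`, its fields `pass_through` and `cache_ttl`, and the helper `simulate` are hypothetical names for illustration, not Envoy's or App Mesh's actual configuration or API. It counts how many health checks actually reach the application when many client Envoys probe one instance.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class HealthCheckFilter:
    """Toy model of an ingress health check filter (hypothetical, not Envoy's API)."""

    app_check: Callable[[], bool]            # asks the application behind the proxy
    pass_through: bool = True                # False: the proxy answers by itself
    cache_ttl: float = 0.0                   # seconds to reuse the last app answer
    clock: Callable[[], float] = time.monotonic
    _cached: Optional[bool] = field(default=None, init=False)
    _cached_at: float = field(default=0.0, init=False)

    def handle(self) -> bool:
        if not self.pass_through:
            # Intercept: respond directly without consulting the application.
            return True
        now = self.clock()
        if self._cached is not None and now - self._cached_at < self.cache_ttl:
            # Combination: serve the cached application answer.
            return self._cached
        # Pass through: forward the check to the application and remember the result.
        self._cached = self.app_check()
        self._cached_at = now
        return self._cached


def simulate(clients: int, interval: int, duration: int, **filter_options) -> int:
    """Return how many checks reach the app when `clients` probe every `interval` seconds."""
    calls = 0
    now = 0

    def app_check() -> bool:
        nonlocal calls
        calls += 1
        return True

    hc = HealthCheckFilter(app_check=app_check, clock=lambda: now, **filter_options)
    for now in range(0, duration, interval):
        for _ in range(clients):
            hc.handle()
    return calls


if __name__ == "__main__":
    # Assumed numbers: 200 client Envoys, 5 s interval, 60 s window.
    print("pass-through:", simulate(200, 5, 60))
    print("cached 5 s:  ", simulate(200, 5, 60, cache_ttl=5))
    print("intercept:   ", simulate(200, 5, 60, pass_through=False))
```

In this model, caching keeps the application's own health signal while reducing its load from one request per client per interval to one request per cache window. Pure interception removes the load entirely, but the response then only shows that the proxy is reachable.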

This feature request is to add options to the Virtual Node health check API for configuring the filter behavior on the Envoy receiving the health check traffic.

Which integration(s) is this request for? All

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? While smaller installations of App Mesh and Envoy will work well with the default health checking capability, larger installations may incur performance challenges without the use of the filter.

Are you currently working around this issue? You can reduce the amount of health checking traffic between Envoys today by adjusting the interval at which the health check is performed, but this comes with the trade-off of potentially not having the most up-to-date health information of the upstream service.
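The trade-off in this workaround can be estimated with a rough model. The sketch below assumes that a client marks an upstream unhealthy after a fixed number of consecutive failed checks, so detection time is roughly the interval multiplied by that threshold. The function names and the example values (5 s vs. 30 s interval, threshold of 2) are illustrative assumptions, not values from App Mesh.

```python
def checks_per_minute(interval_seconds: float) -> float:
    """Health checks one client Envoy sends to one upstream instance per minute."""
    return 60.0 / interval_seconds


def approx_detection_seconds(interval_seconds: float, unhealthy_threshold: int) -> float:
    """Rough time to notice a failure: consecutive failed checks needed times the interval."""
    return interval_seconds * unhealthy_threshold


if __name__ == "__main__":
    # Assumed values for illustration only.
    for interval in (5, 30):
        print(
            f"interval={interval}s:",
            f"{checks_per_minute(interval):.0f} checks/min per client,",
            f"~{approx_detection_seconds(interval, 2):.0f}s to detect a failure",
        )
```

Under these assumptions, moving from a 5-second to a 30-second interval cuts the per-client load by a factor of 6, but the time to detect an unhealthy instance grows by the same factor.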

shubharao commented 5 years ago

We would like input on whether this is currently an issue, to help us prioritize it. If so, at what scale (number of services or number of Envoys)?

joshuabaird commented 5 years ago

Yes, this is an issue. It makes health checks in large deployments completely unusable. I don't yet know at what scale it actually becomes an issue.

rlafferty commented 5 years ago

Today we use ALB health checking, where I believe the number of health checks is driven by the number of AZs the ALB has hosts in. Even in us-east-1, if you are in 6 AZs, each container would get 6 health check requests every N seconds. In App Mesh, services that are listed as backends of MANY other services (think common services like an auth service or a user service) can have HUNDREDS of other containers with Envoy proxies that depend on them. This huge ramp-up in the volume of health checking traffic makes the current App Mesh health checks unusable for us.
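As a back-of-the-envelope illustration of that ramp-up, the sketch below compares the two cases. The 30-second interval and the figure of 300 dependent Envoys are assumed values standing in for "N seconds" and "hundreds", and `checks_per_second` is a hypothetical helper.

```python
def checks_per_second(checkers: int, interval_seconds: float) -> float:
    """Health check requests per second arriving at a single container."""
    return checkers / interval_seconds


if __name__ == "__main__":
    interval = 30.0          # assumed interval in seconds
    alb_checkers = 6         # one checker per AZ, as described above
    mesh_checkers = 300      # assumed count of dependent client Envoys

    alb_rate = checks_per_second(alb_checkers, interval)
    mesh_rate = checks_per_second(mesh_checkers, interval)
    print(f"ALB:  {alb_rate:.2f} checks/s per container")
    print(f"Mesh: {mesh_rate:.2f} checks/s per container")
    print(f"Ratio: {mesh_rate / alb_rate:.0f}x")
```

Because every client Envoy checks every backend instance independently, the rate per container scales linearly with the number of dependent proxies rather than with the number of AZs.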

egkelly commented 1 year ago

I can confirm this is an issue. Our applications are getting completely bogged down by the Envoy health check in tandem with the kubelet health check, which together generate an excessive amount of traffic. We really need the ability to configure this.