aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service to service communication
Apache License 2.0

Load balancing ecs service multiple tasks with app mesh #70

Closed shashanktomar closed 4 years ago

shashanktomar commented 5 years ago

We are trying out app-mesh. Our solution looks like the following:

Mesh Ingress (ALB) -> Service A -> Service B -> Mesh Egress -> External Service

In this setup, Service A and Service B are running on Fargate with a desired count of 2 each. These ECS services have Cloud Map configuration enabled. We have verified that traffic flows through the Envoy sidecar for both Service A and Service B.

When we hit the ALB for ingress, the ALB load-balances traffic across Service A's tasks, but traffic from Service A always hits the same single task in Service B. We are unable to load-balance traffic from Service A's Envoy proxy to Service B. Following is our understanding of the problem so far:

It would be helpful to understand why it is not working for us, and also to get some idea of the recommended practices around this pattern.

shashanktomar commented 5 years ago

Can someone please confirm that this relates to #47?

lavignes commented 5 years ago

@shashanktomar Your observation is correct. It seems that because App Mesh configures the cluster with the LOGICAL_DNS discovery type, Envoy only routes traffic to the first IP returned by Route 53's DNS.

#47 is not intended to address this behavior directly, but I believe that once App Mesh integrates with Cloud Map's service discovery (via Envoy's EDS discovery type) rather than LOGICAL_DNS, Envoy would be able to load-balance across each IP independently.

I'll discuss this with the team tomorrow, because I'm unsure if this is intended behavior.

bcelenza commented 5 years ago

Route 53 DNS does round-robin its results. Is it possible that what you're seeing is the result of low load from Service A to Service B? One possible explanation is that the first connection added to the pool in Service A's Envoy is serving all requests to Service B, so requests always hit the same Service B task.

With some concurrent load, Service A's Envoy would need to create a new connection, which would begin load balancing to both tasks for Service B.

shashanktomar commented 5 years ago

Thanks for the clarification @bcelenza. I will verify this behavior and get back to you.

shashanktomar commented 5 years ago

@bcelenza, that was precisely the reason. Increasing the load and sending concurrent requests opened connections to the second task. Maybe it's worth mentioning in the docs that the load-balancing behavior is not per-request round-robin but spillover, even though it's an Envoy-internal concept.

shreedharn commented 5 years ago

We are evaluating a similar setup, but with an HTTP namespace. There are no DNS A records or Route 53 health checks. We rely on container-level health checks managed by ECS and HealthCheckCustomConfig. Since there is no Route 53 in this scenario, we assume that Envoy will route traffic to healthy nodes discovered with the Cloud Map discover-instances API call. If there are multiple healthy nodes registered to a Cloud Map service, the order of instances in the discover-instances API response appears random. Is there any load balancing done by Envoy, or is the choice of upstream node random? With App Mesh configured and the sidecar deployed, preliminary test results show that backend nodes are picked randomly for the upstream request (Service A --> Service B with multiple Service B instances). It would be good to have the expected behavior documented.
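For anyone who wants to observe the ordering of the discover-instances response themselves, here is a minimal sketch. It assumes a boto3 `servicediscovery` client; the function name `healthy_instance_ips` and the namespace and service names in the usage comment are hypothetical, and the sketch says nothing about how App Mesh or Envoy consume the response.

```python
def healthy_instance_ips(client, namespace_name, service_name):
    """Return the IPv4 addresses of healthy instances, in response order.

    `client` is expected to behave like boto3.client("servicediscovery").
    """
    response = client.discover_instances(
        NamespaceName=namespace_name,
        ServiceName=service_name,
        HealthStatus="HEALTHY",  # only instances currently reported healthy
    )
    ips = []
    for instance in response.get("Instances", []):
        # Cloud Map stores the instance IP under this attribute key.
        ip = instance.get("Attributes", {}).get("AWS_INSTANCE_IPV4")
        if ip:
            ips.append(ip)
    return ips


# Usage (requires AWS credentials; names below are placeholders):
#   import boto3
#   client = boto3.client("servicediscovery")
#   for _ in range(5):
#       print(healthy_instance_ips(client, "my-namespace", "service-b"))
```

Calling it a few times in a row and comparing the printed lists is enough to see whether the order changes between calls.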

lavignes commented 5 years ago

@shreedharn sorry for the late reply. App Mesh also uses the discover-instances API and returns the instances as-is to Envoy. All load balancing is currently handled by Envoy itself, so the randomness appears to be due to that. Besides documenting this behavior, we'd like to make it behave in the way that makes the most sense for our customers.

I'm worried we might also be re-randomizing the nodes every time nodes are added or removed. I'll take a look at that as well. I'm not entirely sure Envoy handles this correctly.

shreedharn commented 5 years ago

@lavignes Thanks for the info. Can you also document the details of Envoy's default load-balancing algorithm? When we checked the Envoy config dump (envoy_eni_ip:9901/config_dump), there was an entry "lb_policy": "ORIGINAL_DST_LB" under dynamic_active_clusters. The Envoy documentation has a link describing the original destination LB, but we are not sure how it works in the context of ECS, Cloud Map and App Mesh. Is there any way we can configure the load-balancing algorithm in Envoy? For instance, can we change it to round robin or weighted least request? The Envoy documentation in the App Mesh docs is very limited. It would be very helpful if it could be expanded with these details.
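To make the config dump check repeatable, here is a minimal sketch that lists the `lb_policy` of each entry under `dynamic_active_clusters`. The sample JSON is a hand-written stand-in, not real App Mesh output: the cluster names are hypothetical, and the `configs` wrapper and the nested `cluster` object are assumptions about the dump layout that may differ between Envoy versions.

```python
import json

# Hand-written stand-in for the output of envoy_eni_ip:9901/config_dump.
SAMPLE_CONFIG_DUMP = """
{
  "configs": [
    {
      "dynamic_active_clusters": [
        {"cluster": {"name": "hypothetical_service_b", "type": "LOGICAL_DNS"}},
        {"cluster": {"name": "hypothetical_egress",
                     "type": "ORIGINAL_DST",
                     "lb_policy": "ORIGINAL_DST_LB"}}
      ]
    }
  ]
}
"""


def cluster_lb_policies(config_dump):
    """Map cluster name -> lb_policy string, or None when the field is absent.

    An absent field means the cluster relies on Envoy's default policy.
    """
    policies = {}
    for section in config_dump.get("configs", []):
        for entry in section.get("dynamic_active_clusters", []):
            cluster = entry.get("cluster", {})
            policies[cluster.get("name", "<unnamed>")] = cluster.get("lb_policy")
    return policies


for name, policy in cluster_lb_policies(json.loads(SAMPLE_CONFIG_DUMP)).items():
    print(f"{name}: {policy or '(not set, Envoy default)'}")
```

Against a real sidecar, the same function could be fed the parsed body of the admin endpoint's response instead of the sample string.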

shubharao commented 4 years ago

We configure round-robin load balancing, and this is not yet configurable in the App Mesh API (feel free to open a feature request if you need it). Note that Envoy does not connect to a different endpoint for each request in order to balance across all endpoints. It maintains as few TCP connections as possible, creating new ones only when the existing connections are busy. Once the connections are open, requests are distributed round-robin. We will update the App Mesh docs to cover this. Good point on highlighting the doc gap, thank you!
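The effect described in this thread can be reproduced with a toy model. The sketch below is not Envoy's implementation: `ToyPool` and `run` are hypothetical names, each connection carries one request at a time, and an idle connection is always reused before a new one is opened. It only illustrates why sequential traffic sticks to one task while concurrent traffic spreads out.

```python
from collections import Counter
from itertools import cycle


class ToyPool:
    """Toy connection pool: reuse idle connections, open new ones only when all are busy."""

    def __init__(self, endpoints):
        # New connections are assigned to endpoints in round-robin order.
        self._next_endpoint = cycle(endpoints)
        self.connections = []  # each item: {"endpoint": str, "busy": bool}

    def acquire(self):
        for conn in self.connections:
            if not conn["busy"]:
                conn["busy"] = True
                return conn
        # Every existing connection is busy, so open a new one.
        conn = {"endpoint": next(self._next_endpoint), "busy": True}
        self.connections.append(conn)
        return conn

    def release(self, conn):
        conn["busy"] = False


def run(pool, total_requests, concurrency):
    """Send requests in batches of `concurrency` in-flight requests; count hits per endpoint."""
    hits = Counter()
    for start in range(0, total_requests, concurrency):
        batch_size = min(concurrency, total_requests - start)
        batch = [pool.acquire() for _ in range(batch_size)]
        for conn in batch:
            hits[conn["endpoint"]] += 1
        for conn in batch:
            pool.release(conn)
    return hits


print(run(ToyPool(["task-1", "task-2"]), 10, concurrency=1))
print(run(ToyPool(["task-1", "task-2"]), 10, concurrency=2))
```

With one request in flight at a time, all ten requests land on `task-1`. With two in flight, the second request of the first batch forces a new connection to `task-2`, and the load splits evenly from then on.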