aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service to service communication

Feature Request: AppMesh Control Plane to take over ServiceDiscovery in DNS mode. #271

Open achevuru opened 4 years ago

achevuru commented 4 years ago

If you want to see App Mesh implement this idea, please upvote with a :+1:.

Tell us about your request What do you want us to build?

In DNS-based service discovery mode, AppMesh configures Envoys to resolve backends via the cluster-local DNS (CoreDNS). Each Envoy queries CoreDNS to resolve its backend services and refreshes the entries every 5s. As we scale the number of Envoys in the cluster, the number of CoreDNS calls they make increases proportionately and puts a high load on CoreDNS. We observed that a 5000 Envoy test with 5 backend VirtualServices per VirtualNode resulted in more than 1 million req/min to CoreDNS, and we had to scale CoreDNS to 225+ replicas to get the tests working. In our tests, ndots was left at its default value of 5, so each Envoy request (not an FQDN) to CoreDNS was in turn queried across all the configured search domains.

For a 5000 Envoy test with 5 backend VirtualServices per VirtualNode: 5000 Envoys × 5 backend services × 12 queries/minute/backend (one every 5s) = 300,000 req/min to CoreDNS. This assumes the best-case scenario where Envoy queries with an FQDN.
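The arithmetic above can be checked with a short Python sketch. The function name and the `queries_per_lookup` multiplier are illustrative assumptions, not AppMesh or CoreDNS settings. The multiplier of 4 is only an example of how search-domain expansion of non-FQDN names could push the best-case figure past the 1 million req/min seen in the test.

```python
def coredns_requests_per_minute(envoys, backends_per_envoy,
                                refresh_interval_s=5.0, queries_per_lookup=1):
    """Estimate CoreDNS load from periodic Envoy DNS refreshes.

    queries_per_lookup is a hypothetical multiplier for the extra queries
    a non-FQDN name causes when it is tried against each search domain.
    """
    refreshes_per_minute = 60.0 / refresh_interval_s  # 5s interval -> 12 per minute
    return envoys * backends_per_envoy * refreshes_per_minute * queries_per_lookup


# Best case from the issue: every lookup is a single FQDN query.
best_case = coredns_requests_per_minute(5000, 5)

# Assumed example: each lookup fans out into 4 queries via search domains.
with_search_domains = coredns_requests_per_minute(5000, 5, queries_per_lookup=4)

# Same best case at the 15K Envoy scale mentioned in the issue.
at_15k = coredns_requests_per_minute(15000, 5)

print(int(best_case), int(with_search_domains), int(at_15k))
```

Passing a larger `refresh_interval_s` shows how much a longer refresh interval alone would reduce the load.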

Issue: The current approach may not be ideal for DNS ServiceDiscovery mode as we scale beyond 5K pods or increase the number of backend services per VirtualNode. Load on CoreDNS grows as we scale Envoys, and we will run into challenges if we want to scale to 15K+ Envoys.

Feature Request: Have the AppMesh control plane take over DNS resolution of services and distribute the configs to Envoys, similar to CloudMap mode. All the Envoys behind a single VirtualNode have the same set of backends, but in the current flow every Envoy individually queries CoreDNS every 5s. This produces many duplicate requests and increases the load on CoreDNS.
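A minimal sketch of the idea, assuming a central component that knows which VirtualNode is interested in which backend: resolve each distinct hostname once per cycle and push only what changed. `resolve_and_diff`, the VirtualNode names, and the hostnames are hypothetical and are not part of any AppMesh API.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Addrs = Tuple[str, ...]


def resolve_and_diff(subscriptions: Dict[str, List[str]],
                     resolve: Callable[[str], Sequence[str]],
                     last_pushed: Dict[str, Addrs]) -> Dict[str, Dict[str, Addrs]]:
    """Resolve each distinct backend once; return per-VirtualNode changes only.

    last_pushed is updated in place with the latest addresses per hostname.
    """
    distinct_hosts = {h for hosts in subscriptions.values() for h in hosts}
    changed: Dict[str, Addrs] = {}
    for host in sorted(distinct_hosts):
        addrs = tuple(sorted(resolve(host)))
        if last_pushed.get(host) != addrs:
            last_pushed[host] = addrs
            changed[host] = addrs
    updates: Dict[str, Dict[str, Addrs]] = {}
    for node, hosts in subscriptions.items():
        delta = {h: changed[h] for h in hosts if h in changed}
        if delta:
            updates[node] = delta
    return updates


# Hypothetical mesh: two VirtualNodes sharing one backend.
ORDERS = "orders.default.svc.cluster.local"
USERS = "users.default.svc.cluster.local"
subscriptions = {"vn-frontend": [ORDERS, USERS], "vn-checkout": [ORDERS]}
records = {ORDERS: ["10.0.0.1"], USERS: ["10.0.0.2"]}
calls: List[str] = []


def fake_resolve(host: str) -> List[str]:
    calls.append(host)  # stands in for one DNS query
    return records[host]


last_pushed: Dict[str, Addrs] = {}
first = resolve_and_diff(subscriptions, fake_resolve, last_pushed)
second = resolve_and_diff(subscriptions, fake_resolve, last_pushed)
records[USERS] = ["10.0.0.9"]
third = resolve_and_diff(subscriptions, fake_resolve, last_pushed)
```

In this sketch the number of DNS queries per cycle depends on the number of distinct backends rather than on the number of Envoys, and an unchanged cycle pushes nothing.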

Which integration(s) is this request for? All (Request is however based on scale test experience on EKS).

Are you currently working around this issue? Yes

How are you currently solving this problem? Scaling CoreDNS.

bcelenza commented 4 years ago

I would suggest we don't rush to the conclusion that the control plane should be resolving DNS, but rather keep the issue focused on the general scalability of DNS resolution for high scale clusters (and cut separate issues for known work). Given the managed control plane doesn't operate in the cluster, it does not have access to this information by default.

There are a few other possible solutions here:

  1. Make DNS resolution frequency configurable (stopgap)
  2. Add a caching layer between CoreDNS and the Envoys to help scale out (effectively the same as the control plane distributing the info)
  3. Improve the Cloud Map experience such that it replaces CoreDNS for high scale clusters
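To illustrate option 2, here is a rough sketch of a shared TTL cache sitting in front of the upstream resolver. `CachingResolver`, the fixed `ttl_s` value, and the injected clock are assumptions made for the example. A real caching layer would normally honor the TTL on each DNS record rather than a fixed one.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


class CachingResolver:
    """Answer repeated lookups from memory until the cached entry expires."""

    def __init__(self, upstream: Callable[[str], Sequence[str]],
                 ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._upstream = upstream
        self._ttl_s = ttl_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Tuple[str, ...]]] = {}

    def resolve(self, host: str) -> Tuple[str, ...]:
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no upstream query
        addrs = tuple(self._upstream(host))
        self._cache[host] = (now + self._ttl_s, addrs)
        return addrs


# Fake clock and upstream so the behavior can be checked deterministically.
now = [0.0]
upstream_hits: List[str] = []


def fake_upstream(host: str) -> List[str]:
    upstream_hits.append(host)  # stands in for one CoreDNS query
    return ["10.0.0.1"]


cache = CachingResolver(fake_upstream, ttl_s=30.0, clock=lambda: now[0])
for _ in range(100):  # e.g. 100 Envoys asking for the same backend
    cache.resolve("orders.default.svc.cluster.local")
hits_within_ttl = len(upstream_hits)

now[0] = 31.0  # move past the TTL
cache.resolve("orders.default.svc.cluster.local")
hits_after_expiry = len(upstream_hits)
```

Within one TTL window the upstream sees a single query per hostname regardless of how many clients ask, which is the same deduplication effect as the control plane distributing the info.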

achevuru commented 4 years ago

Sure @bcelenza, understood. The feature request for the AppMesh CP to manage this is mainly because it has visibility into which Envoy is interested in which backend, and it can selectively push the information only when there is a change, making it more efficient. Had a chat with @dastbe a while ago about this... We do have viable workarounds for now.

  1. Make DNS resolution frequency configurable (stopgap)

    • Agree, this can be a near-term solution. Right now, in AppMesh DNS mode, Envoys operate with Service VIPs as opposed to individual endpoints (https://github.com/aws/aws-app-mesh-roadmap/issues/238). So the Service VIP will not change even if the deployment behind a VirtualNode scales up or down, and we can probably increase the refresh interval (i.e., refresh less often).
  2. Add a caching layer between CoreDNS and the Envoys to help scale out (effectively the same as the control plane distributing the info)

  3. Improve the Cloud Map experience such that it replaces CoreDNS for high scale clusters

    • Agree, might be the best alternative.

Will open a separate issue for the ability to configure the refresh frequency in Envoys...