aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service-to-service communication.
Apache License 2.0

Bug: Active healthchecks, TLS, and DNS service discovery on a Virtual Node can delay Envoy initialization #227

Open dastbe opened 4 years ago

dastbe commented 4 years ago

Summary

When a source Virtual Node routes to a destination Virtual Node that uses DNS service discovery and has both TLS and active healthchecks configured, Envoy initialization is delayed by 60 seconds or more.

Steps to Reproduce

  1. Configure a Virtual Node with DNS service discovery, TLS via ACM, and active healthchecks
  2. Make this Virtual Node (indirectly) the provider for a Virtual Service
  3. Have another Virtual Node depend on the above Virtual Service
  4. Launch a new application in the above Virtual Node
  5. Observe that Envoy takes at least 60 seconds to initialize
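
For illustration, the Virtual Node from step 1 could be described with a spec along these lines. This is a minimal sketch, not the configuration used by the reporter: the helper name `build_virtual_node_spec`, the hostname, the certificate ARN, the port, and the healthcheck values are all hypothetical placeholders. The field names follow the App Mesh Virtual Node spec.

```python
def build_virtual_node_spec(hostname, certificate_arn, port=8080):
    """Build a Virtual Node spec that combines the three features from this issue.

    Hypothetical helper for illustration: DNS service discovery,
    TLS backed by an ACM certificate, and an active healthcheck.
    """
    return {
        # DNS service discovery (the variant affected by the delay)
        "serviceDiscovery": {"dns": {"hostname": hostname}},
        "listeners": [
            {
                "portMapping": {"port": port, "protocol": "http"},
                # Active healthcheck on the listener (placeholder values)
                "healthCheck": {
                    "protocol": "http",
                    "path": "/ping",
                    "healthyThreshold": 2,
                    "unhealthyThreshold": 2,
                    "intervalMillis": 5000,
                    "timeoutMillis": 2000,
                },
                # TLS with a certificate served from ACM
                "tls": {
                    "mode": "STRICT",
                    "certificate": {"acm": {"certificateArn": certificate_arn}},
                },
            }
        ],
    }


# Placeholder values; substitute your own hostname and certificate ARN.
spec = build_virtual_node_spec(
    hostname="backend.example.local",
    certificate_arn="arn:aws:acm:us-west-2:111122223333:certificate/EXAMPLE",
)
```

A dict like this could be passed as the `spec` argument of `create_virtual_node` on the boto3 `appmesh` client, together with `meshName` and `virtualNodeName`.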

Are you currently working around this issue?

Yes, by disabling either healthchecks or TLS, or by switching to Cloud Map based Service Discovery.

Additional context

This is caused by a race condition in Envoy, present in at least the 1.12.x series. Envoy initiates the first healthcheck on a cluster before the ACM-backed secret has been retrieved, so the healthcheck connection fails. For DNS-backed clusters specifically, Envoy does not count this failed attempt as a "round" of healthchecking and waits for another round to occur. The same failure happens with Cloud Map based Service Discovery, but there Envoy does count it as a round of healthchecks and so continues initialization.

Because there is no traffic on the cluster, Envoy uses the no_traffic_interval, which defaults to 60 seconds, instead of the healthcheck interval. After this interval, Envoy initiates another round of healthchecks, which it then considers sufficient for continuing initialization.
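
The timing described above can be modelled roughly as follows. This is a simplified sketch of the behaviour reported in this issue, not Envoy's actual code; the function name `time_to_initialize` and its parameters are hypothetical.

```python
def time_to_initialize(first_round_counts, first_check_at=0.0, no_traffic_interval=60.0):
    """Return the seconds until Envoy sees a healthcheck round it accepts.

    Simplified model: the first healthcheck always fails because the
    ACM-backed secret is not yet available. What differs is whether
    Envoy treats that failed attempt as a completed round.
    """
    if first_round_counts:
        # Cloud Map case: the failed check still counts, so
        # initialization continues right away.
        return first_check_at
    # DNS case: the failed check is not counted, and with no traffic on
    # the cluster the next round is scheduled no_traffic_interval later.
    return first_check_at + no_traffic_interval


dns_delay = time_to_initialize(first_round_counts=False)
cloud_map_delay = time_to_initialize(first_round_counts=True)
print(dns_delay, cloud_map_delay)  # prints "60.0 0.0"
```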

dastbe commented 4 years ago

WIP issue against Envoy: https://github.com/envoyproxy/envoy/issues/12389

We're gathering some more information to help root cause the issue in Envoy.

lavignes commented 3 years ago

This should be fixed in Envoy 1.17.

Y0Username commented 3 years ago

Correction: the fix wasn't actually in Envoy 1.17. It is now being tracked in https://github.com/envoyproxy/envoy/issues/17529