Bug: Active healthchecks, TLS, and DNS service discovery on a Virtual Node can delay Envoy initialization

dastbe commented 4 years ago

Summary

When a source Virtual Node routes to a destination Virtual Node that has both TLS and active healthchecks configured, Envoy initialization will be delayed by upwards of 60 seconds.

Steps to Reproduce

1. Configure a Virtual Node with TLS via ACM and active healthchecks.
2. Make this Virtual Node (indirectly) the provider for a Virtual Service.
3. Have another Virtual Node depend on the above Virtual Service.
4. Launch a new application in that dependent Virtual Node.
5. Observe that Envoy takes at least 60 seconds to initialize.

Are you currently working around this issue?

Disabling either healthchecks or TLS, or switching to Cloud Map-based service discovery.
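To make the affected combination concrete, here is a minimal sketch that checks a Virtual Node spec for it. The dictionary keys (listeners, healthCheck, tls, serviceDiscovery, dns, awsCloudMap) follow the App Mesh VirtualNodeSpec shape as I understand it; the hostname, certificate ARN, and the is_affected helper are hypothetical and only for illustration.

```python
def is_affected(spec: dict) -> bool:
    """Return True if the spec combines DNS service discovery with a
    listener that has both TLS and an active healthcheck configured."""
    uses_dns = "dns" in spec.get("serviceDiscovery", {})
    tls_and_healthcheck = any(
        "tls" in listener and "healthCheck" in listener
        for listener in spec.get("listeners", [])
    )
    return uses_dns and tls_and_healthcheck


# Hypothetical destination Virtual Node spec; all values are placeholders.
destination_spec = {
    "listeners": [
        {
            "portMapping": {"port": 8080, "protocol": "http"},
            "healthCheck": {
                "protocol": "http",
                "path": "/ping",
                "intervalMillis": 5000,
                "timeoutMillis": 2000,
                "healthyThreshold": 2,
                "unhealthyThreshold": 2,
            },
            "tls": {
                "mode": "STRICT",
                "certificate": {"acm": {"certificateArn": "arn:aws:acm:EXAMPLE"}},
            },
        }
    ],
    "serviceDiscovery": {"dns": {"hostname": "backend.example.internal"}},
}

# Same spec, switched to Cloud Map service discovery (one of the workarounds).
cloud_map_spec = dict(
    destination_spec,
    serviceDiscovery={
        "awsCloudMap": {"namespaceName": "example.local", "serviceName": "backend"}
    },
)

print(is_affected(destination_spec))
print(is_affected(cloud_map_spec))
```

Removing either the tls or the healthCheck entry from the listener also makes the check return False, which mirrors the other two workarounds.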

Additional context

This occurs due to a race condition in Envoy in at least the 1.12.x series. Envoy initiates the first healthcheck on a cluster before the ACM-backed secret is retrieved, resulting in a healthcheck connection failure. For DNS-backed clusters specifically, Envoy does not count this failed attempt as a "round" of healthchecking, so it waits for another round of healthchecks to occur. The same failure happens with Cloud Map-based service discovery, but there Envoy does count it as a round of healthchecks and so continues initialization.

Because there is no traffic on the cluster, Envoy uses the no_traffic_interval, which defaults to 60 seconds, instead of the regular healthcheck interval. After this interval elapses, Envoy initiates another round of healthchecks, which it then considers sufficient for continuing initialization.
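The timing described above can be sketched as a toy model. This is not Envoy's code; it only restates the reasoning in this issue. HealthCheckTiming, estimated_init_delay_s, and the 5 second regular interval are assumptions made for illustration, and the model ignores the time spent on the failed connection itself.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckTiming:
    # Hypothetical regular interval; not used while the cluster has no traffic.
    interval_s: float = 5.0
    # Default no_traffic_interval, per the description in this issue.
    no_traffic_interval_s: float = 60.0


def estimated_init_delay_s(first_attempt_counts_as_round: bool,
                           timing: HealthCheckTiming) -> float:
    """Estimate the extra initialization delay after the first healthcheck
    fails because the TLS secret has not been retrieved yet."""
    if first_attempt_counts_as_round:
        # Cloud Map case: the failed attempt still completes a round,
        # so initialization continues without waiting.
        return 0.0
    # DNS case: Envoy waits for the next round, which is scheduled using
    # no_traffic_interval because the cluster has seen no traffic.
    return timing.no_traffic_interval_s


timing = HealthCheckTiming()
dns_delay = estimated_init_delay_s(False, timing)
cloud_map_delay = estimated_init_delay_s(True, timing)
print(f"DNS-backed cluster: ~{dns_delay:.0f}s extra delay")
print(f"Cloud Map-backed cluster: ~{cloud_map_delay:.0f}s extra delay")
```

In this model the regular healthcheck interval never enters the calculation, which is why a short interval on the Virtual Node does not shorten the startup delay.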

dastbe commented 4 years ago

WIP issue against Envoy: https://github.com/envoyproxy/envoy/issues/12389

We're gathering some more information to help root cause the issue in Envoy.

lavignes commented 3 years ago

This should be fixed in Envoy 1.17

Y0Username commented 3 years ago

Correction: The fix wasn't actually in Envoy 1.17. Now tracking in https://github.com/envoyproxy/envoy/issues/17529

aws / aws-app-mesh-roadmap

Bug: Active healthchecks, TLS, and DNS service discovery on a Virtual Node can delay Envoy initialization #227