Open dastbe opened 4 years ago
WIP issue against Envoy: https://github.com/envoyproxy/envoy/issues/12389
We're gathering some more information to help root cause the issue in Envoy.
This should be fixed in Envoy 1.17
Correction: The fix wasn't actually in Envoy 1.17 Now tracking in https://github.com/envoyproxy/envoy/issues/17529
Summary
When a source Virtual Node routes to a destination Virtual Node that has both TLS and active healthchecks configured, Envoy initialization will be delayed by upwards of 60 seconds.
Steps to Reproduce
Are you currently working around this issue?
Disabling one of healthchecks or TLS, or switching to Cloud Map based Service Discovery
Additional context
This occurs due to a race condition in Envoy in at least the 1.12.x series. Envoy initiates the first healthcheck on a cluster before the ACM-backed secret is retrieved, resulting in a health check connection failure. Specifically for DNS backed clusters, Envoy does not consider this a "round" of healthchecking and so waits for another round of healthchecks to occur. The same scenario occurs for Cloud Map based Service Discovery, but Envoy does consider this a round of healthchecks and so continues initialization.
Because there is no traffic on the cluster Envoy leverages a
no_traffic_interval
instead of the healthcheck interval, which by default is 60 seconds. After this interval, Envoy initiates another round of healthchecks which it then considers sufficient for continuing initialization.