envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.98k stars 4.81k forks source link

Delay HC activation until SDS is initialized #17529

Open flashyang opened 3 years ago

flashyang commented 3 years ago

Description: Recently, we noticed that during the Envoy initialization, there is a race condition between when TLS is configured on a upstream cluster (like validation context) and when active healthcheck begins on that cluster, and it will take around 60s for the Envoy to initialize. In this case, we find that Envoy initiates the first healthcheck on the upstream cluster before the validation context is retrieved, resulting in a health check connection failure and the healthcheck interval will fall back to the no_traffic_interval (because there is no traffic on the cluster). While for Envoy cluster which uses STATIC_DNS and EDS this appears to not delay Envoy initialization, it appears that Envoy cluster using LOGICAL_DNS will wait out the no_traffic_interval to healthcheck again before it considers itself fully initialized.

I have create a same issue https://github.com/envoyproxy/envoy/issues/15977 before, but that issue was closed without a fix. The most close fix is this commit https://github.com/envoyproxy/envoy/pull/16236 but it wasn't be merged.

mattklein123 commented 3 years ago

So you are using SDS on the upstream cluster and we are not waiting for SDS to finish before starting health checking? Is that right? If so I agree this should be fixed.

flashyang commented 3 years ago

Yes, that's the case. I think we should delay the HC on the upstream cluster until the SDS resources are ready for them.

mattklein123 commented 3 years ago

OK yes we should fix this. Marking it help wanted. cc @lambdai

mpuncel commented 3 years ago

@lizan (since you added SDS support as far as I know), is this as simple as just moving healthchecker->start() to onInitDone()? That should only run after the init manager has finished with all targets which I believe includes an initial SDS secret load for all transport socket matches. I wonder if my PR is overly complicated.

lizan commented 3 years ago

That might work, but I'm not pretty sure.

tnsardesai commented 1 year ago

Hi are there any updates on this ticket? I believe this is still an ongoing issue