envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

EDS multiplexing causes EDS initial fetch failure #25195

Open xjtian opened 1 year ago

xjtian commented 1 year ago

Description:

We're running a gRPC SotW (state-of-the-world) control plane for xDS. When we deployed a changeset that included https://github.com/envoyproxy/envoy/pull/22419, Envoy could not fetch EDS from the control plane at all and timed out on the initial EDS fetch. When we disabled the feature with the runtime flag envoy.reloadable_features.multiplex_eds and restarted Envoy, EDS worked properly again.
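For anyone who needs the same workaround, here is a minimal sketch of the runtime override, written as a small Python script that prints a JSON bootstrap fragment. The flag name is the one from this report. The layer name `static_layer_0` and the helper `runtime_override` are illustrative assumptions, and the fragment has to be merged into your own bootstrap config rather than used as-is.

```python
import json

# Runtime flag named in this issue.
FLAG = "envoy.reloadable_features.multiplex_eds"


def runtime_override(enabled: bool) -> dict:
    """Build a bootstrap fragment with a static runtime layer for the flag.

    The layer name is arbitrary (an assumption for this sketch); only the
    key inside `static_layer` has to match the flag name.
    """
    return {
        "layered_runtime": {
            "layers": [
                {
                    "name": "static_layer_0",
                    "static_layer": {FLAG: enabled},
                }
            ]
        }
    }


if __name__ == "__main__":
    # Emit the fragment that turns EDS multiplexing off.
    print(json.dumps(runtime_override(False), indent=2))
```

A static layer is only read at startup, which matches the restart mentioned above.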

Apart from the EDS failures, we saw many occurrences of this log message, which had never come up before (it looks like maybe one per EDS cluster):

[warning][config] [external/envoy/source/common/config/grpc_stream.h:65] gRPC bidi stream to envoymanager for rpc StreamEndpoints(stream .envoy.service.discovery.v3.DiscoveryRequest) returns (stream .envoy.service.discovery.v3.DiscoveryResponse); already exists!

The PR was reverted in https://github.com/envoyproxy/envoy/pull/25157 for a (maybe unrelated) reason, so I'm filing this just for situational awareness when the EDS muxing change goes back in.

mattklein123 commented 1 year ago

Thanks for the report. cc @nezdolik

nezdolik commented 1 year ago

@xjtian which control plane do you use? Is it go-control-plane by any chance?

xjtian commented 1 year ago

Yeah, we're using go-control-plane (v0.10.3-0.20221215163201-b9a8bb7af6f7)

ggreenway commented 1 year ago

Maybe when this is added back in, it should be config-guarded instead of enabled-by-default with a runtime setting to override? It wouldn't surprise me if there are other control planes out there that also hit issues with this change.

nezdolik commented 1 year ago

I collected the symptoms of the failing xDS flow in go-control-plane (CP) for EDS SotW. Now that a single stream is used, the failing flow looks like:

We have not bumped into this problem with our custom control plane (based on the Java OSS control plane library), most likely because we transform EDS streams into ADS streams in our control plane and proxy those requests to another control plane via ADS.
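To make the behavioural difference concrete, here is a toy Python model of what a SotW control plane sees with and without multiplexing. It is a sketch of the general xDS SotW semantics (each request on a stream lists every resource name subscribed so far), not Envoy's or go-control-plane's actual code. The `Request` type, the function names, and the cluster names are hypothetical, and a real client may coalesce several of these requests into one.

```python
from dataclasses import dataclass
from typing import List

EDS_TYPE_URL = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"


@dataclass
class Request:
    stream_id: int             # which gRPC stream carries the request
    type_url: str
    resource_names: List[str]  # SotW: the full set wanted on this stream


def per_cluster_streams(clusters: List[str]) -> List[Request]:
    """Without multiplexing: one stream per EDS cluster, one name each."""
    return [Request(i, EDS_TYPE_URL, [c]) for i, c in enumerate(clusters)]


def multiplexed_stream(clusters: List[str]) -> List[Request]:
    """With multiplexing: one shared stream; every new subscription
    re-sends the accumulated list of resource names."""
    requests: List[Request] = []
    subscribed: List[str] = []
    for c in clusters:
        subscribed.append(c)
        requests.append(Request(0, EDS_TYPE_URL, list(subscribed)))
    return requests


if __name__ == "__main__":
    for r in multiplexed_stream(["cluster_a", "cluster_b", "cluster_c"]):
        print(r.stream_id, r.resource_names)
```

A control plane that assumes one resource name per EDS stream would handle the first shape but could mishandle the second, which is one plausible place to look when debugging.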

@ggreenway yes, that would be a graceful way to introduce such a change.

I need some time to think about what a proper fix would be.