envoyproxy / go-control-plane

Go implementation of data-plane-api
Apache License 2.0
1.53k stars 515 forks source link

bug: envoy initial fetch time out when CDS updated. #1035

Open haorenfsa opened 1 month ago

haorenfsa commented 1 month ago

Server should send one more EDS response when any CDS update's ACK is received.

refs: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#xds-ack-nack

image image

In my test , after CDS updated in controller, there's no EDS sent, hence the cluster warming time out.

As I highlighted my logs below: The EDS update responses are sent during 08:17:55 - 08:18:46 before the CDS update (nonce=10) is ACKed by envoy.

image

At 08:21:46 (I set my envoy initial fetch timeout to 3min) client side printed initial fetch timeout. because after last CDS updates, there's no EDS response arrived.

image

PS: I'm using envoy gateway in my environment.

some logs are added by my debug version here:

image
valerian-roche commented 3 weeks ago

Hey, thanks for opening the issue. I believe this is the same underlying issue as https://github.com/envoyproxy/go-control-plane/issues/1001, which has now been addressed in envoy. The runtime flag mentioned in the documentation capture has now been defaulted to true as of envoy 1.32, which may not be the version used in your case. I do not believe the control-plane library will ever address this issue. This would require deep inspection of user resources which is the opposite direction we should take in my opinion. It would also tie more the control-plane with envoy, whereas we have been striving at supporting gRPC clients in recent times.

Can you confirm if you are still encountering this issue when using envoy 1.32.x or activating the runtime flag mentioned in the documentation?

haorenfsa commented 3 weeks ago

Thank you so much! @valerian-roche for the background catch up. I'm using v1.30.6. I'll try the solution you mentioned.

I do not believe the control-plane library will ever address this issue. This would require deep inspection of user resources which is the opposite direction we should take in my opinion. It would also tie more the control-plane with envoy, whereas we have been striving at supporting gRPC clients in recent times.

By the way, I fully agree with you that we should not make any changes that would tie control-plane with envoy.

I think the patch would introduce only a few changes, to make control-plane work with old envoy versions without affect other xDS clients. This enables users to adopt control-plane solutions like envoy-gateway for Kubernetes without changing data plane version. It would be great if you would reconsider with it.

Anyway, thank you again for taking time😊

valerian-roche commented 3 weeks ago

Given that envoy has now defaulted the fix I feel quite strongly that this issue does not justify this abstraction leakage. Users of envoy-gateway do not have to update the data-plane version, as they can simply activate the runtime flag to address the issue is envoy < 1.32. All supported version of envoy have a functional implementation of the EDS cache.

valerian-roche commented 3 weeks ago

@alecholmez FYI I discussed PR #1034 here. Imo as envoy fixed it in 1.32 as default we can keep the control-plane simpler

lukidzi commented 3 weeks ago

I think this could still be an issue. If a user sets initial_fetch_timeout to 0, it disables the timeout, causing Envoy to wait indefinitely.

Envoy waits for an EDS assignment until initial_fetch_timeout times out, and will then apply the cached assignment and finish updating the warmed cluster.

valerian-roche commented 15 hours ago

I am unclear why a user would set this value as this breaks the EDS behavior of envoy in such case. Can you clarify what use-case is expected to use this? Can you also open a new issue on envoy to get this behavior addressed? I do not believe there is a reason for envoy to not use the cache in this context (I am actually unclear why envoy does not use its cached value immediately as there is no real reason why the eds resource in cache would be invalid, and if it has changed it will still be received right after)