envoyproxy / go-control-plane

Go implementation of data-plane-api
Apache License 2.0

Unexpected gRPC Timeout on EDS Update with Delta xDS #1001

Closed sefaphlvn closed 2 weeks ago

sefaphlvn commented 1 month ago

I am using go-control-plane v13 with Delta ADS and snapshots. The initial snapshot works correctly, and Envoy successfully fetches all configurations when it starts. However, when I update the snapshot with changes specifically in the Cluster Discovery Service (CDS), I encounter the following error in Envoy:

[2024-09-12 14:09:24.548][33155414][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 1 cluster(s), remove 0 cluster(s)
[2024-09-12 14:09:24.550][33155414][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2024-09-12 14:09:34.548][33155414][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment

envoy version: 1.28.0

Additional Information:

Please let me know if additional logs or information are needed. I am looking for guidance on whether this could be a bug in go-control-plane or if there are specific configurations or steps that I might be missing.

valerian-roche commented 1 month ago

Hey, I think you are encountering https://github.com/envoyproxy/envoy/issues/26749. This issue was addressed in Envoy v1.28, but the fix is gated behind a runtime flag, which was only switched on by default in v1.31. Can you activate the flag on your instances and confirm whether it fixes the problem? We have used it internally since v1.28 and it alleviated the issue. Before that, we had to add deep hooks in the control plane to work around it; these were not upstreamed because they were too brittle.

sefaphlvn commented 1 month ago

Thank you for your response! I activated the flag as you suggested, and it resolved the issue. The Envoy instance now correctly retains the cached ClusterLoadAssignment when the initial fetch times out, so the cluster members are maintained and do not disappear.

However, I still see the initial fetch timeout warning, which seems like it will continue until a permanent solution is found for this issue.

[2024-09-12 18:30:07.187][34060614][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
[2024-09-12 18:30:07.187][34060614][debug][upstream] [source/extensions/clusters/eds/eds.cc:453] Did not receive EDS response on time, using cached ClusterLoadAssignment for cluster test
[2024-09-12 18:30:07.187][34060614][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 192.168.1.2:12

Thanks again for pointing me in the right direction!

sefaphlvn commented 1 month ago

It appears that there’s another case. In my setup, I have two distinct listeners, each with its own HTTP Connection Manager (HCM) filter. Both HCM filters are linked to an RDS named “v28rds.”

When I attempt to update the route for one of the HCM filters by changing the RDS configuration to use a new route configuration named “route_for_ccc”, the RDS definition of the relevant HCM in the config_dump reflects this change as follows:

"rds": {
    "config_source": {
        "ads": {},
        "initial_fetch_timeout": "10s",
        "resource_api_version": "V3"
    },
    "route_config_name": "route_for_ccc"
},

However, the new route configuration does not appear in the config_dump.

This indicates that the updated route is not being applied correctly, as it doesn’t even show up in the config_dump. Could you provide insight into whether this is a known limitation or suggest any steps to ensure that Envoy properly updates and applies the new route configuration without requiring a restart?

[2024-09-13 10:43:04.594][37022954][debug][main] [source/server/server.cc:237] flushing stats
[2024-09-13 10:43:09.292][37022954][debug][http2] [source/common/http/http2/codec_impl.cc:1803] [Tags: "ConnectionId":"0"] Http2Visitor::OnFrameHeader(1, 497, 0, 0)
[2024-09-13 10:43:09.292][37022954][debug][http2] [source/common/http/http2/codec_impl.cc:1855] [Tags: "ConnectionId":"0"] Http2Visitor::OnBeginDataForStream(1, 497)
[2024-09-13 10:43:09.292][37022954][debug][http2] [source/common/http/http2/codec_impl.cc:1867] [Tags: "ConnectionId":"0"] Http2Visitor: remaining data payload: 497, end_stream: false
[2024-09-13 10:43:09.292][37022954][debug][http2] [source/common/http/http2/codec_impl.cc:1896] [Tags: "ConnectionId":"0"] Http2Visitor dispatching DATA for stream 1
[2024-09-13 10:43:09.293][37022954][debug][config] [source/extensions/config_subscription/grpc/new_grpc_mux_impl.cc:143] Received DeltaDiscoveryResponse for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig at version 10
[2024-09-13 10:43:09.293][37022954][debug][filter] [source/common/filter/config_discovery_impl.cc:132] Updated filter config eeeeeeeoHADD-fcjyJPtD-filter accepted, posting to workers
[2024-09-13 10:43:09.293][37022954][debug][init] [source/common/init/manager_impl.cc:24] added target RdsRouteConfigSubscription RDS local-init-target route_for_ccc to init manager RDS local-init-manager route_for_ccc
[2024-09-13 10:43:09.293][37022954][debug][config] [./source/common/http/filter_chain_helper.h:111]     http filter #0
[2024-09-13 10:43:09.293][37022954][debug][config] [./source/common/http/filter_chain_helper.h:173]       dynamic filter name: http-filters-bgBMRm
[2024-09-13 10:43:09.293][37022954][debug][filter] [source/common/filter/config_discovery_impl.cc:146] Updated filter config eeeeeeeoHADD-fcjyJPtD-filter created, warming done
[2024-09-13 10:43:09.293][37022954][debug][config] [source/extensions/config_subscription/grpc/delta_subscription_state.cc:262] Delta config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig accepted with 1 resources added, 0 removed
[2024-09-13 10:43:09.300][37022954][debug][init] [source/common/init/watcher_impl.cc:31] init manager RDS local-init-manager v28rds destroyed
[2024-09-13 10:43:09.300][37022954][debug][init] [source/common/init/target_impl.cc:34] target RdsRouteConfigSubscription RDS local-init-target v28rds destroyed
[2024-09-13 10:43:09.300][37022954][debug][init] [source/common/init/watcher_impl.cc:31] RDS local-init-watcher v28rds destroyed
[2024-09-13 10:43:09.300][37022954][debug][init] [source/common/init/target_impl.cc:68] shared target RdsRouteConfigSubscription RDS init v28rds destroyed
[2024-09-13 10:43:09.300][37022954][debug][init] [source/common/init/target_impl.cc:34] target DynamicFilterConfigProviderImpl destroyed
[2024-09-13 10:43:09.300][37022954][debug][filter] [source/common/filter/config_discovery_impl.cc:181] Filter config eeeeeeeoHADD-fcjyJPtD-filter worker update complete

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 weeks ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

valerian-roche commented 1 week ago

Hey, sorry for the delayed reply. I am not as familiar with the RDS client code in envoy, so I cannot answer for sure, but you may want to open an issue on envoy. As the code of RDS is share-nothing with the CDS code I expect the issue to be different from the one solved by the ADS cache in EDS. You may also want to test the control-plane version of branch dd/main in this fork, as there are multiple fixes for delta xDS done there which have not been upstreamed yet.

haorenfsa commented 1 week ago

When I attempt to update the route for one of the HCM filters by changing the RDS configuration to use a new route configuration named “route_for_ccc” the RDS definition of the relevant HCM in the config_dump reflects this change like that:

"rds": {
    "config_source": {
        "ads": {},
        "initial_fetch_timeout": "10s",
        "resource_api_version": "V3"
    },
    "route_config_name": "route_for_ccc"
},

Your error is an EDS response timeout:

initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment.

The configuration that needs to change is for EDS, not RDS.

sefaphlvn commented 1 week ago

I asked two separate questions; the second one is about RDS. When the RDS name is updated, I add the new route configuration to the snapshot, but Envoy does not fetch it until it restarts.