Closed itsLucario closed 1 month ago
@itsLucario Thanks for reporting; it does seem to be a bug.
DFPClusters aren't part of the CDS and cluster manager removes it
Or maybe we could make the cluster manager not remove those DFPClusters?
@doujiang24 Yeah, that should work! We can avoid removing clusters during onConfigUpdate if their names match the DFPCluster prefix. This will help prevent frequent removal of DFPClusters if there are continuous CDS pushes.
Handling it in DFP filter to check if the cluster exists in the TLS would prevent the need for DFP-specific changes in cluster manager or cds_api_helper. However, this would also result in removing DFP clusters on every CDS update, which would increase DNS queries and affect performance.
I think it would be better to fix this at the cluster manager level like you mentioned. That way, we can avoid constantly deleting and recreating DFP clusters, and it won't impact performance.
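The prefix-based idea above can be sketched as a toy model. This is not Envoy's cluster manager code: `clustersToRemove`, `hasDfpPrefix`, and the `"DFPCluster:"` prefix value are hypothetical names chosen for illustration, and the real onConfigUpdate path is more involved.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Assumed prefix for DFP sub clusters; the thread only refers to
// "the DFPCluster prefix", so the exact value here is illustrative.
const std::string kDfpPrefix = "DFPCluster:";

// True if `name` starts with the assumed DFP sub cluster prefix.
bool hasDfpPrefix(const std::string& name) {
  return name.compare(0, kDfpPrefix.size(), kDfpPrefix) == 0;
}

// Toy version of the removal step of a CDS update: a cluster is removed
// when it is currently known but absent from the update, unless its name
// carries the DFP prefix.
std::vector<std::string> clustersToRemove(const std::set<std::string>& current,
                                          const std::set<std::string>& cds_update) {
  std::vector<std::string> to_remove;
  for (const auto& name : current) {
    if (cds_update.count(name) > 0) {
      continue; // Still present in CDS, keep it.
    }
    if (hasDfpPrefix(name)) {
      continue; // DFP sub cluster, not owned by CDS, keep it.
    }
    to_remove.push_back(name);
  }
  // Sanity check: we can never remove more clusters than exist.
  assert(to_remove.size() <= current.size());
  return to_remove;
}
```

With this rule, a CDS push that omits a DFP sub cluster leaves it in place, so it is not deleted and recreated (with a fresh DNS lookup) on every update.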
In either case, please let me know. I would be happy to try and contribute the fix for this. Thank you
We can avoid removing clusters during onConfigUpdate if their names match the DFPCluster prefix
Maybe we can add a new flag on the cluster, instead of a prefix match.
It's okay for me to file a PR, as the first contributor of sub_clusters_config, but the final decision is up to the OWNERS @mattklein123 @alyssawilk
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Can this issue be reopened?
Title: DynamicForwardProxy with sub_clusters_config gets stuck with "Sub cluster warming timeout" after dynamic CDS add/update

Description: I am trying to use dynamic_forward_proxy with sub_clusters_config to get the benefits that it provides. Things work fine at startup and Envoy is able to forward requests to the upstream, but as soon as a CDS update arrives and cluster initialization happens, the DFPClusters get removed. Once a DFPCluster has been removed after the CDS update, all further requests result in "Sub cluster warming timeout".

We initially noticed this with Istio on CDS add/update and then tried to reproduce it on Envoy with filesystem-based dynamic config.
We were able to identify the root cause and a potential fix! The dynamic_forward_proxy cluster maintains a cluster_map_ of its own. When a CDS update is triggered, the DFPClusters aren't part of the CDS response, so the cluster manager removes them, but a stale entry is left behind in the cluster_map_ of the dynamic_forward_proxy cluster. On further requests, dynamic_forward_proxy then assumes that the cluster is already present, waits for it to warm up, and eventually times out because the cluster has already been removed (Ref). This keeps happening until dynamic_forward_proxy removes the entry from its map after the TTL expires.

Potential Fix: We tried a potential fix. When entering the wait-for-warmup condition (Ref), we check again with the cluster manager whether the cluster is present. If it is not, we clear the cluster from the DFP map and trigger the add-subcluster path again, so the cluster gets added back to both the cluster manager and the DFP map.
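The re-check described in the potential fix can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: `ToyClusterManager`, `ToyDfpCluster`, `ensureSubCluster`, and `Lookup` are hypothetical stand-ins, not Envoy types, and only the name `cluster_map_` comes from the description above.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for the cluster manager: a set of active cluster names.
struct ToyClusterManager {
  std::set<std::string> active;
  bool has(const std::string& name) const { return active.count(name) > 0; }
  void add(const std::string& name) { active.insert(name); }
};

// Outcome of looking up a sub cluster for a request.
enum class Lookup { Ready, Created, Recreated };

// Toy stand-in for the DFP cluster and its private bookkeeping.
class ToyDfpCluster {
public:
  explicit ToyDfpCluster(ToyClusterManager& cm) : cm_(cm) {}

  Lookup ensureSubCluster(const std::string& name) {
    auto it = cluster_map_.find(name);
    if (it == cluster_map_.end()) {
      addSubCluster(name); // First request for this host.
      return Lookup::Created;
    }
    if (cm_.has(name)) {
      return Lookup::Ready; // Map entry and cluster manager agree.
    }
    // Stale entry: the cluster manager dropped the cluster (for example
    // after a CDS update) but our map still lists it. Instead of waiting
    // for a warm-up that will never finish, clear it and add it again.
    cluster_map_.erase(it);
    addSubCluster(name);
    return Lookup::Recreated;
  }

private:
  void addSubCluster(const std::string& name) {
    assert(!name.empty()); // A sub cluster needs a name.
    cluster_map_.insert(name);
    cm_.add(name);
  }

  ToyClusterManager& cm_;
  std::set<std::string> cluster_map_;
};
```

In this model, clearing `ToyClusterManager::active` simulates the CDS update; the next `ensureSubCluster` call returns `Lookup::Recreated` rather than waiting on a cluster that no longer exists.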
If you feel the above is a bug and not intended behavior, I can try to work on it as my first contribution to Envoy. (Super excited 😄)
Repro steps:
Below is the configuration I'm using to reproduce the issue:
envoy -c bootstrap.yaml -l debug
curl -vL -H 'Host: httpbin.org' http://localhost:10000/status/201
vi cds.yaml (make changes and save)
curl -vL -H 'Host: httpbin.org' http://localhost:10000/status/201
The second request fails with "Sub cluster warming timeout".
Config:
Logs: