argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.75k stars 5.41k forks source link

Start watch {...} Failed to watch {...} loop when target cluster paused #19182

Open ktomaszx opened 3 months ago

ktomaszx commented 3 months ago

Env Two eks clusters in seperate VPCs. eks1 with argocd eks2 with vClusters and apps deployed inside

Describe the bug I'm facing an issue with argocd controller which logs hundreds of Start watch {...} Failed to watch {...} It causes that NAT gateway active connections between VPCs are enormous, ~10-15x higher. The issue is visible each time vCluster is paused. As soon as argocd controller is restarted issue is gone. (no more Start watch... Failed to watch... logs and NAT active connections back to normal level)

To Reproduce Make target cluster unavailable (in my case vCluster paused) and observe controller logs. Then restart controller pod.

Expected behavior argocd controller doesn't need to be restarted when target cluster is unavailable

Version v2.11.2

Logs

level=info msg="Start watch WorkloadGroup.networking.istio.io on https://..." server="https://..."
level=info msg="Failed to watch AuthorizationPolicy.security.istio.io on https://...: Resyncing AuthorizationPolicy.security.istio.io on https://... due to timeout, retrying in 1s" server="https://..."
level=info msg="Start watch Predictor.serving.kserve.io on https://..." server="https://..."
level=info msg="Failed to watch ClusterIssuer.cert-manager.io on https://...: Resyncing ClusterIssuer.cert-manager.io on https://... due to timeout, retrying in 1s" server="https://..."
level=info msg="Failed to watch AllowedImages.constraints.gatekeeper.sh on https://...: Resyncing AllowedImages.constraints.gatekeeper.sh on https://... due to timeout, retrying in 1s" server="https://..."
level=info msg="Failed to watch ControllerRevision.apps on https://...: Resyncing ControllerRevision.apps on https://... due to timeout, retrying in 1s" server="https://..."
sambo2021 commented 1 week ago

hi Iam facing the same issue in argocd v2.12 my case is Argocd deployed on eks-v1.29 and deploy app of apps pattern deploying apps to its own cluster and resources of final apps like (deploy,config,secret,sts,...) to remote cluster eks-v1.30 and while getting this logs of "failed watch" some of apps got timeout syncying to remote cluster note: using sharding on 2 replica sts argocd controller, own cluster as shard 0 and remote cluster as shard 1 note: argocd on cluster-v1.29 is reaching remote cluster-v1.30 using privatelink