argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

If a single cluster fails, all other clusters are affected #19816

Open aroundthecode opened 2 months ago

aroundthecode commented 2 months ago

I have a single Argo CD instance configured with several (~50) remote clusters.

If one of the clusters stops working (in my case, an HAProxy crash made the endpoint unreachable), all applications on that cluster continue to poll it for resources and start getting timeouts and "Watch error" messages.

This causes high resource consumption, and the application-controller gets OOM-killed and restarted.


So basically a single cluster outage causes an outage across the whole set of clusters, since all other applications suffer from the resource consumption and restarts.

Expected behavior: if a cluster goes down for any reason, a sort of "circuit breaker" should trip for its applications, which should stop polling until the cluster is back up.

If this is not possible, what manual intervention could I perform to mitigate the problem on the other, working clusters?
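
To make the requested behavior concrete, here is a minimal sketch of a per-cluster circuit breaker. It is not Argo CD code and does not reflect how the application-controller is implemented; the class name, `poll_cluster`, the `fetch` callable, and the threshold/cooldown values are all hypothetical. The idea is that after a few consecutive failures the cluster is skipped entirely, and only a single probe is let through once a cooldown has elapsed.

```python
import time
from typing import Callable, Optional


class ClusterCircuitBreaker:
    """Hypothetical per-cluster breaker: closed -> open -> half-open."""

    def __init__(
        self,
        failure_threshold: int = 3,        # assumed value, for illustration
        cooldown_seconds: float = 30.0,    # assumed value, for illustration
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means "closed"

    def allow_request(self) -> bool:
        """Return True if the cluster may be polled right now."""
        if self._opened_at is None:
            return True
        # Open: block until the cooldown has elapsed, then allow a probe.
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def poll_cluster(breaker: ClusterCircuitBreaker, fetch: Callable[[], object]):
    """Call fetch() only if the breaker allows it; return None when skipped or failed."""
    if not breaker.allow_request():
        return None  # cluster is considered down: do not spend resources on it
    try:
        result = fetch()
    except (TimeoutError, ConnectionError):
        breaker.record_failure()
        return None
    breaker.record_success()
    return result
```

With one breaker per cluster, a failed probe after the cooldown re-opens the breaker for another full cooldown, so an unreachable cluster costs at most one request per cooldown period instead of a continuous stream of timing-out polls.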

andrii-korotkov-verkada commented 2 weeks ago

As a workaround, you can configure one application controller replica per cluster and tune the resources accordingly. That may not work as well if the clusters are of different sizes though, since it can lead to resource over-use.
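
A rough way to see the over-use concern, assuming all controller replicas share a single resource setting (as replicas created from one pod template do), so each replica has to be sized for the largest cluster. The function names and the memory figures below are made up for illustration, not measurements.

```python
from typing import Sequence


def uniform_replica_total(per_cluster_need_mib: Sequence[int]) -> int:
    """Total memory requested when every replica is sized for the largest cluster."""
    if not per_cluster_need_mib:
        raise ValueError("at least one cluster is required")
    return max(per_cluster_need_mib) * len(per_cluster_need_mib)


def over_provisioned_mib(per_cluster_need_mib: Sequence[int]) -> int:
    """Memory requested beyond what the clusters actually need, in MiB."""
    return uniform_replica_total(per_cluster_need_mib) - sum(per_cluster_need_mib)


# Hypothetical example: one large cluster and two small ones.
needs = [4096, 512, 512]
print(uniform_replica_total(needs))  # 12288
print(over_provisioned_mib(needs))   # 7168
```

The more the cluster sizes differ, the larger the gap between what is requested and what is needed.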

andrii-korotkov-verkada commented 2 weeks ago

What's your ArgoCD version?