If a single cluster fails all other clusters are affected

aroundthecode commented 2 months ago

I have a single Argo CD instance configured with several (~50) remote clusters.

If one of the clusters stops working (in my case an HAProxy crash made the endpoint unreachable), all applications on that cluster continue to poll it for resources and start getting timeouts and "Watch error".

This causes high resource consumption, and the application-controller gets OOMKilled and restarted.

So a single cluster outage causes an outage across the whole set of clusters, since all other applications suffer from the resource consumption and restarts.

Expected behavior

If a cluster goes down for any reason, a sort of "circuit breaker" should trip for its applications, which should stop polling until the cluster is back to work.
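To make the request concrete, here is a minimal sketch of the kind of per-cluster circuit breaker I have in mind. This is not Argo CD code: `ClusterCircuitBreaker`, `poll_cluster`, the threshold and the cooldown values are all hypothetical names and numbers chosen for illustration. After a few consecutive failures the breaker opens and polling is skipped; once the cooldown has passed, a single probe is allowed, and a success closes the breaker again.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ClusterCircuitBreaker:
    """Hypothetical per-cluster circuit breaker (illustration only, not Argo CD code)."""

    failure_threshold: int = 3        # consecutive failures before the breaker opens
    cooldown_seconds: float = 30.0    # how long polling is skipped once open
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    failures: int = 0
    opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: poll normally
        # Open: only allow a probe once the cooldown has elapsed (half-open).
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open after a failed probe) and restart the cooldown.
            self.opened_at = self.clock()


def poll_cluster(breaker: ClusterCircuitBreaker, fetch: Callable[[], Any]) -> Any:
    """Call fetch() unless the breaker is open; return None when skipped or failed."""
    if not breaker.allow_request():
        return None  # cluster considered down: do not spend resources on it
    try:
        result = fetch()
    except (TimeoutError, ConnectionError):
        breaker.record_failure()
        return None
    breaker.record_success()
    return result
```

With one breaker per cluster, an unreachable endpoint would cost at most one probe per cooldown period instead of a continuous stream of timing-out requests, so the healthy clusters would not be starved.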

If this is not possible, what manual intervention could I perform to mitigate the problem on the other, working clusters?

andrii-korotkov-verkada commented 2 weeks ago

As a workaround, you can configure one application controller replica per cluster and tune the resources accordingly. That may not work as well if the clusters are of different sizes, though, since it can lead to resource over-use.

andrii-korotkov-verkada commented 2 weeks ago

What's your ArgoCD version?

argoproj / argo-cd

If a single cluster fails all other clusters are affected #19816