update to 2.12.3 causes managed clusters and applications to go into an unknown state (HA mode)

hjsauce commented 1 week ago

Checklist:

[x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
[x] I've included steps to reproduce the bug.
[x] I've pasted the output of argocd version.

Describe the bug

After an update to 2.12.3, I noticed that some of the clusters were in an unknown state. Note that this does not happen immediately. It happens over time.

I am using ArgoCD in HA mode with consistent-hashing . I have configured 3 replicas for the controller.

- name: ARGOCD_CONTROLLER_REPLICAS                                                                                                                                                                                                                                                                                       
   value: "3"

I am currently able to fix the issue by restarting the application-controller statefulset.

To Reproduce

Update to 2.12.3
Wait for some time

Expected behavior

ArgoCD works as expected and I can sync applications

Screenshots Cluster errors Application errors

After restart of the controller statefulset

Version

argocd: v2.12.3+6b9cd82
  BuildDate: 2024-08-27T15:33:41Z
  GitCommit: 6b9cd828c6e9807398869ad5ac44efd2c28422d6
  GitTreeState: clean
  GoVersion: go1.23.0
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.12.3+6b9cd82
  BuildDate: 2024-08-27T11:57:48Z
  GitCommit: 6b9cd828c6e9807398869ad5ac44efd2c28422d6
  GitTreeState: clean
  GoVersion: go1.22.4
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.4.2 2024-05-22T15:19:38Z
  Helm Version: v3.15.2+g1a500d5
  Kubectl Version: v0.29.6
  Jsonnet Version: v0.20.0

Logs

time="2024-09-09T08:25:39Z" level=error msg="finished unary call with code Unknown" error="error getting cached app resource tree: ComparisonError: Failed to load live state: failed to get cluster info for \"https://10.19.0.18\": error getting cluster: controller is configured to ignore cluster https://10.19.0.18" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService grpc.start_time="2024-09-09T08:25:39Z" grpc.time_ms=2.143 span.kind=server system=grpc

dmarquez-splunk commented 1 week ago

Also seeing this issue and seems related to this issue: https://github.com/argoproj/argo-cd/issues/19854

jjaygohil commented 1 day ago

Facing the same issue. We have an alert to monitor argocd_cluster_connection_status metric to detect this situation , however the metrics also stops reporting for the affected cluster when this occurs. Hence alert is not firing. We are using consistent-hashing for sharding.

argoproj / argo-cd

update to 2.12.3 causes managed clusters and applications to go into an unknown state (HA mode) #19848