argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

argocd_cluster_connection_status still reporting on old endpoint after modifying cluster settings #20782

Open nbarrientos opened 1 week ago

nbarrientos commented 1 week ago

I have observed that if the server key of an existing cluster is modified, the application controller assigned to that cluster keeps reporting on the old endpoint, seemingly indefinitely, until the application-controller pod is re-created.

To Reproduce

Modify the server key of an existing cluster (in my case, from https://server1:6443 to https://server2:6443). After the change, the application controller metrics endpoint exposes two series, reporting on both endpoints (some labels removed):

argocd_cluster_connection_status{
  container="application-controller", 
  endpoint="http-metrics", 
  job="argocd-application-controller-metrics", 
  k8s_version="1.31", 
  namespace="argocd", 
  pod="argo-argocd-application-controller-1", 
  server="https://server1:6443", 
  service="argo-argocd-application-controller-metrics"} 0

argocd_cluster_connection_status{
  container="application-controller", 
  endpoint="http-metrics", 
  job="argocd-application-controller-metrics", 
  k8s_version="1.31", 
  namespace="argocd", 
  pod="argo-argocd-application-controller-1", 
  server="https://server2:6443", 
  service="argo-argocd-application-controller-metrics"} 1

The first sample has a value of 0 because, in my case, that endpoint was disabled immediately after Argo CD's configuration was changed.

Expected behavior

I think that no status should be reported for server1 as soon as the endpoint is no longer known to Argo CD (or at least as soon as it is no longer needed). Deleting the pod of the offending application controller (to force a re-creation) restores the desired behavior: the metric for server1 is immediately gone from the metrics endpoint and only information about server2 is reported (as expected, IMO).

The impact of this bug is that monitoring argocd_cluster_connection_status and alerting when any value is 0 can trigger false positives: once the cluster configuration is changed, the old endpoint could stop responding at any moment because the cluster behind it is being deleted (see server1 reporting 0 above). In other words, argocd_cluster_connection_status cannot be fully trusted because of this behavior.
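To make the false positive concrete, here is a minimal sketch of the kind of check an alert rule would perform, using the Prometheus Go client (the Prometheus address is a placeholder and the expression is only an example, not taken from my actual rules):

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Placeholder address; point this at the Prometheus that scrapes Argo CD.
    client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
    if err != nil {
        log.Fatal(err)
    }
    promAPI := v1.NewAPI(client)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // A naive "cluster connection is down" check. After the server key is
    // changed, the stale https://server1:6443 series still matches and the
    // check fires even though that cluster is gone from Argo CD's config.
    result, warnings, err := promAPI.Query(ctx, "argocd_cluster_connection_status == 0", time.Now())
    if err != nil {
        log.Fatal(err)
    }
    if len(warnings) > 0 {
        log.Println(warnings)
    }
    fmt.Println(result)
}

The query output keeps including the server1 series until the application-controller pod is re-created.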

Would that metric have been purged without restarting the controller if I had set --metrics-cache-expiration (disabled by default)? If so, could the documentation be updated to explain that cluster endpoint information is also "cached" and its status keeps being reported? Otherwise, please consider this report a bug.

Version

argocd-server: v2.12.6+4dab5bd
  BuildDate: 2024-10-18T17:39:26Z
  GitCommit: 4dab5bd6a60adea12e084ad23519e35b710060a2
  GitTreeState: clean
  GoVersion: go1.22.4
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.4.2 2024-05-22T15:19:38Z
  Helm Version: v3.15.2+g1a500d5
  Kubectl Version: v0.29.6
  Jsonnet Version: v0.20.0
andrii-korotkov-verkada commented 5 days ago

Is there some kind of event we can watch for these changes? For example, when Pods change, there's a Kubernetes watch event that we process. I wonder if we can do something similar here.
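If such an event is available, deleting the stale series when it fires would make the metric consistent without a controller restart. A minimal sketch with prometheus/client_golang, assuming the connection status is kept in a GaugeVec keyed by (among others) the server label; the real Argo CD metrics code may be organized differently, and DeletePartialMatch requires client_golang v1.12 or newer:

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/testutil"
)

// Hypothetical stand-in for the controller's gauge; the real metric carries
// more labels than these two.
var clusterConnectionStatus = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "argocd_cluster_connection_status",
        Help: "Cluster connection status (sketch).",
    },
    []string{"server", "k8s_version"},
)

// onClusterServerChanged would be called from whatever watch/event handler
// notices that a cluster's server URL changed or the cluster was removed.
// Deleting the stale series takes it off /metrics without restarting the
// controller; the next reconciliation re-registers the new endpoint.
func onClusterServerChanged(oldServer string) {
    clusterConnectionStatus.DeletePartialMatch(prometheus.Labels{"server": oldServer})
}

func main() {
    // Simulate the situation from this issue: server1 was replaced by server2.
    clusterConnectionStatus.WithLabelValues("https://server1:6443", "1.31").Set(0)
    clusterConnectionStatus.WithLabelValues("https://server2:6443", "1.31").Set(1)

    onClusterServerChanged("https://server1:6443")

    // Only the server2 series is left to be exported.
    fmt.Println(testutil.CollectAndCount(clusterConnectionStatus)) // 1
}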