I have observed that if the `server` key of an existing cluster is modified, the application controller assigned to that cluster keeps reporting on the old endpoint, seemingly indefinitely, until the application-controller pod is re-created.
To Reproduce
1. Declare a cluster called `foo` pointing to `server1` (e.g. via a declarative cluster Secret; a sketch follows this list).
2. Declare an application deploying to cluster `foo`.
3. Change the `server` key of cluster `foo` and point it to endpoint `server2`.
4. Sync the application so it deploys to the new endpoint of cluster `foo` (`server2`).
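For reference, declaring cluster `foo` declaratively looks roughly like this (names and URLs are made up; the label and the `server`/`config` keys follow Argo CD's documented cluster-Secret format):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-foo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
stringData:
  name: foo
  server: https://server1.example.com  # step 3 changes this to server2's URL
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": {
        "insecure": false
      }
    }
```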
The application controller's metrics endpoint now exposes two series, one per endpoint (some labels removed):
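A minimal reconstruction of that output, assuming the standard `server` label on `argocd_cluster_connection_status` and omitting all other labels:

```
argocd_cluster_connection_status{server="server1"} 0
argocd_cluster_connection_status{server="server2"} 1
```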
The first sample has a value of 0 because, in my case, that endpoint was disabled immediately after changing Argo CD's configuration.
Expected behavior
I think no status should be reported for `server1` once the endpoint is no longer known to Argo CD (or at least once it is no longer needed). Deleting the pod of the offending application controller (to force a re-creation) restores the desired behavior: the series for `server1` is immediately gone from the metrics endpoint and only information about `server2` is reported (as expected, IMO).
The impact of this bug is that monitoring `argocd_cluster_connection_status` and alerting when any value is 0 can trigger false positives: once the cluster configuration is changed, the old endpoint may go away at any moment (e.g. because the cluster behind it is deleted), and its stale series then reports 0 (see `server1` above) even though nothing is actually wrong. In other words, `argocd_cluster_connection_status` might not be trustworthy due to this behavior.
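For illustration, a typical alerting rule of this kind (the rule name and `for` duration here are hypothetical) would keep firing on the stale `server1` series:

```yaml
groups:
  - name: argocd
    rules:
      # Fires for every cluster whose connection status is 0 --
      # including the stale series left behind for server1.
      - alert: ArgoCDClusterConnectionDown
        expr: argocd_cluster_connection_status == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Argo CD cannot connect to cluster {{ $labels.server }}"
```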
Would that metric have been purged, without restarting the controller, if I had set `--metrics-cache-expiration` (disabled by default)? If so, could the documentation be updated to mention that cluster endpoint information is also "cached" and its status reported? Otherwise, please consider this report a bug.
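If that flag does apply here, setting it would look roughly like this (a sketch of the application controller's workload manifest; the `24h` value is arbitrary):

```yaml
# Excerpt from the argocd-application-controller workload spec (sketch).
containers:
  - name: argocd-application-controller
    command:
      - argocd-application-controller
      - --metrics-cache-expiration
      - 24h
```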
Is there some kind of event we can watch for these changes? For example, when Pods change there's a Kubernetes watch event that we process; I wonder if we can do something similar here (see the sketch below).
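Since Argo CD stores clusters as Secrets labeled `argocd.argoproj.io/secret-type=cluster`, one could watch those the same way. A minimal client-go sketch, not Argo CD's actual implementation (the namespace, resync period, and metric-cleanup hook are assumptions):

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Informer restricted to cluster Secrets in the argocd namespace
	// (namespace and resync period are assumptions for this sketch).
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute,
		informers.WithNamespace("argocd"),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = "argocd.argoproj.io/secret-type=cluster"
		}),
	)

	factory.Core().V1().Secrets().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldSecret, newSecret := oldObj.(*corev1.Secret), newObj.(*corev1.Secret)
			if string(oldSecret.Data["server"]) != string(newSecret.Data["server"]) {
				// This is where stale series for the old endpoint
				// could be dropped from the metrics registry.
				fmt.Printf("cluster endpoint changed: %s -> %s\n",
					oldSecret.Data["server"], newSecret.Data["server"])
			}
		},
		DeleteFunc: func(obj interface{}) {
			if secret, ok := obj.(*corev1.Secret); ok {
				// Likewise for a deleted cluster.
				fmt.Printf("cluster removed: %s\n", secret.Data["server"])
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // run until killed
}
```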