Describe the bug
Several times now, the connection from ArgoCD to an external cluster has stopped working (the UI shows an unknown state for all applications on the affected cluster). In the past, we fixed the problem with the procedure described here. Today we took a closer look at this recurring problem, gathered more detailed information about the situation, and we think we have found the "real" cause.
To Reproduce
Error messages like this can be found in the ArgoCD log for all applications:
argocd-application-controller-0 argocd-application-controller time="2024-07-12T08:02:21Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-07-11T20:17:21Z\",\"message\":\"Failed to load live state: failed to get cluster info for \\\"https://k8s-adm-222-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-07-12T08:02:21Z\",\"message\":\"Failed to load target state: failed to get cluster version for cluster \\\"https://k8s-adm-222-0010:6443\\\": failed to get cluster info for \\\"https://k8s-adm-222-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-07-12T08:02:21Z\",\"message\":\"error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"UnknownError\"}]}}" application=argocd/k8s-adm-222-0010--metrics-server
The kube-apiserver of the corresponding external cluster shows error messages like this for each ArgoCD connection attempt:
kube-apiserver-k8s-adm-222-0011 kube-apiserver E0711 20:05:26.136116 1 authentication.go:73] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2024-07-11T20:05:26Z is after 2024-07-11T14:30:51Z, verifying certificate SN=3514383209763152651, SKID=, AKID=67:85:CE:27:EA:FD:61:F8:89:53:EE:38:80:D0:D6:4B:41:4C:CA:43 failed: x509: certificate has expired or is not yet valid: current time 2024-07-11T20:05:26Z is after 2024-07-11T14:30:51Z]"
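To make the kube-apiserver message above easier to compare with a certificate inspected by other tools, here is a minimal Python sketch that pulls the expiry time and serial number out of such a line. The helper name `describe_expiry` and the regular expression are illustrative assumptions, not part of ArgoCD or Kubernetes. The sketch also assumes the `SN=` value is printed in decimal, as it appears to be here, and converts it to hexadecimal because tools such as openssl usually print serial numbers in hex.

```python
import re
from datetime import datetime, timezone

# Excerpt of the kube-apiserver error quoted above (trailing fields omitted).
LOG_LINE = (
    'err="[x509: certificate has expired or is not yet valid: '
    "current time 2024-07-11T20:05:26Z is after 2024-07-11T14:30:51Z, "
    'verifying certificate SN=3514383209763152651, SKID=, AKID=67:85:CE:27"'
)

# Hypothetical pattern: captures both timestamps and the decimal serial number.
PATTERN = re.compile(
    r"current time (?P<now>\S+) is after (?P<not_after>[^,\]\s]+)"
    r".*?SN=(?P<serial>\d+)"
)


def parse_ts(value: str) -> datetime:
    """Parse a timestamp such as 2024-07-11T14:30:51Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def describe_expiry(line: str) -> dict:
    """Return expiry time, time since expiry, and hex serial from an x509 error line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like an x509 expiry error")
    now = parse_ts(match["now"])
    not_after = parse_ts(match["not_after"])
    return {
        "not_after": not_after,
        "expired_for": now - not_after,
        # Assumption: the log prints the serial in decimal; convert to hex.
        "serial_hex": format(int(match["serial"]), "X"),
    }


info = describe_expiry(LOG_LINE)
print(info["not_after"].isoformat())
print(info["expired_for"])  # 5:34:35
print(info["serial_hex"])
```

For the line above, the certificate had already been expired for about five and a half hours when this request was rejected.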
We thought that we were using bearer token authentication between ArgoCD and the external clusters, but it seems we were wrong:
$ argocd login argocd
Username: admin
Password:
'admin:login' logged in successfully
Context 'argocd' updated
$ argocd cluster rotate-auth k8s-adm-222-0010
FATA[0000] rpc error: code = InvalidArgument desc = Cluster 'https://k8s-adm-222-0010:6443' does not use bearer token authentication
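The rotate-auth error above indicates that the stored cluster credentials are not a bearer token. As a rough way to see which kind of credential a cluster secret carries, the following Python sketch classifies the decoded `config` JSON of an ArgoCD cluster secret. The field names (`bearerToken`, `tlsClientConfig.certData`, `tlsClientConfig.keyData`, `execProviderConfig`, `awsAuthConfig`, `username`) follow the declarative cluster secret format as we understand it. The function name `auth_mode`, the precedence order, and the sample values are illustrative assumptions.

```python
import json


def auth_mode(config_json: str) -> str:
    """Guess the authentication mode from a decoded cluster secret `config` blob."""
    config = json.loads(config_json)
    tls = config.get("tlsClientConfig") or {}
    if config.get("bearerToken"):
        return "bearer-token"
    if tls.get("certData") and tls.get("keyData"):
        # A client certificate has a fixed expiry date.
        return "client-certificate"
    if config.get("execProviderConfig") or config.get("awsAuthConfig"):
        return "external-provider"
    if config.get("username"):
        return "basic-auth"
    return "unknown"


# Made-up sample resembling a config that carries a TLS client certificate.
sample = json.dumps(
    {
        "tlsClientConfig": {
            "insecure": False,
            "caData": "PGJhc2U2ND4=",
            "certData": "PGJhc2U2ND4=",
            "keyData": "PGJhc2U2ND4=",
        }
    }
)
print(auth_mode(sample))  # client-certificate
```

A result of `client-certificate` would be consistent with the rotate-auth failure shown above, since that command only applies to bearer token credentials.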
The ServiceAccount/Bearer Token should be long-lived, see the annotation explained in this reference, but this does not seem to matter in this case. Just for your information:
While checking the ArgoCD secrets, we found that the config blob includes a TLS client certificate, which has expired:
===> the certificate serial number matches the one from the external cluster kube-apiserver error message
===> it is the same certificate of the external cluster's kubernetes-admin that was used during the argocd cluster add operation
Expected behavior
We would like either to authenticate with the long-lived ServiceAccount/Bearer Token, or to have an option (ideally an automatic mechanism) that rotates the TLS client certificate.
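For illustration, the bearer-token variant we have in mind would store a `config` blob shaped roughly like the one built below. This is only a sketch under assumptions: the field names follow the declarative cluster secret format as we understand it, the helper `bearer_token_config` is hypothetical, and the token and CA values are placeholders. It is not a confirmed fix.

```python
import base64
import json


def bearer_token_config(token: str, ca_pem: bytes) -> str:
    """Build a cluster secret `config` JSON that uses a bearer token, not a client cert."""
    config = {
        "bearerToken": token,
        "tlsClientConfig": {
            "insecure": False,
            # The CA certificate is stored base64-encoded.
            "caData": base64.b64encode(ca_pem).decode("ascii"),
        },
    }
    return json.dumps(config, indent=2)


# Placeholder values; a real setup would use the ServiceAccount token and cluster CA.
blob = bearer_token_config(
    "<service-account-token>",
    b"-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n",
)
print(blob)
```

With a blob like this there is no `certData`/`keyData` pair, so no client certificate expiry is involved. The lifetime of the credential then depends on the ServiceAccount token itself.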
Screenshots
Version
Logs
Thank you very much for taking care of this issue. We would be pleased if you could give us a permanent solution.