Open DonOtuseGH opened 4 months ago
Do you need further information to investigate this issue?
Is there anything we can contribute to further analyzing, testing or finding a solution?
ArgoCD versions 2.10 and below have reached EOL. Can you upgrade and let us know if the issue is still present, please?
What a coincidence - we are currently updating to version 2.12.7.
However, the earliest expiration date of our cluster client certificates is about 211 days away, so we can't say for sure whether the problem still exists with the current version of ArgoCD.
In your opinion, which commit should have fixed the problem?
Sorry, I don't know. I have ~1600 bugs to triage and label and can't triage all of them unfortunately.
I had overlooked one DEV cluster, which triggered an alert today for an expired ArgoCD TLS client certificate. So please, can you remove the version: EOL label and investigate the issue, as it is not solved with the latest version of ArgoCD? Thank you!
Please find up-to-date information below:
ArgoCD version:
argocd: v2.12.7+4d70c51
BuildDate: 2024-11-05T15:30:59Z
GitCommit: 4d70c51e64e534ffe656c45317037b2bcdaa69f9
GitTreeState: clean
GoVersion: go1.22.4
Compiler: gc
Platform: linux/amd64
Example ArgoCD log message:
argocd-application-controller-0 argocd-application-controller time="2024-11-15T14:05:14Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-11-15T13:35:14Z\",\"message\":\"Failed to load live state: failed to get cluster info for \\\"https://k8s-adm-901-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-11-15T14:05:14Z\",\"message\":\"Failed to load target state: failed to get cluster version for cluster \\\"https://k8s-adm-901-0010:6443\\\": failed to get cluster info for \\\"https://k8s-adm-901-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-11-15T14:05:14Z\",\"message\":\"error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"UnknownError\"}]}}" app-namespace=argocd app-qualified-name=argocd/k8s-adm-901-0010--metrics-server application=k8s-adm-901-0010--metrics-server project=default
Example kube-apiserver log message:
kube-apiserver-k8s-adm-901-0011 kube-apiserver E1115 14:05:14.858137 1 authentication.go:73] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2024-11-15T14:05:14Z is after 2024-11-15T10:11:37Z, verifying certificate SN=4813121073675563764, SKID=, AKID=23:56:FA:C8:E8:A5:9A:91:89:97:89:3C:FA:97:D4:E8:E9:AB:0E:15 failed: x509: certificate has expired or is not yet valid: current time 2024-11-15T14:05:14Z is after 2024-11-15T10:11:37Z]"
long-lived ServiceAccount/Bearer Token with annotation:
$ kubectl describe secrets -n kube-system argocd-manager-token-8vcb2
Name: argocd-manager-token-8vcb2
Namespace: kube-system
Labels: <none>
Annotations: kubernetes.io/service-account.name: argocd-manager
kubernetes.io/service-account.uid: 95d26c99-6ea7-4673-803f-d81a1e20f16c
...
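As a cross-check, the long-lived token itself can be tested directly against the external kube-apiserver. A rough sketch, assuming the current kubectl context points at the external cluster and its API endpoint is reachable from the shell:
# extract the token from the ServiceAccount secret shown above
$ TOKEN=$(kubectl get secret -n kube-system argocd-manager-token-8vcb2 -o jsonpath='{.data.token}' | base64 -d)
# call the kube-apiserver with the bearer token; a version JSON response means the token itself still authenticates
$ curl -sk -H "Authorization: Bearer ${TOKEN}" https://k8s-adm-901-0010:6443/version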
ArgoCD secret with expired TLS client certificate in the config blob:
$ kubectl describe secrets -n argocd cluster-k8s-adm-901-0010.tbadm.net-16864154
Name: cluster-k8s-adm-901-0010.tbadm.net-16864154
Namespace: argocd
Labels: argocd.argoproj.io/secret-type=cluster
Annotations: managed-by: argocd.argoproj.io
Type: Opaque
Data
====
config: 5317 bytes
name: 16 bytes
server: 39 bytes
$ kubectl get secrets -n argocd cluster-k8s-adm-901-0010.tbadm.net-16864154 -o json | jq -r '.data|[.name, .config]|@tsv' | while read -r name config; do echo -n '### '; base64 -d <<< $name; echo; base64 -d <<< $config | jq -r .tlsClientConfig.certData | base64 -d | openssl x509 -noout -issuer -subject -dates -serial; done
### k8s-adm-901-0010
issuer=CN = kubernetes
subject=O = system:masters, CN = kubernetes-admin
notBefore=Dec 19 09:57:38 2022 GMT
notAfter=Nov 15 10:11:37 2024 GMT
serial=42CBA41D9160F2F4
$ hex2dec 42CBA41D9160F2F4
4813121073675563764
===> the certificate serial number matches the one from the external cluster's kube-apiserver error message
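For reference, here is a rough sketch (assuming every cluster secret stores the client certificate under .tlsClientConfig.certData, as above) to list the expiry date of the embedded certificate for all ArgoCD cluster secrets at once:
$ kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster -o json | jq -r '.items[].data|[.name, .config]|@tsv' | while read -r name config; do echo -n '### '; base64 -d <<< $name; echo; base64 -d <<< $config | jq -r '.tlsClientConfig.certData // empty' | base64 -d | openssl x509 -noout -enddate 2>/dev/null || echo 'no TLS client certificate in config'; done
The decimal serial can also be derived without hex2dec, e.g. printf '%d\n' 0x42CBA41D9160F2F4 prints 4813121073675563764.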
Describe the bug
We have encountered a situation a few times where the connection from ArgoCD to an external cluster no longer works (the UI shows an unknown state for all applications of the corresponding cluster). In the past, we fixed the problem with the procedure described here. Today we took a closer look at this recurring problem, gathered some more detailed information about the situation, and we think we have found the "real" cause.
To Reproduce
Error messages like the ArgoCD log example shown above appear for all applications of the affected cluster.
The kube-apiserver of the corresponding external cluster logs error messages like the example shown above for each ArgoCD connection attempt.
We thought we were using bearer token authentication between ArgoCD and the external clusters, but it seems we were wrong.
The ServiceAccount/Bearer Token should be long-lived (see the annotation explained in this reference), but that does not seem to matter in this case. Just for your information, the corresponding secret is shown in the describe output above.
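For reference, a long-lived token secret of this kind is declared roughly as follows, assuming the documented kubernetes.io/service-account-token mechanism (the secret name here is only a hypothetical example):
$ kubectl apply -n kube-system -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: argocd-manager-long-lived-token   # hypothetical name, any name works
  annotations:
    kubernetes.io/service-account.name: argocd-manager
type: kubernetes.io/service-account-token
EOF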
While checking the ArgoCD cluster secret we found that its config blob contains a TLS client certificate, which has expired (see the openssl output above).
===> the certificate serial number matches the one from the external cluster's kube-apiserver error message
===> it is the same kubernetes-admin certificate of the external cluster that was used during the argocd cluster add operation
Expected behavior
We would like either to use authentication based on the long-lived ServiceAccount/Bearer Token, or to have an option, ideally an automatic mechanism, that rotates the TLS client certificate.
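For illustration, if the cluster secret relied on bearer-token authentication only, the decoded config blob would look roughly like this (a sketch following the declarative cluster secret format from the ArgoCD documentation, with placeholders instead of real values):
{
  "bearerToken": "<token of the argocd-manager ServiceAccount>",
  "tlsClientConfig": {
    "insecure": false,
    "caData": "<base64-encoded CA certificate of the external cluster>"
  }
}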
Thank you very much for taking care of this issue. We would be pleased if you could give us a permanent solution.