argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

external cluster TLS client cert has expired #19033

Open DonOtuseGH opened 4 months ago

DonOtuseGH commented 4 months ago

Describe the bug

We have repeatedly run into a situation where the connection from ArgoCD to an external cluster stops working (the UI shows an unknown state for all applications of the affected cluster). In the past, we fixed the problem with the procedure described here. Today we took a closer look at this recurring problem, gathered more detailed information, and we think we have found the actual cause.

To Reproduce

Error messages like this can be found in ArgoCD log for all applications:

argocd-application-controller-0 argocd-application-controller time="2024-07-12T08:02:21Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-07-11T20:17:21Z\",\"message\":\"Failed to load live state: failed to get cluster info for \\\"https://k8s-adm-222-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-07-12T08:02:21Z\",\"message\":\"Failed to load target state: failed to get cluster version for cluster \\\"https://k8s-adm-222-0010:6443\\\": failed to get cluster info for \\\"https://k8s-adm-222-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-07-12T08:02:21Z\",\"message\":\"error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"UnknownError\"}]}}" application=argocd/k8s-adm-222-0010--metrics-server

The kube-apiserver of the corresponding external cluster shows error messages like this for each ArgoCD connection attempt:

kube-apiserver-k8s-adm-222-0011 kube-apiserver E0711 20:05:26.136116       1 authentication.go:73] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2024-07-11T20:05:26Z is after 2024-07-11T14:30:51Z, verifying certificate SN=3514383209763152651, SKID=, AKID=67:85:CE:27:EA:FD:61:F8:89:53:EE:38:80:D0:D6:4B:41:4C:CA:43 failed: x509: certificate has expired or is not yet valid: current time 2024-07-11T20:05:26Z is after 2024-07-11T14:30:51Z]"
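
The SN= value and the "is after" timestamp in this message are what identify the rejected certificate. A minimal sketch for pulling both out of such a log line, assuming the message format shown above (the function names are hypothetical helpers, not part of any tool):

```shell
# Hypothetical helpers: read kube-apiserver log lines on stdin and print
# fields from "Unable to authenticate the request" x509 errors.
# Assumes the message format shown in the log line above.

# Print the decimal serial number of the rejected client certificate.
extract_cert_serial() {
  sed -n 's/.*verifying certificate SN=\([0-9][0-9]*\),.*/\1/p'
}

# Print the certificate's expiry time (the timestamp following "is after").
extract_cert_expiry() {
  sed -n 's/.*is after \([0-9T:Z-]*\).*/\1/p'
}
```

Piping the kube-apiserver log through extract_cert_serial gives the decimal serial that can later be compared with the certificate stored in the ArgoCD cluster secret.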

We thought we were using bearer token authentication between ArgoCD and the external clusters, but it seems we were wrong:

$ argocd login argocd
Username: admin
Password:
'admin:login' logged in successfully
Context 'argocd' updated

$ argocd cluster rotate-auth k8s-adm-222-0010
FATA[0000] rpc error: code = InvalidArgument desc = Cluster 'https://k8s-adm-222-0010:6443' does not use bearer token authentication

The ServiceAccount bearer token should be long-lived (see the annotation explained in this reference), but this does not seem to matter in this case. Just for your information:

$ kubectl describe secrets -n kube-system argocd-manager-token-n8qm2
Name:         argocd-manager-token-n8qm2
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: argocd-manager
              kubernetes.io/service-account.uid: 2ba34942-ca7d-49d4-92bf-e67e791c8955

Type:  kubernetes.io/service-account-token
...

While checking the ArgoCD cluster secrets, we found that the secret for this cluster includes a TLS client certificate in its config blob, and that certificate has expired:

$ kubectl describe secrets -n argocd cluster-k8s-adm-222-0010-2645299244
Name:         cluster-k8s-adm-222-0010-2645299244
Namespace:    argocd
Labels:       argocd.argoproj.io/secret-type=cluster
Annotations:  managed-by: argocd.argoproj.io

Type:  Opaque

Data
====
server:  39 bytes
config:  5313 bytes
name:    16 bytes

$ kubectl get secrets -n argocd cluster-k8s-adm-222-0010-2645299244 -o json | jq -r '.data|[.name, .config]|@tsv' | while read -r name config; do echo -n '### '; base64 -d <<< $name; echo; base64 -d <<< $config | jq -r .tlsClientConfig.certData | base64 -d | openssl x509 -noout -issuer -subject -dates -serial; done
### k8s-adm-222-0010
issuer=CN = kubernetes
subject=O = system:masters, CN = kubernetes-admin
notBefore=Jul 12 14:30:50 2023 GMT
notAfter=Jul 11 14:30:51 2024 GMT
serial=30C598E8C687A30B

$ hex2dec 30C598E8C687A30B
3514383209763152651

===> The certificate serial number matches the one from the external cluster's kube-apiserver error message. ===> It is the external cluster's kubernetes-admin certificate, which was used during the argocd cluster add operation.
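
Note that hex2dec is not a standard utility. A minimal equivalent could look like the sketch below, assuming a shell whose printf accepts 0x-prefixed integers and a serial that fits in a signed 64-bit integer (true for the serial shown here, not for arbitrary certificates):

```shell
# Convert a hexadecimal certificate serial (as printed by
# `openssl x509 -serial`) to the decimal form used in kube-apiserver logs.
# Assumption: the value fits in a signed 64-bit integer.
hex2dec() {
  printf '%d\n' "0x$1"
}

hex2dec 30C598E8C687A30B
```

For longer serials, a tool with arbitrary-precision arithmetic would be needed instead.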

Expected behavior

We would like either to authenticate with the long-lived ServiceAccount bearer token, or to have an option (ideally an automatic mechanism) that rotates the TLS client certificate.
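
For comparison, Argo CD's declarative cluster secrets accept a bearerToken field in the config blob in place of a client certificate in tlsClientConfig. A minimal sketch that renders such a config follows; the function name and its arguments are hypothetical, the token is assumed to contain no characters that need JSON escaping (as is the case for ServiceAccount JWTs), and we have not verified that switching an existing cluster this way avoids the problem described here:

```shell
# Hypothetical helper: print a cluster "config" blob that uses bearer token
# authentication instead of a TLS client certificate.
#   $1: ServiceAccount token of the external cluster (placeholder)
#   $2: base64-encoded CA certificate of the external cluster (placeholder)
render_bearer_config() {
  cat <<EOF
{
  "bearerToken": "$1",
  "tlsClientConfig": {
    "insecure": false,
    "caData": "$2"
  }
}
EOF
}
```

The output would go into the config key of the cluster secret, next to the existing name and server keys.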

Screenshots

Version

$ argocd version
argocd: v2.11.0+d3f33c0
  BuildDate: 2024-05-07T16:21:23Z
  GitCommit: d3f33c00197e7f1d16f2a73ce1aeced464b07175
  GitTreeState: clean
  GoVersion: go1.21.9
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.10.7+b060053
  BuildDate: 2024-04-15T08:45:08Z
  GitCommit: b060053b099b4c81c1e635839a309c9c8c1863e9
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.2.1 2023-10-19T20:13:51Z
  Helm Version: v3.14.3+gf03cc04
  Kubectl Version: v0.26.11
  Jsonnet Version: v0.20.0

Logs

see above...

Thank you very much for taking care of this issue. We would be pleased if you could give us a permanent solution.

DonOtuseGH commented 4 months ago

Do you need further information to investigate this issue?

DonOtuseGH commented 2 months ago

Is there anything we can contribute to further analysis, testing, or finding a solution?

andrii-korotkov-verkada commented 2 weeks ago

ArgoCD versions 2.10 and below have reached EOL. Can you upgrade and let us know if the issue is still present, please?

DonOtuseGH commented 2 weeks ago

What a coincidence - we are currently updating to version 2.12.7.

However, the earliest expiration among our cluster client certificates is about 211 days away, so we can't say for sure whether the problem still exists with the current version of ArgoCD.
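
To check ahead of time whether a client certificate is close to expiring, openssl x509 -checkend can be run on the PEM extracted from the cluster secret. A minimal sketch; the function name, the 30-day threshold, and the client.crt path are illustrative assumptions:

```shell
# Hypothetical helper: succeed (exit 0) if the PEM certificate on stdin
# expires within the given number of seconds, fail (exit 1) otherwise.
# Note: input that openssl cannot parse is also reported as expiring.
cert_expires_within() {
  if openssl x509 -noout -checkend "$1" >/dev/null 2>&1; then
    return 1   # still valid after the given window
  fi
  return 0     # expires within the window (or could not be read)
}

# Example: warn 30 days (2592000 seconds) ahead; client.crt is a placeholder.
# cert_expires_within 2592000 < client.crt && echo "certificate expires soon"
```

Run periodically against every cluster secret, a check like this would raise an alert before the connection breaks rather than after.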

In your opinion, which commit should have fixed the problem?

andrii-korotkov-verkada commented 2 weeks ago

Sorry, I don't know. I have ~1600 bugs to triage and label and can't triage all of them unfortunately.

DonOtuseGH commented 1 week ago

I overlooked one DEV cluster, which raised an alert today for an expired ArgoCD TLS client certificate. Could you please remove the version: EOL label and investigate the issue, since it is not solved in the latest version of ArgoCD? Thank you!

Please find up-to-date information below:

ArgoCD version:

argocd: v2.12.7+4d70c51
  BuildDate: 2024-11-05T15:30:59Z
  GitCommit: 4d70c51e64e534ffe656c45317037b2bcdaa69f9
  GitTreeState: clean
  GoVersion: go1.22.4
  Compiler: gc
  Platform: linux/amd64

Example ArgoCD log message:

argocd-application-controller-0 argocd-application-controller time="2024-11-15T14:05:14Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-11-15T13:35:14Z\",\"message\":\"Failed to load live state: failed to get cluster info for \\\"https://k8s-adm-901-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-11-15T14:05:14Z\",\"message\":\"Failed to load target state: failed to get cluster version for cluster \\\"https://k8s-adm-901-0010:6443\\\": failed to get cluster info for \\\"https://k8s-adm-901-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-11-15T14:05:14Z\",\"message\":\"error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"UnknownError\"}]}}" app-namespace=argocd app-qualified-name=argocd/k8s-adm-901-0010--metrics-server application=k8s-adm-901-0010--metrics-server project=default

Example kube-apiserver log message:

kube-apiserver-k8s-adm-901-0011 kube-apiserver E1115 14:05:14.858137       1 authentication.go:73] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2024-11-15T14:05:14Z is after 2024-11-15T10:11:37Z, verifying certificate SN=4813121073675563764, SKID=, AKID=23:56:FA:C8:E8:A5:9A:91:89:97:89:3C:FA:97:D4:E8:E9:AB:0E:15 failed: x509: certificate has expired or is not yet valid: current time 2024-11-15T14:05:14Z is after 2024-11-15T10:11:37Z]"

Long-lived ServiceAccount/Bearer Token with annotation:

$ kubectl describe secrets -n kube-system argocd-manager-token-8vcb2
Name:         argocd-manager-token-8vcb2
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: argocd-manager
              kubernetes.io/service-account.uid: 95d26c99-6ea7-4673-803f-d81a1e20f16c
...

ArgoCD secret with expired TLS client certificate in the config blob:

$ kubectl describe secrets -n argocd cluster-k8s-adm-901-0010.tbadm.net-16864154
Name:         cluster-k8s-adm-901-0010.tbadm.net-16864154
Namespace:    argocd
Labels:       argocd.argoproj.io/secret-type=cluster
Annotations:  managed-by: argocd.argoproj.io

Type:  Opaque

Data
====
config:  5317 bytes
name:    16 bytes
server:  39 bytes

$ kubectl get secrets -n argocd cluster-k8s-adm-901-0010.tbadm.net-16864154 -o json | jq -r '.data|[.name, .config]|@tsv' | while read -r name config; do echo -n '### '; base64 -d <<< $name; echo; base64 -d <<< $config | jq -r .tlsClientConfig.certData | base64 -d | openssl x509 -noout -issuer -subject -dates -serial; done
### k8s-adm-901-0010
issuer=CN = kubernetes
subject=O = system:masters, CN = kubernetes-admin
notBefore=Dec 19 09:57:38 2022 GMT
notAfter=Nov 15 10:11:37 2024 GMT
serial=42CBA41D9160F2F4

$ hex2dec 42CBA41D9160F2F4
4813121073675563764

===> certificate serial number matches with the one from the external cluster kube-apiserver error message