argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

external cluster TLS client cert has expired #19033

Open DonOtuseGH opened 4 months ago

DonOtuseGH commented 4 months ago

Describe the bug

We have encountered a situation a few times where the connection from ArgoCD to an external cluster no longer works (the UI shows an unknown state for all applications of the corresponding cluster). In the past, we fixed the problem with the procedure described here. Today we took a closer look at this recurring problem, gathered some more detailed information about the situation, and we think we have found the "real" cause.

To Reproduce

Error messages like this can be found in the ArgoCD logs for all applications:

argocd-application-controller-0 argocd-application-controller time="2024-07-12T08:02:21Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-07-11T20:17:21Z\",\"message\":\"Failed to load live state: failed to get cluster info for \\\"https://k8s-adm-222-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-07-12T08:02:21Z\",\"message\":\"Failed to load target state: failed to get cluster version for cluster \\\"https://k8s-adm-222-0010:6443\\\": failed to get cluster info for \\\"https://k8s-adm-222-0010:6443\\\": error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-07-12T08:02:21Z\",\"message\":\"error synchronizing cache state : the server has asked for the client to provide credentials\",\"type\":\"UnknownError\"}]}}" application=argocd/k8s-adm-222-0010--metrics-server

The kube-apiserver of the corresponding external cluster shows error messages like this for each ArgoCD connection attempt:

kube-apiserver-k8s-adm-222-0011 kube-apiserver E0711 20:05:26.136116       1 authentication.go:73] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2024-07-11T20:05:26Z is after 2024-07-11T14:30:51Z, verifying certificate SN=3514383209763152651, SKID=, AKID=67:85:CE:27:EA:FD:61:F8:89:53:EE:38:80:D0:D6:4B:41:4C:CA:43 failed: x509: certificate has expired or is not yet valid: current time 2024-07-11T20:05:26Z is after 2024-07-11T14:30:51Z]"

We thought that we were using bearer token authentication between ArgoCD and the external clusters, but it seems we were wrong:

$ argocd login argocd
Username: admin
Password:
'admin:login' logged in successfully
Context 'argocd' updated

$ argocd cluster rotate-auth k8s-adm-222-0010
FATA[0000] rpc error: code = InvalidArgument desc = Cluster 'https://k8s-adm-222-0010:6443' does not use bearer token authentication
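
A quick way to see which authentication fields are actually set is to list the top-level keys of the cluster secret's config blob (the secret is the one inspected further below); if bearerToken is not among the keys, the connection relies on the client certificate in tlsClientConfig:

$ kubectl get secret -n argocd cluster-k8s-adm-222-0010-2645299244 \
    -o jsonpath='{.data.config}' | base64 -d | jq 'keys'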

The ServiceAccount/Bearer Token should be long-lived (see the annotation explained in this reference), but this does not seem to matter in this case. Just for your information:

$ kubectl describe secrets -n kube-system argocd-manager-token-n8qm2
Name:         argocd-manager-token-n8qm2
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: argocd-manager
              kubernetes.io/service-account.uid: 2ba34942-ca7d-49d4-92bf-e67e791c8955

Type:  kubernetes.io/service-account-token
...
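
Just to rule out the token itself, a rough check that the long-lived ServiceAccount token still authenticates is to call the external kube-apiserver with it directly (run against the external cluster; -k skips server certificate verification and is only meant for this quick test):

$ TOKEN=$(kubectl get secret -n kube-system argocd-manager-token-n8qm2 \
    -o jsonpath='{.data.token}' | base64 -d)
$ curl -sk -H "Authorization: Bearer $TOKEN" https://k8s-adm-222-0010:6443/version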

While checking the ArgoCD secrets, we found that the cluster secret's config blob includes a TLS client certificate which has expired:

$ kubectl describe secrets -n argocd cluster-k8s-adm-222-0010-2645299244
Name:         cluster-k8s-adm-222-0010-2645299244
Namespace:    argocd
Labels:       argocd.argoproj.io/secret-type=cluster
Annotations:  managed-by: argocd.argoproj.io

Type:  Opaque

Data
====
server:  39 bytes
config:  5313 bytes
name:    16 bytes

$ kubectl get secrets -n argocd cluster-k8s-adm-222-0010-2645299244 -o json \
    | jq -r '.data|[.name, .config]|@tsv' \
    | while read -r name config; do
        echo -n '### '; base64 -d <<< $name; echo
        base64 -d <<< $config | jq -r .tlsClientConfig.certData \
          | base64 -d | openssl x509 -noout -issuer -subject -dates -serial
      done
### k8s-adm-222-0010
issuer=CN = kubernetes
subject=O = system:masters, CN = kubernetes-admin
notBefore=Jul 12 14:30:50 2023 GMT
notAfter=Jul 11 14:30:51 2024 GMT
serial=30C598E8C687A30B

$ hex2dec 30C598E8C687A30B
3514383209763152651
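
hex2dec is just a local helper; the same conversion can also be done with the shell's printf builtin:

$ printf '%d\n' 0x30C598E8C687A30B
3514383209763152651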

===> The certificate serial number matches the one from the external cluster's kube-apiserver error message ===> it is the same kubernetes-admin certificate of the external cluster that was used during the argocd cluster add operation.

Expected behavior

We would like to either use authentication based on the long-lived ServiceAccount/Bearer Token, or have an option (better yet, an automatic mechanism) that rotates the TLS client certificate.
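
For reference, a rough sketch of what the cluster secret could look like if it were switched to bearer token authentication, following the declarative cluster secret format (the token and CA values are placeholders, not taken from our setup):

$ kubectl apply -n argocd -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: cluster-k8s-adm-222-0010-2645299244
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: k8s-adm-222-0010
  server: https://k8s-adm-222-0010:6443
  config: |
    {
      "bearerToken": "<token of the argocd-manager ServiceAccount>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded CA certificate of the external cluster>"
      }
    }
EOF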

Version

$ argocd version
argocd: v2.11.0+d3f33c0
  BuildDate: 2024-05-07T16:21:23Z
  GitCommit: d3f33c00197e7f1d16f2a73ce1aeced464b07175
  GitTreeState: clean
  GoVersion: go1.21.9
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.10.7+b060053
  BuildDate: 2024-04-15T08:45:08Z
  GitCommit: b060053b099b4c81c1e635839a309c9c8c1863e9
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.2.1 2023-10-19T20:13:51Z
  Helm Version: v3.14.3+gf03cc04
  Kubectl Version: v0.26.11
  Jsonnet Version: v0.20.0

Logs

see above...

Thank you very much for taking care of this issue. We would be pleased if you could give us a permanent solution.

DonOtuseGH commented 3 months ago

Do you need further information to investigate this issue?

DonOtuseGH commented 2 months ago

Is there anything we can contribute towards further analysis, testing, or finding a solution?