DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster
Apache License 2.0

Renew client certificates on k8s-dev #34

Closed: nickatnceas closed this issue 1 year ago

nickatnceas commented 1 year ago

As reported by Melinda in the DataONE #dev-general Slack, it appears that the client certificates for the k8s-dev cluster have expired:

metadig@docker-dev-ucsb-1:~$ kubectl get pods,services --all-namespaces -o wide
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2022-08-25T13:14:29-07:00 is after 2022-08-17T17:04:16Z
root@docker-dev-ucsb-1:~# sudo kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Aug 17, 2022 17:06 UTC   <invalid>       ca                      no
apiserver                  Aug 17, 2022 17:04 UTC   <invalid>       ca                      no
apiserver-etcd-client      Aug 17, 2022 17:04 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Aug 17, 2022 17:04 UTC   <invalid>       ca                      no
controller-manager.conf    Aug 17, 2022 17:05 UTC   <invalid>       ca                      no
etcd-healthcheck-client    Aug 17, 2022 17:03 UTC   <invalid>       etcd-ca                 no
etcd-peer                  Aug 17, 2022 17:03 UTC   <invalid>       etcd-ca                 no
etcd-server                Aug 17, 2022 17:03 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Aug 17, 2022 17:04 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Aug 17, 2022 17:05 UTC   <invalid>       ca                      no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Jan 28, 2030 19:14 UTC   7y              no
etcd-ca                 Jan 28, 2030 19:14 UTC   7y              no
front-proxy-ca          Jan 28, 2030 19:14 UTC   7y              no

There appear to be several ways to renew them, according to https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/
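For reference, the options in those docs boil down to renewing everything at once or one certificate at a time; a minimal sketch, assuming the kubeadm default paths on the control-plane node:

# renew every kubeadm-managed certificate, or just a single component
sudo kubeadm certs renew all
sudo kubeadm certs renew apiserver
# spot-check a certificate on disk afterwards
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate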

nickatnceas commented 1 year ago

I ran the following to renew the certs and get the metadig account reconnected to k8s-dev:

root@docker-dev-ucsb-1:~# sudo kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
root@docker-dev-ucsb-1:~# sudo kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Aug 25, 2023 20:45 UTC   364d            ca                      no
apiserver                  Aug 25, 2023 20:45 UTC   364d            ca                      no
apiserver-etcd-client      Aug 25, 2023 20:45 UTC   364d            etcd-ca                 no
apiserver-kubelet-client   Aug 25, 2023 20:45 UTC   364d            ca                      no
controller-manager.conf    Aug 25, 2023 20:45 UTC   364d            ca                      no
etcd-healthcheck-client    Aug 25, 2023 20:45 UTC   364d            etcd-ca                 no
etcd-peer                  Aug 25, 2023 20:45 UTC   364d            etcd-ca                 no
etcd-server                Aug 25, 2023 20:45 UTC   364d            etcd-ca                 no
front-proxy-client         Aug 25, 2023 20:45 UTC   364d            front-proxy-ca          no
scheduler.conf             Aug 25, 2023 20:45 UTC   364d            ca                      no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Jan 28, 2030 19:14 UTC   7y              no
etcd-ca                 Jan 28, 2030 19:14 UTC   7y              no
front-proxy-ca          Jan 28, 2030 19:14 UTC   7y              no

To get the metadig account reconnected, I then copied the refreshed admin.conf into its kubeconfig:

cp /etc/kubernetes/admin.conf /home/metadig/.kube/config
chown metadig:metadig /home/metadig/.kube/config

metadig@docker-dev-ucsb-1:~$ kubectl get nodes
NAME                STATUS   ROLES                  AGE      VERSION
docker-dev-ucsb-1   Ready    control-plane,master   2y207d   v1.23.3
docker-dev-ucsb-2   Ready    <none>                 2y207d   v1.23.3
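
As a sanity check that a copied kubeconfig really carries the renewed client certificate, the embedded cert can be decoded and inspected; a sketch, assuming the metadig kubeconfig path used above:

grep client-certificate-data /home/metadig/.kube/config | awk '{print $2}' \
  | base64 -d | openssl x509 -noout -enddate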

nickatnceas commented 1 year ago

I restarted kube-apiserver, kube-controller-manager, kube-scheduler, and etcd, per the instructions given in the output of the cert renew command above:

metadig@docker-dev-ucsb-1:~$ kubectl delete pod/kube-scheduler-docker-dev-ucsb-1 -n kube-system
pod "kube-scheduler-docker-dev-ucsb-1" deleted

metadig@docker-dev-ucsb-1:~$ kubectl delete pod/kube-apiserver-docker-dev-ucsb-1 -n kube-system
pod "kube-apiserver-docker-dev-ucsb-1" deleted

metadig@docker-dev-ucsb-1:~$ kubectl delete pod/kube-controller-manager-docker-dev-ucsb-1 -n kube-system
pod "kube-controller-manager-docker-dev-ucsb-1" deleted

metadig@docker-dev-ucsb-1:~$ kubectl delete pod/etcd-docker-dev-ucsb-1 -n kube-system
pod "etcd-docker-dev-ucsb-1" deleted

metadig@docker-dev-ucsb-1:~$ kubectl get pods -n kube-system
NAME                                        READY   STATUS    RESTARTS         AGE
calico-kube-controllers-6fd7b9848d-7wrr9    1/1     Running   1792 (17m ago)   165d
calico-node-78ttp                           1/1     Running   1 (98d ago)      165d
calico-node-hzznf                           1/1     Running   31 (96d ago)     165d
coredns-78fcd69978-qktrm                    1/1     Running   4 (96d ago)      373d
coredns-78fcd69978-xjqjz                    1/1     Running   4 (96d ago)      373d
etcd-docker-dev-ucsb-1                      1/1     Running   51 (96d ago)     114s
kube-apiserver-docker-dev-ucsb-1            1/1     Running   636              2m48s
kube-controller-manager-docker-dev-ucsb-1   1/1     Running   2205 (8d ago)    2m20s
kube-proxy-7hmdm                            1/1     Running   2 (98d ago)      373d
kube-proxy-x54mw                            1/1     Running   4 (96d ago)      373d
kube-scheduler-docker-dev-ucsb-1            1/1     Running   531 (8d ago)     4m31s
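
To confirm that the recreated kube-apiserver pod is actually serving the renewed certificate, the live endpoint can be checked directly; a sketch, assuming the default API server port 6443 on the control-plane node:

echo | openssl s_client -connect localhost:6443 2>/dev/null | openssl x509 -noout -enddate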

nickatnceas commented 1 year ago

Updating the existing *.config files in the metadig home directory with the new certificate-authority-data value from /etc/kubernetes/admin.conf appears to restore access, but only to each config's own namespace:

metadig@docker-dev-ucsb-1:~/.kube$ KUBECONFIG=/home/metadig/.kube/polder.config

metadig@docker-dev-ucsb-1:~/.kube$ kubectl get pods --all-namespaces
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:polder:polder" cannot list resource "pods" in API group "" at the cluster scope

metadig@docker-dev-ucsb-1:~/.kube$ kubectl get pods -n polder
NAME                          READY   STATUS      RESTARTS         AGE
crawl-27658080--1-z4bq7       0/1     Completed   0                22d
crawl-27668160--1-dr7d6       0/1     Completed   0                15d
crawl-27678240--1-t7dnn       0/1     Completed   0                8d
dev-gleaner-8b6b6c4c9-ghhln   3/3     Running     0                31d
dev-polder-78bccfcd46-h9rnb   1/1     Running     46 (3d21h ago)   27d
setup-gleaner--1-9ffw8        0/1     Completed   0                27d
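
The same edit can be scripted rather than done by hand; a minimal sketch, assuming the file layout shown above and that a certificate-authority-data line already exists in the target config:

new_ca=$(sudo grep certificate-authority-data /etc/kubernetes/admin.conf | awk '{print $2}')
sed -i "s|certificate-authority-data: .*|certificate-authority-data: ${new_ca}|" /home/metadig/.kube/polder.config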

nickatnceas commented 1 year ago

I GPG-encrypted the polder.config file and emailed it to Melinda, and after copying the certificate-authority-data: line into another polder.config file (with user dev-polder instead of polder) she reported that she could connect to the k8s-dev cluster again.
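
For the record, the transfer step was just standard GPG; a sketch, with a hypothetical recipient key:

gpg --encrypt --recipient melinda@example.org polder.config   # writes polder.config.gpg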

mbjones commented 1 year ago

ok, so I looked at the credentials in config-dev and compared them to the ones in root@k8s-dev-ucsb-1:/etc/kubernetes/admin.conf, and the client-certificate-data and client-key-data for the kubernetes-admin user did not match. I updated config-dev with the new info from admin.conf, and now everything works fine to log in to dev-k8s:

$ kubectl config use-context dev-k8s
Switched to context "dev-k8s".
$ kubectl get nodes
NAME                STATUS   ROLES                  AGE      VERSION
docker-dev-ucsb-1   Ready    control-plane,master   2y220d   v1.23.3
docker-dev-ucsb-2   Ready    <none>                 2y220d   v1.23.3

so, I updated the config-dev file in the security repo -- @nickatnceas @taojing2002 if you grab the new copy it should work for you now. let me know if not.
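
For anyone repeating that comparison, hashing the embedded fields is quicker than eyeballing base64 blobs; a sketch, assuming config-dev is the local copy from the security repo and contains a single kubernetes-admin user entry:

grep client-certificate-data config-dev | awk '{print $2}' | sha256sum
sudo grep client-certificate-data /etc/kubernetes/admin.conf | awk '{print $2}' | sha256sum
# matching output means the kubeconfig already carries the renewed credentials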

nickatnceas commented 1 year ago

Matt reported issues with k8s-dev:

$ kubectl run -i -n jones --tty --rm debug --image=busybox --restart=Never -- sh
pod "debug" deleted
error: timed out waiting for the condition

And I had the same experience:

outin@halt-21280:~/.kube$ kubectl run -i -n nick --tty --rm debug --image=busybox --restart=Never -- sh
pod "debug" deleted
error: timed out waiting for the condition

After checking the logs I found recent errors related to the cert expiration in /var/log/containers/kube-apiserver-docker-dev-ucsb-1_kube-system_kube-apiserver-cabce005da75cb02ea886d5f351a79c9136c8c519097123d946165e2ef596d51.log:

{"log":"E0922 17:54:18.402829       1 authentication.go:63] \"Unable to authenticate the request\" err=\"[x509: certificate has expired or is not yet valid: current time 2022-09-22T17:54:18Z is after 2022-08-17T17:05:39Z, verifying certificate SN=932365683341995477, SKID=, AKID= failed: x509: certificate has expired or is not yet valid: current time 2022-09-22T17:54:18Z is after 2022-08-17T17:05:39Z]\"\n","stream":"stderr","time":"2022-09-22T17:54:18.403481156Z"}

I restarted kube-apiserver-docker-dev-ucsb-1 again (same method as above), which did not help. Running systemctl restart kubelet caused more issues, such as api.test.dataone.org going offline. I then rebooted k8s-dev-ctrl-1, and when it came back up api.test.dataone.org worked again and I was able to run the test pod:

outin@halt-21280:~/.kube$ kubectl run -i -n nick --tty --rm debug --image=busybox --restart=Never -- sh
If you don't see a command prompt, try pressing enter.
/ #

It appears that more than just the four services listed above need to be restarted after renewing certs, and rebooting the control-plane node takes care of all of the required restarts.
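
A likely explanation, per the kubeadm certificate management docs: the control-plane components run as static pods managed by the local kubelet, so deleting them with kubectl only removes their mirror objects and does not reliably restart the containers with the new certificates. The procedure the docs suggest is to move the manifests out of the static pod path and back; a sketch, assuming the kubeadm default paths:

sudo mkdir -p /etc/kubernetes/manifests-stopped
sudo mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests-stopped/
sleep 20   # give the kubelet time to stop the static pods
sudo mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/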