apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Kubernetes certificates not automatically renewing #9418

Closed sagb closed 1 month ago

sagb commented 1 month ago
ISSUE TYPE

Bug Report

COMPONENT NAME

Kubernetes

CLOUDSTACK VERSION

4.19.0.2

CONFIGURATION

Kubernetes 1.27.3, two control nodes.

SUMMARY

Our K8s certificates expired. It seems that CloudStack did not automatically renew them.
I tried to renew them manually on both control nodes using:

kubeadm certs renew all      # renew all certificates managed by kubeadm
systemctl restart kubelet    # restart kubelet (control-plane static pods must also be restarted to pick up the new certs)

This updated the certificates separately for each control node, and both are recognized by Kubernetes when using /etc/kubernetes/admin.conf as kubeconfig. However, CloudStack's "Kubernetes access" page only showed the old, expired certificate.
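
For what it's worth, one way to check which certificate the API server is actually serving (as opposed to what the CloudStack page displays) is to query the TLS endpoint directly; the node address and port 6443 below are placeholders:

echo | openssl s_client -connect <control-node-ip>:6443 2>/dev/null | openssl x509 -noout -subject -dates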

In an attempt to trigger an automatic renewal, I restored the nodes from a snapshot, stopped the k8s cluster from CloudStack's web UI, and started it again. It does not start, and there is an exception in /var/log/cloudstack/management/management-server.log:

2024-07-19 10:19:43,690 WARN  [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-19:ctx-b594d6ec job-4889 ctx-21a09500) (logid:767ef7b5) API endpoint for Kubernetes cluster : mediatech not available
javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake
    at java.base/sun.security.ssl.SSLSocketImpl.handleEOF(SSLSocketImpl.java:1701)
    at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1519)
    at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1421)
    at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:456)
    at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:427)
    at java.base/sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:580)
    at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:201)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1614)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1542)
    at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
    at com.cloud.kubernetes.cluster.utils.KubernetesClusterUtil.isKubernetesClusterServerRunning(KubernetesClusterUtil.java:239)
    at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterStartWorker.startStoppedKubernetesCluster(KubernetesClusterStartWorker.java:590)
    at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.startKubernetesCluster(KubernetesClusterManagerImpl.java:1324)
    at org.apache.cloudstack.api.command.user.kubernetes.cluster.StartKubernetesClusterCmd.execute(StartKubernetesClusterCmd.java:113)
    at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:172)
    at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:112)
    at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:654)
    at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
    at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
    at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:602)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.EOFException: SSL peer shut down incorrectly
    at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:489)
    at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478)
    at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:160)
    ... 28 more

The absence of automatic renewal is probably a bug.

Also, I would be grateful for a hint on how to recover from the current situation.

boring-cyborg[bot] commented 1 month ago

Thanks for opening your first issue here! Be sure to follow the issue template!

shwstppr commented 1 month ago

@sagb I don't think the current code has the auto-renew functionality: https://github.com/apache/cloudstack/blob/main/plugins/integrations/kubernetes-service/src/main/java/com/cloud/kubernetes/cluster/actionworkers/KubernetesClusterStartWorker.java#L147-L149

Also, I can see that certificates from ACS are issued for 10 years, so it is interesting why they expired in your case.
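
For instance, the CA expiry can be checked directly on a control node (assuming the standard kubeadm PKI path):

openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -enddate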

The logs you shared are at WARN level. They are logged when the ACS management server first tries to connect to the control node; the operation is retried several times, so the management server may have been able to connect to the control node later.

sagb commented 1 month ago

The certificates on the k8s control nodes are indeed expired:

control-node-1:~# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Jul 18, 2024 08:23 UTC   <invalid>       ca                      no
apiserver                  Jul 18, 2024 08:22 UTC   <invalid>       ca                      no
apiserver-etcd-client      Jul 18, 2024 08:22 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Jul 18, 2024 08:22 UTC   <invalid>       ca                      no
controller-manager.conf    Jul 18, 2024 08:22 UTC   <invalid>       ca                      no
etcd-healthcheck-client    May 26, 2024 08:19 UTC   <invalid>       etcd-ca                 no
etcd-peer                  May 26, 2024 08:19 UTC   <invalid>       etcd-ca                 no
etcd-server                May 26, 2024 08:19 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Jul 18, 2024 08:22 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Jul 18, 2024 08:23 UTC   <invalid>       ca                      no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      May 24, 2033 08:19 UTC   8y              no
etcd-ca                 May 24, 2033 08:19 UTC   8y              no
front-proxy-ca          May 24, 2033 08:19 UTC   8y              no

Since CloudStack shows the expired certificate on the Kubernetes access page of its web UI, it has some control over them. How can I trigger the renewal?

It doesn't seem that CloudStack will be able to connect to the control nodes later; it has already been trying for several hours.

kiranchavala commented 1 month ago

@sagb

CloudStack doesn't provide a way to automatically renew the k8s component certificates.

Your request can be filed as an improvement request.

When you launch a CKS cluster, CKS internally uses kubeadm to set up the Kubernetes cluster.

Client certificates generated by kubeadm expire after 1 year.

root@test-control-190e8277e14:~# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Jul 25, 2025 04:32 UTC   364d            ca                      no
apiserver                  Jul 25, 2025 04:32 UTC   364d            ca                      no
apiserver-etcd-client      Jul 25, 2025 04:32 UTC   364d            etcd-ca                 no
apiserver-kubelet-client   Jul 25, 2025 04:32 UTC   364d            ca                      no
controller-manager.conf    Jul 25, 2025 04:32 UTC   364d            ca                      no
etcd-healthcheck-client    Jul 25, 2025 04:32 UTC   364d            etcd-ca                 no
etcd-peer                  Jul 25, 2025 04:32 UTC   364d            etcd-ca                 no
etcd-server                Jul 25, 2025 04:32 UTC   364d            etcd-ca                 no
front-proxy-client         Jul 25, 2025 04:32 UTC   364d            front-proxy-ca          no
scheduler.conf             Jul 25, 2025 04:32 UTC   364d            ca                      no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Jul 23, 2034 04:32 UTC   9y              no
etcd-ca                 Jul 23, 2034 04:32 UTC   9y              no
front-proxy-ca          Jul 23, 2034 04:32 UTC   9y              no

It's up to the admin user to log in to the control node and renew the client certificates.

As a workaround:

Log in to the control node, run "kubeadm certs renew all", and then delete the following pods so the control-plane components pick up the renewed certificates:


root@primary1-node:~# kubectl delete pod -n kube-system -l component=kube-apiserver
root@primary1-node:~# kubectl delete pod -n kube-system -l component=kube-scheduler
root@primary1-node:~# kubectl delete pod -n kube-system -l component=kube-controller-manager
root@primary1-node:~# kubectl delete pod -n kube-system -l component=etcd
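
The kubelet recreates these static pods automatically. Assuming kubeadm and kubectl are still available on the node, the result can then be verified:

kubectl get pods -n kube-system    # wait until the control-plane pods are Running again
kubeadm certs check-expiration     # RESIDUAL TIME should now show ~364d for the client certificates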

Another workaround is to upgrade the Kubernetes version:

kubeadm renews all the certificates during a control-plane upgrade.

Register the 1.28.4 CKS ISO and upgrade the CKS cluster, which should renew the certificates:

https://download.cloudstack.org/cks/
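
As a rough sketch, the ISO registration and cluster upgrade can also be driven through the API, for example with CloudMonkey (cmk). The ISO filename and the minimum CPU/memory values below are assumptions, and the UUIDs are placeholders:

cmk add kubernetessupportedversion semanticversion=1.28.4 \
    mincpunumber=2 minmemory=2048 \
    url=https://download.cloudstack.org/cks/setup-1.28.4.iso
cmk upgrade kubernetescluster id=<cluster-uuid> kubernetesversionid=<version-uuid>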

ref:

https://www.juniper.net/documentation/us/en/software/paragon-automation23.2/paragon-automation-troubleshooting-guide/topics/task/tg-manual-renew-kubeadm-cert.html

https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/

sagb commented 1 month ago

Thank you for the detailed reply.

My issue was complicated by an etcd split-brain on the control nodes, likely caused by restoring the control nodes from snapshots. We manually renewed the certificates, then removed one of the nodes from the cluster and bootstrapped it again. This seems to have fixed everything.
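
Roughly, the re-bootstrap of the removed node looked like the following; every name is a placeholder, and the join parameters come from the usual kubeadm helpers (kubeadm token create --print-join-command, kubeadm init phase upload-certs --upload-certs):

kubectl delete node <broken-control-node>    # run from the healthy control node
kubeadm reset -f                             # run on the broken node to wipe its local state
kubeadm join <endpoint>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash> --control-plane --certificate-key <key>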

The new certificate appeared in the CloudStack GUI soon after this, though I still don't understand where it comes from.

To summarize your reply and the third-party documentation, the options for renewing Kubernetes certificates in CloudStack are clear: either renew the certificates manually or upgrade the Kubernetes version.

I think the issue can be closed.