canonical / k8s-snap

Canonical Kubernetes is an opinionated and CNCF conformant Kubernetes operated by Snaps and Charms, which come together to bring simplified operations and an enhanced security posture on any infrastructure.
GNU General Public License v3.0

Add k8s endpoint check to markNodeReady #615

Closed HomayoonAlimohammadi closed 2 months ago

HomayoonAlimohammadi commented 3 months ago

Summary

The onStart hook runs before onBootstrap. Because of this, on non-fresh machines (non-fresh == /etc/kubernetes/admin.conf already exists) we use an invalid/stale admin.conf kubeconfig for the csrsigning controller client. This PR makes sure we prevent running controllers until the correct .conf files are in place and we can reach the k8s cluster.
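The core of the idea can be sketched in isolation (the function name and the raw HTTP probe are illustrative; the real code goes through a Kubernetes client rather than plain net/http): a stale or invalid kubeconfig points at a server we cannot reach or are not authorized against, so a simple reachability check against /readyz fails fast.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkK8sEndpoint performs a one-shot reachability check against the
// API server's /readyz endpoint. With a stale admin.conf the server is
// unreachable or rejects us, so the check fails and we know not to
// start controllers yet. (Sketch only; names are hypothetical.)
func checkK8sEndpoint(baseURL string) error {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(baseURL + "/readyz")
	if err != nil {
		return fmt.Errorf("failed to reach k8s endpoint: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("k8s endpoint not ready: status %d", resp.StatusCode)
	}
	return nil
}
```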

How to test

Build and install k8s on a fresh machine, run bootstrap, check the logs, and confirm the csrsigning controller is running, e.g.:

Aug 21 08:28:20 test k8s.k8sd[1642]: I0821 08:28:20.897429    1642 controller/controller.go:181] "Starting Controller" logger="k8sd.csrsigning" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest"
Aug 21 08:28:21 test k8s.k8sd[1642]: I0821 08:28:21.005667    1642 controller/controller.go:215] "Starting workers" logger="k8sd.csrsigning" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" worker count=1

Also confirm that kubeconfigs are available in /etc/kubernetes/, specifically admin.conf. Now remove the k8s snap and reinstall it (same snap). Run bootstrap and, as above, confirm that the csrsigning controller is started and running.

HomayoonAlimohammadi commented 3 months ago

Right, I didn't think about that. What if we run the csrsigning controller in the onBootstrap hook? @neoaggelos

neoaggelos commented 3 months ago

Perhaps an alternative would be to block the csrsigning controller from starting until we can validate that a proper kubeconfig is in place.

neoaggelos commented 3 months ago

For example, we could return the rest config from here https://github.com/canonical/k8s-snap/blob/2994b879eab39b782556f122a2c8512bf85e9aca/src/k8s/pkg/k8sd/controllers/csrsigning/controller.go#L46 only if we can successfully call an endpoint (e.g. /readyz or a GetNode()), otherwise keep looping.

e.g. here is what we do on bootstrap (without the loop, as we don't want to keep using stale kubeconfigs) https://github.com/canonical/k8s-snap/blob/db5015ec88663f107e40accd0d36ed7c082991f1/src/k8s/pkg/client/kubernetes/status.go#L17-L19

HomayoonAlimohammadi commented 3 months ago

Adding a centralized wait didn't seem to be possible: in order to mark the node as ready, onStart has to finish, so we couldn't wait there. I think adding the k8s check in markNodeReady is the best way to go.