canonical / cluster-api-control-plane-provider-microk8s

This project offers a cluster API control plane controller that manages the control plane of a MicroK8s cluster. It is expected to be used along with the respective MicroK8s specific machine bootstrap provider.
https://microk8s.io

MAAS Cilium RollingUpgrade 1.27 to 1.28 makes old machine deleted before pods in old-version node are scheduled in different node #62

Closed Kun483 closed 4 days ago

Kun483 commented 4 months ago

Create the folder .cluster-api/overrides/infrastructure-maas/v0.5.0 and, under that v0.5.0 folder, create the files cluster-template.yaml, infrastructure-components.yaml, and metadata.yaml. They are taken from our repo: https://github.com/spectrocloud/cluster-api-provider-maas/blob/main/templates/cluster-template.yaml https://github.com/spectrocloud/cluster-api-provider-maas/blob/main/spectro/generated/core-global.yaml https://github.com/spectrocloud/cluster-api-provider-maas/blob/main/metadata.yaml
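
A minimal sketch of that layout, assuming the override directory lives under the home directory and that the linked files are fetched from their raw.githubusercontent.com counterparts, with core-global.yaml saved as infrastructure-components.yaml:

# prepare the local clusterctl override layer for the MAAS provider
mkdir -p ~/.cluster-api/overrides/infrastructure-maas/v0.5.0
cd ~/.cluster-api/overrides/infrastructure-maas/v0.5.0
curl -fsSL -o cluster-template.yaml \
    https://raw.githubusercontent.com/spectrocloud/cluster-api-provider-maas/main/templates/cluster-template.yaml
curl -fsSL -o infrastructure-components.yaml \
    https://raw.githubusercontent.com/spectrocloud/cluster-api-provider-maas/main/spectro/generated/core-global.yaml
curl -fsSL -o metadata.yaml \
    https://raw.githubusercontent.com/spectrocloud/cluster-api-provider-maas/main/metadata.yaml

Then run: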

kind create cluster
clusterctl init --infrastructure maas:v0.5.0 --bootstrap microk8s --control-plane microk8s

Then, kubectl apply the manifest (please replace the variables): maas_microk8s_cilium_share.yaml.zip. Then, in the target cluster, install Cilium:

helm install cilium cilium/cilium  \
    --namespace kube-system \
    --set cni.confPath=/var/snap/microk8s/current/args/cni-network \
    --set cni.binPath=/var/snap/microk8s/current/opt/cni/bin \
    --set daemon.runPath=/var/snap/microk8s/current/var/run/cilium \
    --set operator.replicas=1 \
    --set ipam.operator.clusterPoolIPv4PodCIDRList="10.1.0.0/16" \
    --set nodePort.enabled=true
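
For reference, a minimal sketch of how the target (workload) cluster is reached before running the helm install above; the cluster name and namespace are placeholders for whatever the applied manifest defines:

# fetch the workload cluster kubeconfig from the management cluster
clusterctl get kubeconfig <cluster-name> -n <namespace> > target.kubeconfig
export KUBECONFIG=$PWD/target.kubeconfig
kubectl get nodes    # confirm the control plane node has registered before installing Cilium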

Please execute clusterctl init --infrastructure maas:v0.5.0 --bootstrap microk8s --control-plane microk8s in the target cluster after all pods in the first-launched cluster are running.

To trigger a RollingUpgrade of the control plane nodes, I change 1.27.13 to 1.28.9, and 1.27 to 1.28 in - /capi-scripts/00-install-microk8s.sh '--channel 1.27/stable --classic' in the preRunCommands of the mcp (MicroK8sControlPlane); a sketch of that change is shown below. I observed that the new node running 1.28 joins the cluster, and then the Cilium pod on the old node is force-deleted. After that, the old machine itself is deleted. However, the pods on that old machine have not yet been rescheduled to a different node.
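
A minimal sketch of that change, assuming the manifest extracted from the attached archive is named maas_microk8s_cilium_share.yaml (only the version strings are rewritten before re-applying):

# bump the Kubernetes version and the MicroK8s snap channel in the control plane manifest
sed -i 's/1\.27\.13/1.28.9/g; s#1\.27/stable#1.28/stable#g' maas_microk8s_cilium_share.yaml
kubectl apply -f maas_microk8s_cilium_share.yaml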

For example, after executing clusterctl init --infrastructure maas:v0.5.0 --bootstrap microk8s --control-plane microk8s, the capi-microk8s-bootstrap-controller-manager, capi-microk8s-control-plane-controller-manager, and capi-controller-manager pods are on the node called 07. During the RollingUpgrade, a new machine (named 08) comes up to replace 07, and these pods on 07 disappear from the cluster because machine 07 is deleted before they are rescheduled to a different node. However, the Deployments for those pods still report them as READY 1/1. Then, after SSHing into machine 08, journalctl -u snap.microk8s.daemon-kubelite shows the error below:

microk8s.daemon-kubelite[10219]: E0710 02:40:27.853000   10219 kubelet.go:2855] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
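
A short sketch of the commands used to observe this stale state (node names and the SSH user are illustrative):

kubectl get nodes -o wide                                # the old node 07 is gone; only the new node 08 remains
kubectl get deploy -A | grep capi                        # deployments still report READY 1/1 even though their pods are gone
kubectl get pods -A -o wide | grep -E 'capi|cilium'      # pods that lived on node 07 are missing and not rescheduled
ssh ubuntu@<machine-08-address> 'journalctl -u snap.microk8s.daemon-kubelite --no-pager | tail -n 20'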

Environment:
infrastructure-maas: v0.5.0
Kernel: 5.15.0-113-generic
CAPI: v1.7.4
MicroK8s Bootstrap: v0.6.6
MicroK8s Control Plane: v0.6.6
Container Runtime: containerd://1.6.28
OS: Ubuntu 22.04.3

HomayoonAlimohammadi commented 4 days ago

Hey! I think that with the v0.6.11 release this issue should be resolved. We added a node removal step during scale-down, which removes the node's entry from the underlying dqlite datastore. Please let us know if you're still experiencing this issue.
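
A minimal sketch of moving to that release with clusterctl, assuming v0.6.11 is published for both the MicroK8s bootstrap and control plane providers:

clusterctl upgrade apply \
    --bootstrap microk8s:v0.6.11 \
    --control-plane microk8s:v0.6.11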