Enable cilium addon fails in multinode cluster

slapcat commented 6 months ago

Summary

Enabling the cilium addon on an existing multinode cluster only takes effect on the node where the command is run. The change is not applied to the other nodes, which leads to two issues:

New pods cannot be scheduled on the other nodes
If the other node is rebooted, all pods fail to start on it

The root cause seems to be that the cilium addon depends on the community addon repository being enabled, but this is not done automatically on the other nodes when enabling cilium. As a result, the other nodes are still configured for the calico CNI even though calico is no longer present.

What Should Happen Instead?

Cilium should be correctly configured on all nodes after running microk8s enable cilium.

Reproduction Steps

I used juju when testing the issue:

juju deploy microk8s --channel=1.28/stable -n 3 --series jammy
juju ssh microk8s/leader sudo microk8s enable community
juju ssh microk8s/leader sudo microk8s enable cilium
juju ssh microk8s/leader sudo microk8s.kubectl create deploy --replicas=3 --image=nginx test-deploy

You should now see a pod running on the microk8s/leader node, but pending on all others. You can also see that the contents of /var/snap/microk8s/current/args/cni-network on the microk8s nodes are different.

Introspection Report

N/A

Can you suggest a fix?

There is currently a workaround: copy the contents of /var/snap/microk8s/current/args/cni-network from the working node to the other nodes, then run snap restart microk8s on each of them.
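The workaround above could be scripted roughly as follows. This is only a sketch under assumptions: the unit names passed to the hypothetical sync_cni function (for example microk8s/0 as the working unit) depend on your deployment, the /tmp/cni-network.tgz archive name is made up for illustration, and the run helper with its DRY_RUN switch is not part of juju or microk8s. Note that extracting the archive does not delete files that exist only on the target node.

```shell
#!/bin/sh
# Sketch of the workaround: copy the cni-network directory from a working
# unit to the other units, then restart microk8s on each of them.
CNI_DIR=/var/snap/microk8s/current/args/cni-network

# Hypothetical helper: with DRY_RUN=1 the command is only printed,
# so the plan can be reviewed before anything is changed.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: sync_cni <working-unit> <broken-unit>...
sync_cni() {
  src=$1
  shift
  # Pack the directory on the working unit and pull the archive locally.
  run juju ssh "$src" "sudo tar -C $CNI_DIR -czf /tmp/cni-network.tgz ." || return 1
  run juju scp "$src:/tmp/cni-network.tgz" ./cni-network.tgz || return 1
  for unit in "$@"; do
    # Push the archive, unpack it over the existing directory, restart the snap.
    run juju scp ./cni-network.tgz "$unit:/tmp/cni-network.tgz" || return 1
    run juju ssh "$unit" "sudo tar -C $CNI_DIR -xzf /tmp/cni-network.tgz && sudo snap restart microk8s" || return 1
  done
}
```

Running DRY_RUN=1 sync_cni microk8s/0 microk8s/1 microk8s/2 prints the juju commands without executing them, which is a reasonable first step before touching a live cluster.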

If you are building the cluster from scratch, or moving from single node to multinode, you can also prepare new nodes by enabling the community and cilium addons on them before running add-node.
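Preparing a new node this way might look like the sketch below. It assumes the new node is reachable over plain ssh under a host name you supply; prepare_node and the DRY_RUN switch are hypothetical helpers, not microk8s features.

```shell
#!/bin/sh
# Sketch: enable the community repository and the cilium addon on a fresh
# node before it is joined to the cluster.
# Usage: prepare_node <ssh-host>
prepare_node() {
  node=$1
  # Order matters: cilium comes from the community repository.
  for addon in community cilium; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      # Only print the command so it can be reviewed first.
      echo "+ ssh $node sudo microk8s enable $addon"
    else
      ssh "$node" sudo microk8s enable "$addon" || return 1
    fi
  done
}
```

After this, microk8s add-node is run on an existing cluster node, and the join command it prints is run on the prepared node.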

Are you interested in contributing with a fix?

@ktsakalozos This is about the issue I asked you about earlier this week.

slapcat commented 6 months ago

Additional Details

Pod errors on other nodes after enabling cilium:

  Normal   Scheduled               3m2s                default-scheduler  Successfully assigned default/web-cilium-5f668dd859-mm5d8 to juju-9c0265-microk8s-1
  Warning  FailedCreatePodSandBox  3m2s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5b702f9572b97ec1f34b3e43691f1f0c5422326a3bfe5a799a96d70f0f913ea9": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
  Normal   SandboxChanged          3s (x15 over 3m1s)  kubelet            Pod sandbox changed, it will be killed and re-created.

Working node /var/snap/microk8s/current/args/cni-network/ contents:

05-cilium-cni.conf  10-calico.conflist  calico-kubeconfig  cni.yaml.disabled

Broken node /var/snap/microk8s/current/args/cni-network/ contents:

10-calico.conflist  calico-kubeconfig  cni.yaml  cni.yaml.backup
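To make the difference between the two listings above explicit, a small comparison helper can be used. This is a sketch only: cni_listing_diff is a hypothetical function name, and it works on listings passed as whitespace-separated strings rather than fetching them from the nodes.

```shell
#!/bin/sh
# Compare two directory listings (whitespace-separated file names) and
# report which names appear on only one side.
# Usage: cni_listing_diff "<working listing>" "<broken listing>"
cni_listing_diff() {
  a=$(mktemp) || return 1
  b=$(mktemp) || return 1
  # One name per line, sorted in the C locale so sort and comm agree.
  printf '%s\n' "$1" | tr -s ' ' '\n' | LC_ALL=C sort > "$a"
  printf '%s\n' "$2" | tr -s ' ' '\n' | LC_ALL=C sort > "$b"
  # comm -23: lines only in the first file; comm -13: only in the second.
  LC_ALL=C comm -23 "$a" "$b" | sed 's/^/working-only: /'
  LC_ALL=C comm -13 "$a" "$b" | sed 's/^/broken-only: /'
  rm -f "$a" "$b"
}
```

Applied to the two listings above, it reports 05-cilium-cni.conf and cni.yaml.disabled as present only on the working node, and cni.yaml and cni.yaml.backup as present only on the broken node. On a live deployment the listings could be collected with something like juju ssh microk8s/0 sudo ls /var/snap/microk8s/current/args/cni-network, where the unit name is an assumption.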

mcosti commented 2 weeks ago

I am getting the exact same errors as you, but weirdly not always. A daemonset I have can be scheduled onto the other node, but regular deployments cannot.

I am also confused about why calico is mentioned when it should have been removed from the system (I guess?)

Did you find any solution? I am connecting my nodes via tailscale.

With calico it worked, but it was a bit flaky, which is why I'm trying out cilium.

Thanks
