flannel-io / flannel

flannel is a network fabric for containers, designed for Kubernetes

Upgrading kubernetes-cni package puts cluster in a bad state #1721

Closed by emosbaugh 8 months ago

emosbaugh commented 1 year ago

Expected Behavior

Upgrading kubelet and kubernetes should not put the cluster in a bad state.

Current Behavior

If I install Kubernetes and kubelet and later upgrade to a version that also upgrades the kubernetes-cni package as a dependency, the /opt/cni/bin directory gets overwritten and the flannel binary is removed. The binary is not recreated, because it is only installed by an init container, and that init container does not run again while the flannel pod keeps running.

New pods get stuck in the ContainerCreating state because CNI operations fail.

Feb 08 18:53:32 ethanm-flannel-8 kubelet[10591]: E0208 18:53:32.174927   10591 remote_runtime.go:198] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"3b204965f15c79dfb07930071595b2bb1ed8eb9e989b50138c79242ff04cee68\": plugin type=\"flannel\" failed (add): failed to find plugin \"flannel\" in path [/opt/cni/bin]"
Feb 08 18:56:23 ethanm-flannel-8 kubelet[10591]: E0208 18:56:23.576029   10591 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"8c4e60b8-ace2-4637-9914-5bacfa4c56fb\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"3b204965f15c79dfb07930071595b2bb1ed8eb9e989b50138c79242ff04cee68\\\": plugin type=\\\"flannel\\\" failed (delete): failed to find plugin \\\"flannel\\\" in path [/opt/cni/bin]\"" pod="default/nginx-87b46959-6zrmx" podUID=8c4e60b8-ace2-4637-9914-5bacfa4c56fb

Possible Solution

It is possible to work around this issue by deleting the flannel pod so that its init container runs again and reinstalls the binary.
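A minimal sketch of that workaround as a shell helper, not a tested fix. The namespace kube-flannel and the DaemonSet name kube-flannel-ds are inferred from the pod name kube-flannel-ds-cgcvw in this report's output. The function names and the KUBECTL override are hypothetical. It uses `kubectl rollout restart` instead of deleting the pod by name, which likewise recreates the pod and so reruns the init container.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the flannel CNI binary exists and is
# executable in the given directory (default: /opt/cni/bin).
cni_has_flannel() {
    cni_dir="${1:-/opt/cni/bin}"
    [ -x "$cni_dir/flannel" ]
}

# Hypothetical helper: restart the flannel DaemonSet only when the binary
# is missing, so that the init container reinstalls it.
# Assumption: namespace "kube-flannel" and DaemonSet "kube-flannel-ds",
# as suggested by the pod listing in this issue.
restore_flannel_plugin() {
    if cni_has_flannel "$1"; then
        echo "flannel plugin present, nothing to do"
    else
        echo "flannel plugin missing, restarting kube-flannel-ds"
        # KUBECTL can be overridden (for example with "echo") for a dry run.
        ${KUBECTL:-kubectl} --namespace kube-flannel rollout restart daemonset/kube-flannel-ds
    fi
}
```

Calling `restore_flannel_plugin` with no argument checks /opt/cni/bin on the current node. On a multi-node cluster the check would have to run on every node, since each node has its own copy of the directory.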

Steps to Reproduce (for bugs)

I've created the following script to reproduce the issue.

https://gist.github.com/emosbaugh/05f340c2a48b4ab3b2e797ce11b38b74

Running this script produces output:

====================
+ ls /opt/cni/bin
bandwidth  bridge  dhcp  dummy  firewall  host-device  host-local  ipvlan  loopback  macvlan  portmap  ptp  sbr  static  tuning  vlan  vrf
+ kubectl get pod -A
NAMESPACE      NAME                                       READY   STATUS              RESTARTS   AGE
default        nginx-87b46959-6zrmx                       0/1     ContainerCreating   0          60s
default        nginx-9456bbbf9-b79f4                      1/1     Running             0          2m1s
kube-flannel   kube-flannel-ds-cgcvw                      1/1     Running             0          2m1s
kube-system    coredns-bd6b6df9f-6t5z2                    1/1     Running             0          2m1s
kube-system    coredns-bd6b6df9f-fpq8s                    1/1     Running             0          2m1s
kube-system    etcd-ethanm-flannel-8                      1/1     Running             0          2m9s
kube-system    kube-apiserver-ethanm-flannel-8            1/1     Running             0          2m8s
kube-system    kube-controller-manager-ethanm-flannel-8   1/1     Running             0          2m9s
kube-system    kube-proxy-bgq42                           1/1     Running             0          2m1s
kube-system    kube-scheduler-ethanm-flannel-8            1/1     Running             0          2m15s
+ journalctl -u kubelet -r | grep -m 1 -i 'failed to find plugin'
Feb 08 18:56:23 ethanm-flannel-8 kubelet[10591]: E0208 18:56:23.576029   10591 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"8c4e60b8-ace2-4637-9914-5bacfa4c56fb\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"3b204965f15c79dfb07930071595b2bb1ed8eb9e989b50138c79242ff04cee68\\\": plugin type=\\\"flannel\\\" failed (delete): failed to find plugin \\\"flannel\\\" in path [/opt/cni/bin]\"" pod="default/nginx-87b46959-6zrmx" podUID=8c4e60b8-ace2-4637-9914-5bacfa4c56fb
====================

emosbaugh commented 1 year ago

I've also filed a Kubernetes issue, as I'm not familiar enough with the interface to know where the responsibility lies.

https://github.com/kubernetes/kubernetes/issues/115629

rbrtbnfgl commented 1 year ago

The issue seems related to the flannel binaries being deleted. It seems strange that updating kubernetes-cni deletes all the binaries. This could also be an issue for other CNI plugins.

afbjorklund commented 1 year ago

The reason for this was that in 0.8.6, the "flannel" binary was included in the package:

be46c745d8bcb0517640385e031290d6 ./opt/cni/bin/flannel

CNI flannel plugin v0.8.6

Then the Kubernetes project accidentally repackaged that old version, labelling it as 1.1.1:

kubernetes-cni_1.1.1-00_amd64.deb

cni-plugins-linux-amd64-v0.8.6.tgz

This k8s packaging bug was finally fixed starting with 1.2.0, which no longer includes the flannel plugin.

It is now supposed to be installed to the host by the init container (install-cni-plugin).
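To check whether an installed kubernetes-cni package still ships the flannel binary, something like the following sketch could be used. The function name is hypothetical. It reads a file list on stdin, such as the output of `dpkg -L kubernetes-cni` on a Debian-based host, and assumes the plugin path /opt/cni/bin/flannel shown above.

```shell
#!/bin/sh
# Hypothetical helper: read a package file list on stdin (one path per
# line) and succeed if it contains the flannel CNI plugin binary.
package_ships_flannel() {
    grep -qx '/opt/cni/bin/flannel'
}
```

For example, `dpkg -L kubernetes-cni | package_ships_flannel && echo "package owns flannel"` prints the message only when the package lists the binary. If it does, removing or upgrading the package can delete the binary.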

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.