flannel-io / flannel

flannel is a network fabric for containers, designed for Kubernetes
Apache License 2.0

Flannel process gets stuck after etcd outage. #1830

Closed alperbas closed 1 week ago

alperbas commented 7 months ago

Expected Behavior

After etcd recovers, flannel should continue working normally.

Current Behavior

When etcd goes down (e.g., all etcd nodes are rebooted at the same time), the flannel process starts using 200% CPU and stops updating the routing table. It does not recover after etcd comes back online; it stays stuck until flannel is restarted.

I noticed that the issue started with v0.21+. When I tested with v0.20, flannel simply crashed when etcd went down, systemd kept restarting it until etcd was back online, and it then continued working. With v0.21+, it just sits there using 200% CPU. I assume that whatever changed in the etcd connection logic gets stuck in an infinite loop somewhere. No logs are shown even with v=10.
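To illustrate the kind of loop suspected above, here is a minimal sketch of a reconnect loop that sleeps with exponential backoff between failed attempts. This is not flannel's code (flannel is written in Go, and the actual cause was not identified in this thread); `watch_once`, `backoff_delays`, and `watch_with_backoff` are hypothetical names used only for illustration. The point is that a retry loop without the `sleep` call would spin at full CPU for as long as the backend is unreachable.

```python
import time
from typing import Callable, Iterator, List


def backoff_delays(base: float = 0.5, cap: float = 30.0) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def watch_with_backoff(
    watch_once: Callable[[], None],
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call watch_once until it succeeds, sleeping between failures.

    Returns the delays that were slept. `watch_once` is a hypothetical
    stand-in for "open a watch on the backend"; it is expected to raise
    ConnectionError while the backend is down.
    """
    slept: List[float] = []
    delays = backoff_delays()
    for _ in range(max_attempts):
        try:
            watch_once()
            return slept
        except ConnectionError:
            # Without this pause the loop would retry as fast as the
            # CPU allows, which looks like a process pinned at 100%+.
            delay = next(delays)
            sleep(delay)
            slept.append(delay)
    raise ConnectionError("backend still unreachable after retries")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.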

Possible Solution

I tried to browse the changes, but there are just too many; I can't tell what went wrong or where.

Steps to Reproduce (for bugs)

  1. Set up a k8s cluster with separate etcd nodes and flannel running on the hosts under systemd. (I'm not sure whether running flannel as a pod would change things.)
  2. Take etcd down for a short time. Either block connections at the firewall or reboot at least 2 nodes at the same time.
  3. Observe flannel CPU usage. Wait for etcd to recover. Flannel is still stuck on all nodes.
  4. Add a new worker node to the cluster and observe that routing tables are not updated on the nodes where flannel is stuck.
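For step 3, one way to measure the CPU usage of the flanneld process without extra tooling is to sample `/proc/<pid>/stat` twice and compare the accumulated CPU ticks. The sketch below assumes a Linux host; the function names are made up for this example, and a value near 200 would correspond to two fully busy threads as described in this report.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split on the last ')' before indexing the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                interval_s: float, ticks_per_s: int) -> float:
    """CPU usage over the interval; 100.0 means one fully busy core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample_pid(pid: int, interval_s: float = 5.0) -> float:
    """Sample a live process twice and return its CPU usage in percent."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(f"/proc/{pid}/stat") as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)
```

Usage would be `sample_pid(pid)` with the PID of flanneld on the affected node.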

Context

If you add new nodes while flannel is stuck, the routing tables don't get updated, and pods scheduled on the new nodes cannot communicate with the rest of the cluster. Also, every node shows a constant usage of 2 CPU threads.
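A quick way to confirm the missing routes is to compare the output of `ip route` on an affected node against the pod subnets you expect to be reachable. The following is only a sketch: `missing_routes` is a hypothetical helper, and the subnets in the usage note are invented for illustration rather than taken from this cluster.

```python
import ipaddress
from typing import Iterable, List


def missing_routes(ip_route_output: str,
                   expected_subnets: Iterable[str]) -> List[str]:
    """Return the expected subnets that have no entry in `ip route` output."""
    installed = set()
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            # The first field of a route line is normally the destination.
            installed.add(ipaddress.ip_network(fields[0], strict=False))
        except ValueError:
            # Lines such as "default via ..." have no prefix in field 0.
            continue
    return [
        subnet for subnet in expected_subnets
        if ipaddress.ip_network(subnet, strict=False) not in installed
    ]
```

For example, if a new worker was assigned the (made-up) subnet 10.1.20.0/24, calling `missing_routes(output, ["10.1.20.0/24"])` on the text returned by `ip route` would return that subnet on nodes where flannel is stuck.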

Your Environment

I am using Charmed Kubernetes from Canonical. I have not tested other distributions, but I see no reason it would not happen on others as well.

thomasferrandiz commented 7 months ago

Can you share the systemd file you use to deploy flannel? In particular, do you set --kube-subnet-mgr to true or false?

alperbas commented 7 months ago

Sure @thomasferrandiz, here it is. --kube-subnet-mgr is not set at all, but I assume it defaults to false with this unit file.

[Unit]
Description=Flannel Overlay Network
Documentation=https://github.com/coreos/flannel
Wants=network-online.target
After=network.target network-online.target

[Service]
ExecStart=/usr/local/bin/flanneld -iface=ethX -etcd-endpoints=https://x.x.x.x:2379,https://x.x.x.x:2379,https://x.x.x.x:2379 -etcd-certfile=/xx/client-cert.pem -etcd-keyfile=/xx/client-key.pem  -etcd-cafile=/xx/client-ca.pem --ip-masq
TimeoutStartSec=0
Restart=on-failure
LimitNOFILE=655536

thomasferrandiz commented 7 months ago

Thanks for the file. Indeed, the default value for --kube-subnet-mgr is false, which makes flannel use etcd to store its configuration.

Do you have a specific use case that requires using flannel with k8s but without setting --kube-subnet-mgr to true?

If not, you could deploy flannel as a pod through the manifest kube-flannel.yml, which will avoid the issue since flannel won't use etcd.

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.