Closed: alperbas closed this issue 1 week ago.
Can you share the systemd file you use to deploy flannel? In particular, do you set --kube-subnet-mgr to true or false?
Sure @thomasferrandiz, here it is. --kube-subnet-mgr is not set at all, but I assume it defaults to false with this unit file.
[Unit]
Description=Flannel Overlay Network
Documentation=https://github.com/coreos/flannel
Wants=network-online.target
After=network.target network-online.target
[Service]
ExecStart=/usr/local/bin/flanneld -iface=ethX -etcd-endpoints=https://x.x.x.x:2379,https://x.x.x.x:2379,https://x.x.x.x:2379 -etcd-certfile=/xx/client-cert.pem -etcd-keyfile=/xx/client-key.pem -etcd-cafile=/xx/client-ca.pem --ip-masq
TimeoutStartSec=0
Restart=on-failure
LimitNOFILE=655536
Thanks for the file.
Indeed, the default value for --kube-subnet-mgr is false, which makes flannel use etcd to store its configuration.
Do you have a specific use case that requires running flannel with Kubernetes without setting kube-subnet-mgr to true?
If not, you could deploy flannel as a pod through the kube-flannel.yml manifest, which would avoid the issue since flannel won't use etcd.
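For reference, deploying via the manifest is typically a single kubectl command. A sketch is below; the exact manifest URL varies by release (the repository has moved from coreos/flannel to flannel-io/flannel), so check the release notes for the version you want:

```shell
# Deploy flannel as a DaemonSet with the kube subnet manager enabled.
# URL is an assumption based on the current flannel-io repository layout;
# pin it to a tagged release in production.
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
```

In this mode flannel reads its network configuration from the Kubernetes API via the kube-flannel ConfigMap instead of etcd, so the flanneld systemd unit (and the etcd flags) are no longer needed.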
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Expected Behavior
After etcd recovers, flannel should continue working normally.
Current Behavior
When etcd goes down, e.g. when all etcd nodes are rebooted at the same time, the flannel process starts using 200% CPU and stops updating the routing table. It never recovers after etcd comes back online until flannel is restarted.
I've noticed that the issue started with v0.21+. When I tested with v0.20, flannel simply crashed when etcd went down, and systemd restarted it until etcd was back online, at which point it continued working. With v0.21+, it just sits there using 200% CPU. I'm assuming whatever changed in the etcd connection logic gets stuck in an infinite loop somewhere. No logs are shown even with v=10.
Possible Solution
I tried to browse the changes, but there are just too many; I can't tell what went wrong or where.
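I can't point at the exact change, but the symptom (pegged CPU, no progress, no logs) is consistent with a reconnect loop that lost its sleep/backoff. A minimal Go sketch of what a well-behaved retry loop looks like; dial is a hypothetical stand-in for flannel's etcd connection attempt, not its actual code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// dial simulates an etcd connection attempt that fails the first three
// times (hypothetical stand-in; not flannel's actual code).
func dial(attempt int) error {
	if attempt < 3 {
		return errors.New("connection refused")
	}
	return nil
}

// connectWithBackoff retries dial with capped exponential backoff instead
// of busy-spinning; it returns the number of attempts made.
func connectWithBackoff(dial func(int) error) int {
	backoff := 10 * time.Millisecond
	const maxBackoff = 100 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := dial(attempt)
		if err == nil {
			return attempt + 1
		}
		fmt.Printf("attempt %d failed: %v; retrying in %v\n", attempt, err, backoff)
		// Without this sleep, the loop retries as fast as the CPU allows,
		// which would look exactly like the 200% CPU symptom reported here.
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	fmt.Println("connected after", connectWithBackoff(dial), "attempts")
}
```

If the v0.21 refactor replaced a crash-and-restart path (where systemd provided the backoff) with an internal retry loop missing the sleep, that would explain both the CPU usage and why restarting flanneld fixes it.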
Steps to Reproduce (for bugs)
Context
If you add new nodes while flannel is stuck, the routing tables don't get updated, and pods scheduled on the new nodes cannot communicate with the rest of the cluster. Also, all nodes show a flat usage of two CPU threads.
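For anyone trying to confirm the same symptom, the two signals above (stale routes, pegged CPU) can be checked with standard Linux tools; a sketch, assuming flanneld runs directly on the host as in my setup:

```shell
# Check flanneld's CPU usage (stuck state shows ~200%, i.e. two busy threads).
top -b -n 1 -p "$(pidof flanneld)"

# Compare the host routing table against the cluster's node subnets;
# routes for newly added nodes will be missing while flannel is stuck.
ip route show
```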
Your Environment
I am using Charmed Kubernetes from Canonical. I have not tested other distributions, but there is no reason it should not happen on others as well.