flannel-io / flannel

flannel is a network fabric for containers, designed for Kubernetes
Apache License 2.0

[v0.25.2 regression] `flannel.alpha.coreos.com/public-ip-overwrite` fails with `error looking up interface XXX.XXX.XXX.XXX: No interface with given IP found` after restarting a node #1978

Closed by AkihiroSuda 3 weeks ago

AkihiroSuda commented 1 month ago

Expected Behavior

`flannel.alpha.coreos.com/public-ip-overwrite` should continue to work after restarting a node.

Current Behavior

Flannel v0.25.2 fails with `error looking up interface XXX.XXX.XXX.XXX: No interface with given IP found` after restarting a node, even when `flannel.alpha.coreos.com/public-ip-overwrite` is specified to allow XXX.XXX.XXX.XXX.

It was working in Flannel v0.25.1.

Possible Solution

Revert:

Steps to Reproduce (for bugs)

  1. Check out https://github.com/rootless-containers/usernetes/tree/gen2-v20240527.0

  2. Apply the following update:

    diff --git a/Makefile b/Makefile
    index 0e04f0e..a9c6e19 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -150,4 +150,4 @@ kubeadm-reset:
    
    .PHONY: install-flannel
    install-flannel:
    -       $(NODE_SHELL) kubectl apply -f https://github.com/flannel-io/flannel/releases/download/v0.25.1/kube-flannel.yml
    +       $(NODE_SHELL) kubectl apply -f https://github.com/flannel-io/flannel/releases/download/v0.25.2/kube-flannel.yml
  3. Initialize a node. This step should be executed with Rootless Docker, but Rootful Docker is fine too.

    $ make up kubeadm-init install-flannel kubeconfig
    $ export KUBECONFIG=$(pwd)/kubeconfig
    $ kubectl get pods -A
  4. Restart the node, and see that the kube-flannel container fails:

    $ make down
    $ make up
    $ kubectl logs -n kube-flannel kube-flannel-ds-hrqt4 
    Defaulted container "kube-flannel" out of: kube-flannel, install-cni-plugin (init), install-cni (init)
    I0527 02:14:04.874324       1 main.go:211] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
    W0527 02:14:04.874476       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
    I0527 02:14:04.897430       1 kube.go:139] Waiting 10m0s for node controller to sync
    I0527 02:14:04.897737       1 kube.go:455] Starting kube subnet manager
    I0527 02:14:04.902721       1 kube.go:476] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.244.0.0/24]
    I0527 02:14:05.899116       1 kube.go:146] Node controller sync successful
    I0527 02:14:05.899197       1 main.go:231] Created subnet manager: Kubernetes Subnet Manager - u7s-suda-ws01
    I0527 02:14:05.899211       1 main.go:234] Installing signal handlers
    I0527 02:14:05.900566       1 main.go:452] Found network config - Backend type: vxlan
    I0527 02:14:05.908250       1 kube.go:655] List of node(u7s-suda-ws01) annotations: map[string]string{"flannel.alpha.coreos.com/backend-data":"{\"VNI\":1,\"VtepMAC\":\"b2:ad:96:1b:27:b2\"}", "flannel.alpha.coreos.com/backend-type":"vxlan", "flannel.alpha.coreos.com/kube-subnet-manager":"true", "flannel.alpha.coreos.com/public-ip":"192.168.60.11", "flannel.alpha.coreos.com/public-ip-overwrite":"192.168.60.11", "kubeadm.alpha.kubernetes.io/cri-socket":"unix:///var/run/containerd/containerd.sock", "node.alpha.kubernetes.io/ttl":"0", "volumes.kubernetes.io/controller-managed-attach-detach":"true"}
    I0527 02:14:05.908335       1 match.go:74] Searching for interface using 192.168.60.11
    E0527 02:14:05.908814       1 main.go:287] Failed to find any valid interface to use: error looking up interface 192.168.60.11: No interface with given IP found

Context

Regression in v0.25.2

Your Environment

AkihiroSuda commented 1 month ago

Probably related to:

tanvp112 commented 1 month ago

I can confirm this issue when using v0.25.2 on a KIND cluster. I have reverted to v0.25.1 for the time being.

rbrtbnfgl commented 1 month ago

I left this open. I changed how the parameter is passed; it will be fixed in the next release.

tanvp112 commented 1 month ago

@rbrtbnfgl, @AkihiroSuda,

Can I ask for the definition of "flannel.alpha.coreos.com/node-public-ip":

Thanks.

rbrtbnfgl commented 1 month ago

It can be the addressable IP, but generally it should be used when a node has multiple interfaces, to help Flannel select the right one. This issue was introduced because the interface lookup was using the public-ip annotation. When public-ip-overwrite is set, public-ip is replaced with the addressable IP, which may not be an IP assigned to the node itself, so the lookup failed because no interface had that IP. I moved the logic to a new annotation, so public-ip behaves as before and public-ip-overwrite works as expected. If you don't need to tell Flannel which interface to use via a node-specific IP, you can omit the new annotation and Flannel will choose the interface as before.
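
To illustrate the explanation above, here is a minimal Go sketch of the selection logic. It is not Flannel's actual code. The `iface` type, the `lookupByIP` and `selectInterface` helpers, the `10.0.2.100` address, and the "first interface" fallback are all assumptions made for illustration; the annotation keys are the ones quoted in this thread. It shows why a lookup keyed on `public-ip` fails once that value has been overwritten with an address no local interface carries, and why a separate annotation avoids the problem.

```go
package main

import (
	"errors"
	"fmt"
)

// iface is a simplified, hypothetical stand-in for a network interface.
type iface struct {
	Name  string
	Addrs []string
}

var errNoInterface = errors.New("No interface with given IP found")

// lookupByIP returns the first interface that carries the given IP.
func lookupByIP(ifaces []iface, ip string) (iface, error) {
	for _, i := range ifaces {
		for _, a := range i.Addrs {
			if a == ip {
				return i, nil
			}
		}
	}
	return iface{}, fmt.Errorf("error looking up interface %s: %w", ip, errNoInterface)
}

// selectInterface looks up the interface by the IP stored under key, if that
// annotation is present. Otherwise it falls back to the first interface,
// which stands in here for "choose the interface as before".
func selectInterface(ifaces []iface, annotations map[string]string, key string) (iface, error) {
	if ip, ok := annotations[key]; ok && ip != "" {
		return lookupByIP(ifaces, ip)
	}
	if len(ifaces) == 0 {
		return iface{}, errNoInterface
	}
	return ifaces[0], nil
}

func main() {
	// Hypothetical node: one interface whose address differs from the
	// externally addressable IP set through public-ip-overwrite.
	ifaces := []iface{{Name: "eth0", Addrs: []string{"10.0.2.100"}}}
	annotations := map[string]string{
		"flannel.alpha.coreos.com/public-ip":           "192.168.60.11",
		"flannel.alpha.coreos.com/public-ip-overwrite": "192.168.60.11",
	}

	// Keying the lookup on public-ip fails: no interface has 192.168.60.11.
	_, err := selectInterface(ifaces, annotations, "flannel.alpha.coreos.com/public-ip")
	fmt.Println(err)

	// Keying it on a separate, unset annotation falls back to the default.
	got, err := selectInterface(ifaces, annotations, "flannel.alpha.coreos.com/node-public-ip")
	fmt.Println(got.Name, err)
}
```

In this sketch the first call reproduces the error text from the log above, while the second returns `eth0` because the separate annotation is absent and the fallback path is taken.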

tanvp112 commented 1 month ago

@rbrtbnfgl, thanks for the reply. In the case of KIND there is only one interface (eth0), but v0.25.2 also failed with "No interface with given IP found" after the host restarted. Note that this doesn't happen with v0.25.1. I suspect it could be something else, because this shouldn't happen based on your reply.

rbrtbnfgl commented 1 month ago

This is how it should work with the fix. With the current release it does not work, for the reason I explained. I think we'll release a new version with the right behavior.

AkihiroSuda commented 3 weeks ago

Thanks, v0.25.3 works fine