Inconsistent behavior between public-ipv6 annotations and public-ipv6 cli option

dvgt commented 8 months ago

Expected Behavior

The node annotation flannel.alpha.coreos.com/public-ipv6 or flannel.alpha.coreos.com/public-ipv6-overwrite (if set), should have the same impact as setting the --public-ipv6 option of the flanneld binary.

Current Behavior

When setting one or both annotations to a specific address and not specifying the --public-ipv6 cli option, the external address used by flannel seems to always be the first address of the interface which also has the address from the annotations.
When using the --public-ipv6 cli option, the external address used by flannel is always the address given by the cli option.

Possible Solution

Steps to Reproduce (for bugs)

Setup a dual-stack cluster with at least two nodes (we use rancher RKE2) using canal as network plugin.

Example config for a master node (/etc/rancher/rke2/config.yaml):

server: https://10.50.147.218:9345
container-runtime-endpoint: /run/containerd/containerd.sock
write-kubeconfig: /etc/rancher/rke2/rke2.yaml
write-kubeconfig-mode: 0644
debug: False
tls-san:
  - 10.50.147.218
  - 2001:x:x:x:x:x:x:218
data-dir: /var/lib/rancher/rke2
cluster-cidr: 10.44.0.0/16,2001:x:x:y::/108
service-cidr: 10.43.0.0/16,2001:x:x:z::/112
service-node-port-range: 30000-32767
cluster-dns: 10.43.0.10
cluster-domain: cluster.local
node-name: master-node
node-external-ip: 10.50.147.218,2001:x:x:x:x:x:x:218
node-ip: 10.50.147.218,2001:x:x:x:x:x:x:218
node-taint:
  - node-role.kubernetes.io/etcd=true:NoExecute
selinux: False
disable:
  - rke2-ingress-nginx
disable-cloud-controller: False
etcd-expose-metrics: True
etcd-disable-snapshots: False
etcd-snapshot-name: etcd-snapshot
etcd-snapshot-schedule-cron: 0 */1 * * *
etcd-snapshot-retention: 12
etcd-snapshot-dir: /var/lib/rancher/rke2/server/db/snapshots
kube-apiserver-arg:
  - kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - disable-admission-plugins=AlwaysPullImages
  - authorization-mode=Node,RBAC enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction,Priority,TaintNodesByCondition,PersistentVolumeClaimResize
kube-controller-manager-arg:
  - node-cidr-mask-size-ipv4=24
  - node-cidr-mask-size-ipv6=116
disable-scheduler: False
kubelet-arg:
  - fail-swap-on=true
  - max-pods=110
disable-kube-proxy: False
cni: canal

Example config for a worker node (/etc/rancher/rke2/config.yaml)

server: https://10.50.147.221:9345
token: <MASTER_TOKEN>
container-runtime-endpoint: /run/containerd/containerd.sock
debug: False
data-dir: /var/lib/rancher/rke2
node-name: worker-node
node-external-ip: 10.50.147.105,2001:x:x:x:x:x:x:105
node-ip: 10.50.147.105,2001:x:x:x:x:x:x:105
selinux: False
kubelet-arg:
  - fail-swap-on=true
  - max-pods=110
disable-kube-proxy: False

kubectl get cm -n kube-system rke2-canal-config -o yaml
apiVersion: v1
data:
  canal_iface: ""
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "datastore_type": "kubernetes",
          "nodename": "__KUBERNETES_NODE_NAME__",
          "mtu": __CNI_MTU__,
          "ipam": {
              "type": "host-local",
              "ranges": [
                  [
                      {
                          "subnet": "usePodCidr"
                      }
                  ],
                  [
                      {
                          "subnet": "usePodCidrIPv6"
                      }
                  ]
              ]
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        }
      ]
    }
  masquerade: "true"
  net-conf.json: |
    {
      "Network": "10.44.0.0/16",
      "IPv6Network": "2001:x:x:y::/108",
      "EnableIPv6": true,
      "Backend": {
        "Type": "vxlan"
      }
    }
  typha_service_name: none
  veth_mtu: "1450"
[...]

On the worker node, add an extra IPv6 address to the interface that is used for inter-pod traffic ip a add 2001:x:x:x:x:x:x:229/64 dev ens192
Delete the canal pod that's running on the worker node kubectl delete pod rke2-canal-... -n kube-system
Observe that the flannel logs of the worker node mention the wrong external address (2001:x:x:x:x:x:x:229 iso 2001:x:x:x:x:x:x:105)
```
kubectl logs rke2-canal-..
```

I1004 16:30:13.143547 1 main.go:543] Found network config - Backend type: vxlan I1004 16:30:13.143620 1 match.go:206] Determining IP address of default interface I1004 16:30:13.146384 1 match.go:259] Using interface with name ens192 and address 10.50.147.105 I1004 16:30:13.146440 1 match.go:262] Using interface with name ens192 and v6 address 2001:x:x:x:x:x:x:229 I1004 16:30:13.146463 1 match.go:281] Defaulting external address to interface address (10.50.147.105) I1004 16:30:13.146486 1 match.go:294] Defaulting external v6 address to interface address (2001:x:x:x:x:x:x:229)


5. Deploy a pod on each node using the overlay network and start a ping between them

NODE1="master-node" NODE2="worker-node" for NODE in ${NODE1} ${NODE2}; do \ kubectl run --restart=Never --overrides="{ \"spec\": { \"nodeSelector\": { \"kubernetes.io/hostname\": \"${NODE}\" } } }" --image=docker.io/library/busybox:1.28 busybox-${NODE} -- sh -c 'sleep 36000'; \ done sleep 10 IPV6POD2=$(kubectl get pods busybox-${NODE2} -o custom-columns=IP:".status.podIPs[1].ip" --no-headers); echo ${IPV6POD2} kubectl exec -it busybox-${NODE1} -- ping -6 ${IPV6POD2}


6. Observe the inter-node traffic on a node

tcpdump -i ens192 -pnnev ip6 16:37:52.833187 aa:bb:cc:dd:ee:01 > aa:bb:cc:dd:ee:02, ethertype IPv6 (0x86dd), length 188: (hlim 64, next-header UDP (17) payload length: 134) 2001:x:x:x:x:x:x:218.59599 > 2001:x:x:x:x:x:x:105.8472: [bad udp cksum 0x187f -> 0xcd86!] OTV, flags [I] (0x08), overlay 0, instance 1 86:8a:9a:80:a1:c5 > f6:18:3e:eb:4d:18, ethertype IPv6 (0x86dd), length 118: (flowlabel 0xda0ae, hlim 63, next-header ICMPv6 (58) payload length: 64) 2001:x:x:y::301f > 2001:x:x:y::4023: [icmp6 sum ok] ICMP6, echo request, seq 7 16:37:52.833575 aa:bb:cc:dd:ee:02 > aa:bb:cc:dd:ee:01, ethertype IPv6 (0x86dd), length 188: (hlim 64, next-header UDP (17) payload length: 134) 2001:x:x:x:x:x:x:229.54910 > 2001:x:x:x:x:x:x:218.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1 f6:18:3e:eb:4d:18 > 86:8a:9a:80:a1:c5, ethertype IPv6 (0x86dd), length 118: (flowlabel 0xd15b5, hlim 63, next-header ICMPv6 (58) payload length: 64) 2001:x:x:y::4023 > 2001:x:x:y::301f: [icmp6 sum ok] ICMP6, echo reply, seq 7


The echo reply is sent with source address `2001:x:x:x:x:x:x:229` iso `2001:x:x:x:x:x:x:105` on the worker node.

## Context
<!--- How has this issue affected you? What are you trying to accomplish? -->
<!--- Providing context helps us come up with a solution that is most useful in the real world -->
We have a dual-stack cluster setup with rancher RKE2. ON the interface that is used for inter-node kubernetes traffic there are multiple IPv6 addresses. We want to specifically use one of those addresses, not necessarily the first one, so that inter-node kubernetes packets have that source address. In rancher RKE2, flannel (as part of canal) is deployed via a rancher provided helm chart, which we don't want to manually modify. The only option to force the use of a specific public IPv6 address is to set the public-ipv6 annotations, which don't seem to have the expected behavior.

## Your Environment
<!--- Include as many relevant details about the environment you experienced the bug in -->
* Flannel version: rancher/hardened-flannel:v0.22.0-build20230609
* Backend used (e.g. vxlan or udp): vxlan
* Etcd version: rancher/hardened-etcd:v3.5.7-k3s1-build20230406 
* Kubernetes version (if used): v1.25.11+rke2r1
* Operating System and version: Oracle Linux Server 8.8

Edit: Added rke2-canal-config config map data.

rbrtbnfgl commented 8 months ago

Hi thanks for reporting this. Checking from the code the public-ip is used only to select the interface and what you are saying is right. We can rework the code to force to use the same defined IP in case of multiple IPs on that interface. Edit: I think I misunderstood what you wrote. The annotation public-ip is not used by flannel to select the IP but it's configured by the CNI itself. The flannel configuration is specified only by the cli options on the current implementation and not from the annotations. Edit 2: you can configure the helm chart values with RKE2 without editing the chart itself.

dvgt commented 8 months ago

Thanks for your response. Let me clarify a bit more.

The public-ipv6 annotation seems to be always set correctly. While this is inconsistent with the real IP used by flannel at the moment, I don't think that plays a role for us. What would be more helpful is that setting the public-ipv6-overwrite annotation would force flannel to use the IP from the annotation iso the first IP of the interface. In other words, we expected that the public-ipv6-overwrite annotation would do the same as the --public-ipv6 cli option, which would then make the value of the public-ipv6 annotation also match.

I'm guessing this is the behavior that you mention which could updated in the code, right? That would be very helpful, because this is a setting that we can't control with helm chart values in RKE2. Flannel is deployed as container that calls the /opt/bin/flanneld directly, so there is no potential to customize that per node.

kubectl describe ds -n kube-system rke2-canal
[...]
   kube-flannel:
    Image:      rancher/hardened-flannel:v0.22.0-build20230609-custom1
    Port:       <none>
    Host Port:  <none>
    Command:
      /opt/bin/flanneld
    Args:
      --ip-masq
      --kube-subnet-mgr
    Environment:
      POD_NAME:           (v1:metadata.name)
      POD_NAMESPACE:      (v1:metadata.namespace)
      FLANNELD_IFACE:    <set to the key 'canal_iface' of config map 'rke2-canal-config'>  Optional: false
      FLANNELD_IP_MASQ:  <set to the key 'masquerade' of config map 'rke2-canal-config'>   Optional: false
    Mounts:
      /etc/kube-flannel/ from flannel-cfg (rw)
      /run/xtables.lock from xtables-lock (rw)
  Volumes:
   flannel-cfg:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rke2-canal-config
    Optional:  false

rbrtbnfgl commented 8 months ago

Yes you are right I'm inspecting the code to check if there are a possible solution for your issue.

rbrtbnfgl commented 8 months ago

Ok you are right. How you wrote on the issue there is a public-ip-overwrite but no public-ipv6-overwrite we didn't add it when the ipv6 support was introduced. Edit: Reading from the docs the purpose of the overwrite is for the destination for the VXLAN tunnel not the source. You can force the public-ip using the env variable FLANNELD_PUBLIC_IPV6 I can try to check if it's feasible to use different environment settings for canal in rke2.

dvgt commented 8 months ago

Initially, I found a reference to public-ipv6-overwrite in the below code reference, which is why we started using it. This is an easy way to modify the behavior per node. https://github.com/flannel-io/flannel/blob/44f55847424b67239793ff3c74a3294e401d8895/pkg/subnet/kube/annotations.go#L68

Regarding the FLANNELD_PUBLIC_IPV6 environment variable:

The only way I think this environment variable can be set from within the deamonset is to expose pod information via a fieldRef (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.27/#envvarsource-v1-core), but in a dual-stack cluster, there is no field that just contains the IPv6 address: status.podIP contains the IPv4 address and status.podIPs contains a list of IPs. So I don't see a possibility there to make that work.
We would rather not change anything to the RKE2 deployed components (helm charts or components post-deploy) except for helm chart values. So if you would know or be able implement a method that accomplishes this, that would be great!

Regarding the meaning of the overwrite annotation:

What I understand from your explanation is that the annotation on nodeX is used as destination address by nodeY for inter-node packets send by nodeY to nodeX, right? I tested this for IPv4, but I don't really observe this behavior:

Set public-ip-overwrite on k8s-6-5 (10.50.147.106) to 10.50.147.229. This IP is also present on the same interface as 10.50.147.106
```
# kubectl annotate node k8s-6-5 --overwrite "flannel.alpha.coreos.com/public-ip-overwrite=10.50.147.229"
```

[k8s-6-5 ~]# ip a show ens192 2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether aa:bb:cc:dd:ee:01 brd ff:ff:ff:ff:ff:ff altname enp11s0 inet 10.50.147.106/24 brd 10.50.147.255 scope global noprefixroute ens192 valid_lft forever preferred_lft forever inet 10.50.147.229/32 scope global ens192 valid_lft forever preferred_lft forever inet6 2001:x:x:x:x:x:x:229/128 scope global nodad deprecated valid_lft forever preferred_lft 0sec inet6 2001:x:x:x:x:x:x:106/64 scope global noprefixroute valid_lft forever preferred_lft forever inet6 fe80::x:x:x:x/64 scope link noprefixroute valid_lft forever preferred_lft forever

[k8s-6-5 ~]# ip route default via 10.50.147.1 dev ens192 10.0.0.0/8 via 10.50.147.1 dev ens192 proto static metric 100 10.50.147.0/24 dev ens192 scope link

- Delete the canal pod running on node k8s-6-5.
- Logs of the new canal pod running on node k8s-6-5 show the IPv4 public-ip was overwritten.

I1019 16:06:51.799238 1 main.go:232] Created subnet manager: Kubernetes Subnet Manager - k8s-6-5 I1019 16:06:51.799246 1 main.go:235] Installing signal handlers I1019 16:06:51.799445 1 main.go:543] Found network config - Backend type: vxlan I1019 16:06:51.799515 1 match.go:206] Determining IP address of default interface I1019 16:06:51.802271 1 match.go:259] Using interface with name ens192 and address 10.50.147.106 I1019 16:06:51.802338 1 match.go:262] Using interface with name ens192 and v6 address 2001:x:x:x:x:x:x:229 I1019 16:06:51.802361 1 match.go:281] Defaulting external address to interface address (10.50.147.106) I1019 16:06:51.802384 1 match.go:294] Defaulting external v6 address to interface address (2001:x:x:x:x:x:x:229) I1019 16:06:51.802478 1 vxlan.go:141] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false I1019 16:06:51.803638 1 kube.go:386] Overriding public ip with '10.50.147.229' from node annotation 'flannel.alpha.coreos.com/public-ip-overwrite' <=== W1019 16:06:51.900358 1 main.go:596] no subnet found for key: FLANNEL_SUBNET in file: /run/flannel/subnet.env I1019 16:06:51.900393 1 main.go:482] Current network or subnet (10.44.0.0/16, 10.44.4.0/24) is not equal to previous one (0.0.0.0/0, 0.0.0.0/0), trying to recycle old iptables rules I1019 16:06:52.003024 1 main.go:357] Setting up masking rules W1019 16:06:52.007518 1 main.go:631] no subnet found for key: FLANNEL_IPV6_SUBNET in file: /run/flannel/subnet.env I1019 16:06:52.007548 1 main.go:508] Current ipv6 network or subnet (2001:x:x:x::/108, 2001:x:x:x::4000/116) is not equal to previous one (::/0, ::/0), trying to recycle old ip6tables rules

- Ping between pod on k8s-6-5 and other node and dump traffic on k8s-6-5.

[k8s-6-5 ~]# tcpdump -i ens192 -pnnev 'host 10.50.147.105' 18:13:25.938561 aa:bb:cc:dd:ee:01 > aa:bb:cc:dd:ee:02, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 1648, offset 0, flags [none], proto UDP (17), length 134) 10.50.147.105.34695 > 10.50.147.106.8472: OTV, flags [I] (0x08), overlay 0, instance 1 e2:47:8f:a7:72:20 > 86:ea:5b:ca:6d:dc, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 53072, offset 0, flags [DF], proto ICMP (1), length 84) 10.44.3.31 > 10.44.4.35: ICMP echo request, id 3328, seq 79, length 64

18:13:25.938755 aa:bb:cc:dd:ee:02 > aa:bb:cc:dd:ee:01, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 11106, offset 0, flags [none], proto UDP (17), length 134) 10.50.147.106.57518 > 10.50.147.105.8472: OTV, flags [I] (0x08), overlay 0, instance 1 86:ea:5b:ca:6d:dc > e2:47:8f:a7:72:20, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 44374, offset 0, flags [none], proto ICMP (1), length 84) 10.44.4.35 > 10.44.3.31: ICMP echo reply, id 3328, seq 79, length 64


The node with IP 10.50.147.105 still uses 10.50.147.106 as destination. 
Also, the IPv4 used by k8s-6-5 is also still the first IP on the interface, despite the `Overriding public ip` log in kube-flannel.

Am I missing something here?

rbrtbnfgl commented 8 months ago

I found the original issue #712 I have to check why flannel is behaving differently now, it could be related to different code modification done during the years.

dvgt commented 8 months ago

Any update on this?

rbrtbnfgl commented 8 months ago

Hi I tested it and you are right. The override seems to be noticed when flannel starts but the actual value is not updated. I am trying to understand if I could find a fix for it.

dvgt commented 5 months ago

Just to keep this active, we would still like to have a fix for this. We're happy to help testing :).

dvergotes commented 1 month ago

Just tested the fix and it works. Thanks for the effort. Note: I had to revert https://github.com/flannel-io/flannel/commit/2092b830ce0ef212ca38dc451143f844de9d58f5 See new issue: https://github.com/flannel-io/flannel/issues/1968

dvgt commented 1 month ago

Fixed, thanks!

flannel-io / flannel