@rastislavs @YutaroHayakawa @harsimran-pabla PTAL when you have a moment and let me know if I am missing something that's causing this issue.
BGP Control Plane doesn't import routes, so you can't use it to establish node-to-node connectivity by meshing the nodes together. You can use the auto-direct-node-routes option to achieve the same goal (https://docs.cilium.io/en/stable/network/concepts/routing/#id3).
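For reference, a minimal sketch of what enabling that option might look like with Helm (the routingMode flag name varies by Cilium version, and the CIDR here is illustrative):

```sh
helm upgrade cilium cilium/cilium --namespace kube-system \
  --reuse-values \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16
```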
@YutaroHayakawa thanks for the feedback. From reading the BGP CP docs, I didn't realize this was expected behavior. I previously tested auto-direct-node-routes and that worked as expected, but it requires L2 adjacency among nodes. According to the docs, kube-router should be used for native routing. Is this still the case with BGP CP? I'm trying to understand why different BGP solutions are used to establish native end-to-end connectivity among nodes that are not L2 adjacent.
@YutaroHayakawa Sorry, I'm also confused about this. I've just today set up a new cluster with the BGP control plane enabled. The BGP sessions are established and I can see the routes to my worker node pods (10.244.x.0/24) in my leaf router's routing table. But the worker nodes do not have L2 adjacency. So why isn't this configuration sufficient for node-to-node connectivity when native routing is enabled?
@danehans @dhess Am I missing something? If we don't have L2 reachability between nodes, how can we reach the Pods on another node even if we exchange the route? Say NodeA has PodCIDR 10.0.0.0/24 and NodeIP 192.168.0.1, and NodeB learns the route 10.0.0.0/24 via 192.168.0.1. When NodeB tries to reach a Pod on NodeA with IP 10.0.0.1, it ARPs for 192.168.0.1, but 192.168.0.1 is outside NodeB's L2 domain, so the ARP never reaches NodeA and the traffic can't go anywhere.
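To illustrate the failure mode (hypothetical interface and the same addresses as above):

```sh
# On NodeB: the kernel only accepts an off-subnet next hop with the
# "onlink" flag, and even then ARP resolution for 192.168.0.1 fails
# because it is not on NodeB's L2 segment.
ip route add 10.0.0.0/24 via 192.168.0.1 dev eth0 onlink
ip neigh show 192.168.0.1
# 192.168.0.1 dev eth0 FAILED
```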
@YutaroHayakawa Perhaps I'm the one missing something and there's something I don't get about Cilium or how Kubernetes networking works, but in my case there is a BGP route reflector (https://networklessons.com/bgp/bgp-route-reflector) in my network whose leaves are the individual Cilium nodes. So NodeB in your scenario would get the route to NodeA's pods via the route reflector, not directly from NodeA.
In general, a route reflector doesn't modify the next-hop, so if the original route is advertised by NodeA, the next-hop is still NodeA. It's effectively the same as receiving the route directly from NodeA. Unless all of your network devices in the same AS are connected to the same route reflector (or all of your nodes are in the same L2 domain), you can't get node-to-node connectivity (https://notes.networklessons.com/bgp-ibgp-split-horizon-rule).
@YutaroHayakawa I'm confused by your comments, and I wonder if we're talking past each other here. If so, my apologies.
In any case, I'm running eBGP. The route reflector is AS 65300, NodeA is AS 65201, and NodeB is AS 65202.
When I see the word "route reflector", it implies iBGP; the eBGP equivalent is usually called a route server. The NetworkLessons article you mentioned also says:
Route reflectors (RR) are one method to get rid of the full-mesh of IBGP peers in your network.
What does your actual network topology look like? I suspect your issue is different from the original one. The original issue is about meshing the nodes with each other over BGP, but you seem to have a different topology.
Use case: multiple clusters that share the same L2 segment. autoDirectNodeRoutes: true can provide intra-cluster connectivity but not inter-cluster connectivity. If all nodes have a single network interface with a default route, the default gateway can resolve the destination pod IP to the appropriate node as a workaround for this issue. However, inter-cluster traffic then relies on an external ARP resolver. Additionally, the issue still exists if cluster nodes have separate interfaces for external traffic (which uses the default route) and internal traffic (inter-cluster).
I'd recommend using ClusterMesh + autoDirectNodeRoutes in that case, but I understand the point.
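For anyone landing here, a rough sketch of that combination, assuming the Cilium CLI (the contexts are placeholders):

```sh
# Enable ClusterMesh on both clusters, then connect them; intra-cluster
# pod routes are still installed directly via autoDirectNodeRoutes.
cilium clustermesh enable --context cluster1
cilium clustermesh enable --context cluster2
cilium clustermesh connect --context cluster1 --destination-context cluster2
```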
@YutaroHayakawa based on your above feedback and since https://github.com/cilium/cilium/pull/26195 merged, should this issue be closed?
Yep, thanks for your doc contribution!
Connectivity fails to establish between pods on different BGP CP nodes configured to advertise routes using exportPodCIDR. To reproduce:

Create a kind cluster:
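The exact config from the original report isn't shown here; a plausible equivalent with two workers, the default CNI disabled so Cilium can be installed, and the 10.244.0.0/16 pod subnet referenced below:

```sh
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
networking:
  disableDefaultCNI: true
  podSubnet: 10.244.0.0/16
EOF
```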
Install Cilium with BGP CP enabled:
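A sketch of the install, assuming Helm; flag names vary by Cilium version (e.g. routingMode=native vs. the older tunnel=disabled):

```sh
helm install cilium cilium/cilium --namespace kube-system \
  --set bgpControlPlane.enabled=true \
  --set routingMode=native \
  --set ipv4NativeRoutingCIDR=10.0.0.0/8 \
  --set ipam.mode=kubernetes
```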
Note: I'm unsure what to use for ipv4NativeRoutingCIDR; here I set it to 10.0.0.0/8. When I use only the pod network (10.244.0.0/16), CoreDNS fails to start because it can't connect to the kube-api service VIP.

Verify the install:
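For example, with the Cilium CLI:

```sh
cilium status --wait
```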
Get the node IPs to set the BGP Router ID:
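For example:

```sh
kubectl get nodes -o wide
```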
Annotate the nodes for BGP CP:
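A sketch of the annotation, assuming the cilium.io/bgp-virtual-router.{ASN} key; the ASN (64512) and node names are illustrative:

```sh
kubectl annotate node kind-worker \
  cilium.io/bgp-virtual-router.64512="router-id=<kind-worker-ip>"
kubectl annotate node kind-worker2 \
  cilium.io/bgp-virtual-router.64512="router-id=<kind-worker2-ip>"
```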
Apply the BGP peering policy with exportPodCIDR set:
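A representative policy sketch; the ASNs, peer address, and node selector are assumptions, not the exact values from the report:

```sh
kubectl apply -f - <<EOF
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: worker-nodes
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux
  virtualRouters:
    - localASN: 64512
      exportPodCIDR: true
      neighbors:
        - peerAddress: "10.0.0.1/32"
          peerASN: 64512
EOF
```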
Verify the status of the BGP peers:
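For example:

```sh
cilium bgp peers
```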
Run a test app (nginx) on each of the two worker nodes:
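A sketch pinning one nginx pod to each worker (pod and node names are illustrative):

```sh
kubectl run nginx-worker --image=nginx \
  --overrides='{"spec":{"nodeName":"kind-worker"}}'
kubectl run nginx-worker2 --image=nginx \
  --overrides='{"spec":{"nodeName":"kind-worker2"}}'
```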
Get the IPs of the test app pods:
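For example:

```sh
kubectl get pods -o wide
```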
Test connectivity:
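For example, curling one test pod from a client pod scheduled on the other node (substitute the pod IP returned by the previous step):

```sh
kubectl run curl-client --rm -it --restart=Never --image=curlimages/curl \
  --overrides='{"spec":{"nodeName":"kind-worker"}}' \
  -- curl -s --max-time 5 http://<nginx-worker2-pod-ip>
```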
The logs indicate the destinations are created for the podCIDRs:
Node routing tables are not updated with the BGP pod CIDRs:
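This can be checked from inside the kind nodes, e.g.:

```sh
# No route for the peer worker's pod CIDR (e.g. 10.244.x.0/24) appears:
docker exec kind-worker ip route show
docker exec kind-worker2 ip route show
```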