eripa closed this issue 1 year ago.
Hi @eripa, thanks so much for the detailed bug report!
One question: if you remove the hybrid SNAT/DSR mode and just keep it to SNAT only, essentially this line:
mode: hybrid
... would it start to work then?
Thanks @borkmann, unfortunately this doesn't make any difference for this issue. I've tried SNAT, DSR and hybrid with the same results.
Ok, with regards to your description of the neighbor issue: you observe this happening on the worker node which contains the backend (so not the LB node), is that correct? Could you run cilium monitor -t drop from the Cilium Pod to see potential drop reasons? Do you see in tcpdump any attempts to resolve the 192.168.5.157 client?
In this case the backend is the LB, since the backend pod is hosted on the node and the node is announcing the LoadBalancer IP.
Below is the router's routing table, where you can see the LoadBalancer IP (192.168.20.53/32) being announced and available at the node IP (192.168.5.20).
The cilium monitor doesn't show any drops; there are also no policies involved here, so I wouldn't expect it to drop anything. There are a few sporadic RouterSolicitation drops, but those are IPv6 specific (not using IPv6).
Finally, I don't see any ARP lookups when hitting the LoadBalancer, but I do see them when I hit the backend Pod IP directly. And once that's done, the LoadBalancer IP will work until the ARP cache expires.
eric@ubnt:~$ show ip route bgp
IP Route Table for VRF "default"
B *> 10.244.0.0/24 [200/0] via 192.168.5.20, eth1, 03:03:27
B *> 192.168.20.1/32 [200/0] via 192.168.5.20, eth1, 03:03:27
B *> 192.168.20.5/32 [200/0] via 192.168.5.20, eth1, 03:03:27
B *> 192.168.20.50/32 [200/0] via 192.168.5.20, eth1, 03:03:27
B *> 192.168.20.53/32 [200/0] via 192.168.5.20, eth1, 03:03:27
Thanks for the detailed report. I think we were able to figure out what is going on.
If we first look at the PCAPs of the scenario where LB traffic doesn't work, I see the following:
If we then look at the PCAPs for the Pod IP where it does work, I see the following:
Now that the node has the ARP record for the client, the LB works, since the node will at that point start using the ARP record for return traffic for the LB VIP as well.
So we have 3 issues here:
My suggestion to resolve this issue is to make a dedicated CIDR for clients, let's say 192.168.6.0/24; this should force traffic in both directions to go via the router. Then make sure return traffic can reach the clients, which should be easier to diagnose in that setup since you won't have packets bypassing your router.
Let's keep this issue as a bug report for the incorrect handling of return traffic. Some technical details:
After discussing it with @borkmann, the conclusion is that the return traffic in this case will be processed by rev_nodeport_lb{4,6}. Here we perform a FIB lookup; we suspect that this results in a BPF_FIB_LKUP_RET_NO_NEIGH response when no ARP entries exist. At that moment rev_nodeport_lb{4,6} will send traffic back to the neighbor from which we received the traffic in the first place, assuming it can handle the return traffic. It seems the more correct thing to do would be to call redirect_direct_v{4,6}, which will invoke the redirect_neigh helper in this case, which does perform the ARP lookup.
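To make the distinction concrete, below is a minimal standalone tc/BPF sketch (not Cilium's actual rev_nodeport_lb{4,6} or redirect_direct_v{4,6} code; the function name handle_return_traffic and the overall structure are invented for illustration). It shows the idea of falling back to the bpf_redirect_neigh() helper when bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH, so the kernel performs the neighbor (ARP) resolution instead of the program sending the packet back to whichever neighbor it came from:

/* Hedged sketch only: a simplified tc/BPF program, not Cilium's datapath. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

static __always_inline int handle_return_traffic(struct __sk_buff *skb,
                                                 __be32 saddr, __be32 daddr)
{
    struct bpf_fib_lookup fib = {
        .family   = AF_INET,
        .ifindex  = skb->ingress_ifindex,
        .ipv4_src = saddr,
        .ipv4_dst = daddr,
    };
    long ret = bpf_fib_lookup(skb, &fib, sizeof(fib), 0);

    switch (ret) {
    case BPF_FIB_LKUP_RET_SUCCESS:
        /* Route and neighbor entry found: rewrite MACs and redirect. */
        if (bpf_skb_store_bytes(skb, 0, fib.dmac, ETH_ALEN, 0) < 0 ||
            bpf_skb_store_bytes(skb, ETH_ALEN, fib.smac, ETH_ALEN, 0) < 0)
            return TC_ACT_SHOT;
        return bpf_redirect(fib.ifindex, 0);
    case BPF_FIB_LKUP_RET_NO_NEIGH:
        /* Route exists but no ARP entry: let the kernel resolve the
         * neighbor on the egress device instead of guessing a MAC or
         * bouncing the packet back to the ingress neighbor. */
        return bpf_redirect_neigh(fib.ifindex, NULL, 0, 0);
    default:
        /* Anything else: hand the packet to the regular stack. */
        return TC_ACT_OK;
    }
}

SEC("tc")
int redirect_return_traffic(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    /* Only handle plain IPv4; pass everything else up the stack. */
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;
    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_OK;

    return handle_return_traffic(skb, ip->saddr, ip->daddr);
}

char _license[] SEC("license") = "GPL";

The relevant behaviour here is bpf_redirect_neigh(): when it is called with a NULL nexthop parameter, the kernel does its own FIB and neighbor lookup for the packet, so delivery of the return traffic does not depend on an ARP entry that happened to be created by earlier traffic from the client.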
Thanks a lot for looking into this and providing all the details and help @dylandreimerink, I really appreciate it. I moved the Kubernetes nodes over to a different subnet (10.0.5.0/24, with the LBs on 10.0.20.0/24), and this cleared up the issue, I think. There are no longer any clients listed in the Kubernetes node's ARP cache, and there are no ARP lookups. So I believe I'm now running properly on L3 only.
My problem is solved but please feel free to keep the issue open as needed.
Here's a quick tcpdump of the DNS request on the Kubernetes node side now:
11:16:10.398689 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [S], seq 505666507, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1548694393 ecr 0,sackOK,eol], length 0
11:16:10.398778 lxceeafc8fbb3d2 In IP 10.244.0.154.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
11:16:10.398807 enp2s0 Out IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
11:16:10.399774 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 1, win 2053, options [nop,nop,TS val 1548694395 ecr 2074525835], length 0
11:16:10.416055 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [P.], seq 1:42, ack 1, win 2053, options [nop,nop,TS val 1548694411 ecr 2074525835], length 41 29579+ [1au] A? debian.org. (39)
11:16:10.416120 lxceeafc8fbb3d2 In IP 10.244.0.154.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
11:16:10.416156 enp2s0 Out IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
11:16:10.416697 lxceeafc8fbb3d2 In IP 10.244.0.154.53 > 192.168.5.157.52932: Flags [P.], seq 1:90, ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 89 29579 3/0/1 A 130.89.148.77, A 128.31.0.62, A 149.20.4.15 (87)
11:16:10.416738 enp2s0 Out IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [P.], seq 1:90, ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 89 29579 3/0/1 A 130.89.148.77, A 128.31.0.62, A 149.20.4.15 (87)
11:16:10.417063 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
11:16:10.418150 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [F.], seq 42, ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
11:16:10.419379 lxceeafc8fbb3d2 In IP 10.244.0.154.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
11:16:10.419411 enp2s0 Out IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
11:16:10.420221 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 91, win 2051, options [nop,nop,TS val 1548694415 ecr 2074525856], length 0
and here is the same request on the router:
19:16:10.398087 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [S], seq 505666507, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1548694393 ecr 0,sackOK,eol], length 0
19:16:10.398316 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [S], seq 505666507, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1548694393 ecr 0,sackOK,eol], length 0
19:16:10.398622 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
19:16:10.398786 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
19:16:10.399214 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 1, win 2053, options [nop,nop,TS val 1548694395 ecr 2074525835], length 0
19:16:10.399408 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 1, win 2053, options [nop,nop,TS val 1548694395 ecr 2074525835], length 0
19:16:10.415922 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
19:16:10.416112 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
19:16:10.417662 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [F.], seq 42, ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
19:16:10.417834 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [F.], seq 42, ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
19:16:10.419187 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
19:16:10.419359 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
and finally the client:
11:16:10.249998 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [S], seq 505666507, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1548694393 ecr 0,sackOK,eol], length 0
11:16:10.250989 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
11:16:10.251189 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 1, win 2053, options [nop,nop,TS val 1548694395 ecr 2074525835], length 0
11:16:10.267714 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [P.], seq 1:42, ack 1, win 2053, options [nop,nop,TS val 1548694411 ecr 2074525835], length 41 29579+ [1au] A? debian.org. (39)
11:16:10.268309 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
11:16:10.268706 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [P.], seq 1:90, ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 89 29579 3/0/1 A 130.89.148.77, A 128.31.0.62, A 149.20.4.15 (87)
11:16:10.268783 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
11:16:10.269669 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [F.], seq 42, ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
11:16:10.271607 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
11:16:10.271895 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 91, win 2051, options [nop,nop,TS val 1548694415 ecr 2074525856], length 0
Thanks for testing; yes, this clears it up given that traffic now goes via the default gateway. Beginning of January, I'm planning to consolidate most of the FIB functionality to address issues such as this one.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
@eripa The https://github.com/cilium/cilium/pull/23884 PR got merged, could you recheck with master if your issue is addressed? Thanks for your help!
Is there an existing issue for this?
What happened?
Hello,
I have an issue with Cilium running with direct routing with the BGP control plane in kube-proxy-free mode. Please note that I'm using a pre-release version of Cilium in order to be able to use the BGP control plane to announce LoadBalancer IPs, which isn't part of v1.13-rc3 (quay.io/cilium/cilium-ci:dab8723c01c94998fd082ae1630a58f62f19658f). I did experience the same issue with the MetalLB based approach, and upgraded in the hope that it would improve my situation. Once v1.13-rc4 is out I can switch to that version.
I have a suspicion that it's related to ARP and neighbour management (see ip neigh show below), but I'm not sure. If I remove the neighbour (sudo ip neigh del 192.168.5.157 lladdr 64:4b:f0:02:00:b3 dev enp2s0) it stops working in a seemingly identical way. When the client isn't in the node's neighbour list, and traffic to the LoadBalancer is not working, any query to the Pod IP will work (and make the LoadBalancer work again, adding the client to the neighbour list). Please advise. Thank you! :pray:
Edit: a similar setup, but using k3s, MetalLB and kube-proxy with iptables, works fine, with LoadBalancer IPs always being reachable.
Problem statement
Overall the setup seems to work, but LoadBalancer access stops working after a period of inactivity (seemingly as sessions expire), until a new session is established with the node in some other way, like SSH to the node or curl/dig to a Pod IP on the same Kubernetes node.
If any session has been established outside of hitting the LoadBalancer directly, then traffic works fine; but if the client goes offline for a while and then comes back, the client can no longer reach the LoadBalancer IP. It only starts working again after that client has initiated another session with the node Internal IP or Pod IP.
There are no Network Policies involved.
From what I can tell, the main issue is that a) clients fail to establish a fresh session with a LoadBalancer IP, and b) if a session is established in some other way (like accessing the node/Pod IP), then the session will eventually expire and the LoadBalancer traffic breaks.
Worth noting is that other clients that have established sessions still work fine.
Details
Both clients and Kubernetes workers live on the same network subnet, 192.168.5.0/24. The Kubernetes workers announce the Pod CIDR and LoadBalancer CIDR using the BGP Control Plane (not MetalLB). No tunnel; direct routing, kube-proxy-free setup using BGP.
Non-working session
If I try to access the service from a client that has been offline for a while:
I can observe the following using hubble observe, where the TCP handshake arrives, but the client never sees the SYN-ACK and thus retransmits the SYN:
Using tcpdump on the Kubernetes worker (the router shows the same), I can see that the request comes in and the SYN-ACK goes out. But the client never receives it, and thus retransmits the packet (Wireshark excerpt from pcap):
Edit: captured new dumps
~TCP dump pcap: pcap-not-working.zip~
pcap of trying a "fresh" (not working) DNS lookup attempt to the LoadBalancer:
The client never receives the SYN-ACK, i.e. I cannot see this packet in tcpdump on the client.
pcap of the same request (working), but made directly to the Pod IP (behind the same LoadBalancer):
In this capture, we can see that there's an ARP request, which populates the neighbour list on the Kubernetes worker.
I can see the routes being properly announced via BGP to the router:
Working session
If I initiate a session with the node itself, 192.168.5.20, such as ssh, or directly to a Pod IP, it starts working and keeps working until the session expires.
Client:
and hubble observe:
:Cilium Version
Please note that I'm using a pre-release version of Cilium in order to be able to use the BGP control plane to announce LoadBalancer IPs, which isn't part of v1.13-rc3 (quay.io/cilium/cilium-ci:dab8723c01c94998fd082ae1630a58f62f19658f). I needed the improvement in #22397 to fully be able to replace the MetalLB-based BGP implementation.
Kernel Version
Debian 11 with backports kernel:
Kubernetes Version
Sysdump
The original zip file was too large, so I recompressed it using xz and then created a zip out of that, because GitHub doesn't accept xz files. I hope this is OK. :sad:
cilium-sysdump-20221217-131223.tar.xz.zip
Relevant log output
Anything else?
linux neighbours
While having the problem on a client with IP 192.168.5.157, I cannot see it in the ip neigh output on the worker node:
Once I initiate a session to the node, I can observe the client showing up in the host's neighbour list:
Cilium configuration
configuration options
Using Helm, relevant values:
status
BGP configuration
Code of Conduct