eripa closed this issue 1 year ago.
Hi @eripa, thanks so much for the detailed bug report!
One question: if you remove the hybrid SNAT/DSR mode and just keep it to SNAT only, essentially this line:
mode: hybrid
... would it start to work then?
Thanks @borkmann, unfortunately this doesn't make any difference for this issue. I've tried SNAT, DSR and hybrid with the same results.
Ok, with regards to your description of the neighbor issue: you observe this happening on the worker node which contains the backend (so not the LB node), is that correct? Could you run cilium monitor -t drop from the Cilium Pod to see potential drop reasons? Do you see in tcpdump any attempts to resolve the 192.168.5.157 client?
In this case the backend is the LB, since the backend pod is hosted on the node and the node is announcing the LoadBalancer IP.
Below is the router's routing table, where you can see the LoadBalancer IP (192.168.20.53/32) being announced and available at the node IP (192.168.5.20).
The cilium monitor doesn't show any drops; there are also no policies involved here, so I wouldn't expect it to drop anything. There are a few sporadic RouterSolicitation drops, but those are IPv6 specific (not using IPv6).
Finally, I don't see any ARP lookups when hitting the LoadBalancer, but I do see them when I hit the backend Pod IP directly. And once that's done, the LoadBalancer IP will work until the ARP cache expires.
eric@ubnt:~$ show ip route bgp
IP Route Table for VRF "default"
B *> 10.244.0.0/24 [200/0] via 192.168.5.20, eth1, 03:03:27
B *> 192.168.20.1/32 [200/0] via 192.168.5.20, eth1, 03:03:27
B *> 192.168.20.5/32 [200/0] via 192.168.5.20, eth1, 03:03:27
B *> 192.168.20.50/32 [200/0] via 192.168.5.20, eth1, 03:03:27
B *> 192.168.20.53/32 [200/0] via 192.168.5.20, eth1, 03:03:27
Thanks for the detailed report. I think we were able to figure out what is going on.
If we first look at the PCAPs of the scenario where LB traffic doesn't work, I see the following:
If we then look at the PCAPs for the Pod IP where it does work, I see the following:
Now that the node has the ARP record for the client, the LB works, since the node will at that point start using the ARP record for return traffic for the LB VIP as well.
So we have 3 issues here:
My suggestion to resolve this issue is to make a dedicated CIDR for clients, let's say 192.168.6.0/24; this should force traffic in both directions to go via the router. Then make sure return traffic can reach the clients, which should be easier to diagnose in that setup since you won't have packets bypassing your router.
Let's keep this issue as a bug report for the incorrect handling of return traffic. Some technical details:
After discussing it with @borkmann, the conclusion is that the return traffic in this case will be processed by rev_nodeport_lb{4,6}. Here we perform a FIB lookup; we suspect that this results in a BPF_FIB_LKUP_RET_NO_NEIGH response when no ARP entries exist. At that moment rev_nodeport_lb{4,6} will send traffic back to the neighbor from which we received the traffic in the first place, assuming it can handle the return traffic. It seems the more correct thing to do would be to call redirect_direct_v{4,6}, which will invoke the redirect_neigh helper in this case, which does perform the ARP lookup.
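To make the distinction concrete, below is a minimal standalone tc/BPF sketch (not Cilium's actual rev_nodeport_lb{4,6} or redirect_direct_v{4,6} code; the function name handle_return_traffic and the overall structure are invented for illustration). It shows the idea of falling back to the bpf_redirect_neigh() helper when bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH, so the kernel performs the neighbor (ARP) resolution instead of the program sending the packet back to whichever neighbor it came from:

/* Hedged sketch only: a simplified tc/BPF program, not Cilium's datapath. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

static __always_inline int handle_return_traffic(struct __sk_buff *skb,
                                                 __be32 saddr, __be32 daddr)
{
    struct bpf_fib_lookup fib = {
        .family   = AF_INET,
        .ifindex  = skb->ingress_ifindex,
        .ipv4_src = saddr,
        .ipv4_dst = daddr,
    };
    long ret = bpf_fib_lookup(skb, &fib, sizeof(fib), 0);

    switch (ret) {
    case BPF_FIB_LKUP_RET_SUCCESS:
        /* Route and neighbor entry found: rewrite MACs and redirect. */
        if (bpf_skb_store_bytes(skb, 0, fib.dmac, ETH_ALEN, 0) < 0 ||
            bpf_skb_store_bytes(skb, ETH_ALEN, fib.smac, ETH_ALEN, 0) < 0)
            return TC_ACT_SHOT;
        return bpf_redirect(fib.ifindex, 0);
    case BPF_FIB_LKUP_RET_NO_NEIGH:
        /* Route exists but no ARP entry: let the kernel resolve the
         * neighbor on the egress device instead of guessing a MAC or
         * bouncing the packet back to the ingress neighbor. */
        return bpf_redirect_neigh(fib.ifindex, NULL, 0, 0);
    default:
        /* Anything else: hand the packet to the regular stack. */
        return TC_ACT_OK;
    }
}

SEC("tc")
int redirect_return_traffic(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    /* Only handle plain IPv4; pass everything else up the stack. */
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;
    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_OK;

    return handle_return_traffic(skb, ip->saddr, ip->daddr);
}

char _license[] SEC("license") = "GPL";

The relevant behaviour here is bpf_redirect_neigh(): when it is called with a NULL nexthop parameter, the kernel does its own FIB and neighbor lookup for the packet, so delivery of the return traffic does not depend on an ARP entry that happened to be created by earlier traffic from the client.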
Thanks a lot for looking into this and providing all the details and help @dylandreimerink, I really appreciate it. I moved the Kubernetes nodes over to a different subnet (10.0.5.0/24, with the LBs on 10.0.20.0/24), and this cleared up the issue, I think. There are no longer any clients listed in the Kubernetes node's ARP cache, and there are no ARP lookups. So I believe I'm now running properly on L3 only.
My problem is solved but please feel free to keep the issue open as needed.
Here's a quick tcpdump of the DNS request on the Kubernetes node side now:
11:16:10.398689 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [S], seq 505666507, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1548694393 ecr 0,sackOK,eol], length 0
11:16:10.398778 lxceeafc8fbb3d2 In IP 10.244.0.154.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
11:16:10.398807 enp2s0 Out IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
11:16:10.399774 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 1, win 2053, options [nop,nop,TS val 1548694395 ecr 2074525835], length 0
11:16:10.416055 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [P.], seq 1:42, ack 1, win 2053, options [nop,nop,TS val 1548694411 ecr 2074525835], length 41 29579+ [1au] A? debian.org. (39)
11:16:10.416120 lxceeafc8fbb3d2 In IP 10.244.0.154.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
11:16:10.416156 enp2s0 Out IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
11:16:10.416697 lxceeafc8fbb3d2 In IP 10.244.0.154.53 > 192.168.5.157.52932: Flags [P.], seq 1:90, ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 89 29579 3/0/1 A 130.89.148.77, A 128.31.0.62, A 149.20.4.15 (87)
11:16:10.416738 enp2s0 Out IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [P.], seq 1:90, ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 89 29579 3/0/1 A 130.89.148.77, A 128.31.0.62, A 149.20.4.15 (87)
11:16:10.417063 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
11:16:10.418150 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [F.], seq 42, ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
11:16:10.419379 lxceeafc8fbb3d2 In IP 10.244.0.154.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
11:16:10.419411 enp2s0 Out IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
11:16:10.420221 enp2s0 In IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 91, win 2051, options [nop,nop,TS val 1548694415 ecr 2074525856], length 0
and here is the same request on the router:
19:16:10.398087 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [S], seq 505666507, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1548694393 ecr 0,sackOK,eol], length 0
19:16:10.398316 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [S], seq 505666507, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1548694393 ecr 0,sackOK,eol], length 0
19:16:10.398622 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
19:16:10.398786 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
19:16:10.399214 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 1, win 2053, options [nop,nop,TS val 1548694395 ecr 2074525835], length 0
19:16:10.399408 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 1, win 2053, options [nop,nop,TS val 1548694395 ecr 2074525835], length 0
19:16:10.415922 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
19:16:10.416112 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
19:16:10.417662 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [F.], seq 42, ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
19:16:10.417834 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [F.], seq 42, ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
19:16:10.419187 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
19:16:10.419359 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
and finally the client:
11:16:10.249998 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [S], seq 505666507, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1548694393 ecr 0,sackOK,eol], length 0
11:16:10.250989 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [S.], seq 3471219395, ack 505666508, win 64308, options [mss 1410,sackOK,TS val 2074525835 ecr 1548694393,nop,wscale 8], length 0
11:16:10.251189 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 1, win 2053, options [nop,nop,TS val 1548694395 ecr 2074525835], length 0
11:16:10.267714 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [P.], seq 1:42, ack 1, win 2053, options [nop,nop,TS val 1548694411 ecr 2074525835], length 41 29579+ [1au] A? debian.org. (39)
11:16:10.268309 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [.], ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 0
11:16:10.268706 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [P.], seq 1:90, ack 42, win 252, options [nop,nop,TS val 2074525853 ecr 1548694411], length 89 29579 3/0/1 A 130.89.148.77, A 128.31.0.62, A 149.20.4.15 (87)
11:16:10.268783 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
11:16:10.269669 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [F.], seq 42, ack 90, win 2051, options [nop,nop,TS val 1548694412 ecr 2074525853], length 0
11:16:10.271607 IP 10.0.20.53.53 > 192.168.5.157.52932: Flags [F.], seq 90, ack 43, win 252, options [nop,nop,TS val 2074525856 ecr 1548694412], length 0
11:16:10.271895 IP 192.168.5.157.52932 > 10.0.20.53.53: Flags [.], ack 91, win 2051, options [nop,nop,TS val 1548694415 ecr 2074525856], length 0
Thanks for testing; yes, this clears it up given that traffic now goes via the default gateway. Beginning of January, I'm planning to consolidate most of the FIB functionality to address issues such as this one.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
@eripa The https://github.com/cilium/cilium/pull/23884 PR got merged, could you recheck with master if your issue is addressed? Thanks for your help!
Is there an existing issue for this?
What happened?
Hello,
I have an issue with Cilium running with direct routing with the BGP control plane in kube-proxy-free mode. Please note that I'm using a pre-release version of Cilium in order to be able to use the BGP control plane to announce LoadBalancer IPs, which isn't part of v1.13-rc3 (quay.io/cilium/cilium-ci:dab8723c01c94998fd082ae1630a58f62f19658f). I did experience the same issue with the MetalLB based approach, and upgraded in the hope that it would improve my situation. Once v1.13-rc4 is out I can switch to that version.
I have a suspicion that it's related to ARP and neighbour management (see ip neigh show below), but I'm not sure. If I remove the neighbour (sudo ip neigh del 192.168.5.157 lladdr 64:4b:f0:02:00:b3 dev enp2s0) it stops working in a seemingly identical way. When the client isn't in the node's neighbour list, and traffic to the LoadBalancer is not working, any query to the Pod IP will work (and make the LoadBalancer work again, adding the client to the neighbour list). Please advise. Thank you! :pray:
Edit: a similar setup, but using k3s, MetalLB and kube-proxy with iptables, works fine, with LoadBalancer IPs always being reachable.
Problem statement
Overall the setup seems to work, but LoadBalancer access stops working after a period of inactivity (seemingly as sessions expire), until a new session is established with the node in some other way, like SSH to the node or curl/dig to a Pod IP on the same Kubernetes node.
If any session has been established outside of hitting the LoadBalancer directly, then traffic works fine; but if the client goes offline for a while and then comes back, the client can no longer reach the LoadBalancer IP. It only starts working again after that client has initiated another session with the node Internal IP or Pod IP.
There are no Network Policies involved.
From what I can tell, the main issue is that a) clients fail to establish a fresh session with a LoadBalancer IP, and b) if a session is established in some other way (like accessing the node/Pod IP), then the session will eventually expire and the LoadBalancer traffic breaks.
Worth noting is that other clients that have established sessions still work fine.
Details
Both clients and Kubernetes workers live on the same network subnet, 192.168.5.0/24. The Kubernetes workers announce the Pod CIDR and LoadBalancer CIDR using the BGP Control Plane (not MetalLB). No tunnel; direct routing, kube-proxy-free setup using BGP.
Non-working session
If I try to access the service from a client that has been offline for a while:
I can observe the following using hubble observe, where the TCP handshake arrives, but the client never sees the SYN-ACK and thus retransmits the SYN:
Using tcpdump on the Kubernetes worker (the router shows the same), I can see that the request comes in and the SYN-ACK goes out. But the client never receives it, and thus retransmits the packet (Wireshark excerpt from pcap):
Edit: captured new dumps
~TCP dump pcap: pcap-not-working.zip~
pcap of trying a "fresh" (not working) DNS lookup attempt to the LoadBalancer:
The client never receives the SYN-ACK, i.e. I cannot see this packet in tcpdump on the client.
pcap of the same request (working), but made directly to the Pod IP (behind the same LoadBalancer):
In this capture, we can see that there's an ARP request, which populates the neighbour list on the Kubernetes worker.
I can see the routes being properly announced via BGP to the router:
Working session
If I initiate a session with the node itself, 192.168.5.20, such as ssh, or directly to a Pod IP, it starts working and keeps working until the session expires.
Client:
and hubble observe:
:Cilium Version
Please note that I'm using a pre-release version of Cilium in order to be able to use the BGP control plane to announce LoadBalancer IPs, which isn't part of v1.13-rc3 (quay.io/cilium/cilium-ci:dab8723c01c94998fd082ae1630a58f62f19658f). I needed the improvement in #22397 to fully be able to replace the MetalLB-based BGP implementation.
Kernel Version
Debian 11 with backports kernel:
Kubernetes Version
Sysdump
The original zip file was too large, so I recompressed it using xz and then created a zip out of that, because GitHub doesn't accept xz files. I hope this is OK. :sad:
cilium-sysdump-20221217-131223.tar.xz.zip
Relevant log output
Anything else?
linux neighbours
While having the problem on a client with IP 192.168.5.157, I cannot see it in the ip neigh output on the worker node:
Once I initiate a session to the node, I can observe the client showing up in the host's neighbour list:
Cilium configuration
configuration options
Using Helm, relevant values:
status
BGP configuration
Code of Conduct