Fabian-K opened 2 months ago
I couldn't reproduce this issue in my environment. Could you check whether the traffic is at least being forwarded correctly out of the node when you run the test from the pod?
tcpdump -i eth0 -n port 443
Thank you for looking into this!
A successful execution of nc -4 -zv -w1 google.com 443 immediately results in:
11:02:40.012163 IP 10.42.0.46.37702 > 142.250.186.78.443: Flags [S], seq 3633463507, win 64860, options [mss 1410,sackOK,TS val 2231993490 ecr 0,nop,wscale 7], length 0
11:02:40.015861 IP 142.250.186.78.443 > 10.42.0.46.37702: Flags [S.], seq 3656415161, ack 3633463508, win 65535, options [mss 1412,sackOK,TS val 4174670192 ecr 2231993490,nop,wscale 8], length 0
11:02:40.015910 IP 10.42.0.46.37702 > 142.250.186.78.443: Flags [.], ack 1, win 507, options [nop,nop,TS val 2231993493 ecr 4174670192], length 0
11:02:40.016076 IP 10.42.0.46.37702 > 142.250.186.78.443: Flags [F.], seq 1, ack 1, win 507, options [nop,nop,TS val 2231993494 ecr 4174670192], length 0
11:02:40.019749 IP 142.250.186.78.443 > 10.42.0.46.37702: Flags [F.], seq 1, ack 2, win 256, options [nop,nop,TS val 4174670195 ecr 2231993494], length 0
11:02:40.019793 IP 10.42.0.46.37702 > 142.250.186.78.443: Flags [.], ack 2, win 507, options [nop,nop,TS val 2231993497 ecr 4174670195], length 0
A failed execution of nc -4 -zv -w1 google.com 443 (I omitted the timeout) produces over time:
11:09:03.832261 IP 10.42.0.46.49944 > 142.250.186.78.443: Flags [S], seq 520064118, win 64860, options [mss 1410,sackOK,TS val 2232377310 ecr 0,nop,wscale 7], length 0
11:09:04.863110 IP 10.42.0.46.49944 > 142.250.186.78.443: Flags [S], seq 520064118, win 64860, options [mss 1410,sackOK,TS val 2232378341 ecr 0,nop,wscale 7], length 0
11:09:05.887040 IP 10.42.0.46.49944 > 142.250.186.78.443: Flags [S], seq 520064118, win 64860, options [mss 1410,sackOK,TS val 2232379365 ecr 0,nop,wscale 7], length 0
11:09:06.911110 IP 10.42.0.46.49944 > 142.250.186.78.443: Flags [S], seq 520064118, win 64860, options [mss 1410,sackOK,TS val 2232380389 ecr 0,nop,wscale 7], length 0
11:09:07.935093 IP 10.42.0.46.49944 > 142.250.186.78.443: Flags [S], seq 520064118, win 64860, options [mss 1410,sackOK,TS val 2232381413 ecr 0,nop,wscale 7], length 0
11:09:08.959048 IP 10.42.0.46.49944 > 142.250.186.78.443: Flags [S], seq 520064118, win 64860, options [mss 1410,sackOK,TS val 2232382437 ecr 0,nop,wscale 7], length 0
If I understand it correctly, 142.250.186.78 is one of the IP addresses of the target server (here Google), and for some reason sometimes no TCP connection can be established at all? 🤔
Where are you running the tcpdump, from the pod or from the node? Could you do it from the node? You can redact your public IP address if needed.
You are right, that was from the pod. Here are the results from the node. I made two adjustments, however: the main network interface is enp5s0, and since there is quite a lot of traffic on port 443, I switched to port 587 and smtp.gmail.com (I originally noticed these issues when sending emails). There is no other traffic on that port on the server, so I used
tcpdump -i enp5s0 -n port 587
and nc -4 -zv smtp.gmail.com 587
A successful execution of nc -4 -zv smtp.gmail.com 587 immediately results in:
15:55:19.684532 IP <PUBLIC IP>.63976 > 74.125.206.108.587: Flags [S], seq 2360491525, win 64860, options [mss 1410,sackOK,TS val 2537310217 ecr 0,nop,wscale 7], length 0
15:55:19.694849 IP 74.125.206.108.587 > <PUBLIC IP>.63976: Flags [S.], seq 3138931281, ack 2360491526, win 65535, options [mss 1412,sackOK,TS val 369829671 ecr 2537310217,nop,wscale 8], length 0
15:55:19.694929 IP <PUBLIC IP>.63976 > 74.125.206.108.587: Flags [.], ack 1, win 507, options [nop,nop,TS val 2537310227 ecr 369829671], length 0
15:55:19.695114 IP <PUBLIC IP>.63976 > 74.125.206.108.587: Flags [F.], seq 1, ack 1, win 507, options [nop,nop,TS val 2537310228 ecr 369829671], length 0
15:55:19.705899 IP 74.125.206.108.587 > <PUBLIC IP>.63976: Flags [.], ack 2, win 256, options [nop,nop,TS val 369829682 ecr 2537310228], length 0
15:55:19.706408 IP 74.125.206.108.587 > <PUBLIC IP>.63976: Flags [F.], seq 1, ack 2, win 256, options [nop,nop,TS val 369829682 ecr 2537310228], length 0
15:55:19.706446 IP <PUBLIC IP>.63976 > 74.125.206.108.587: Flags [.], ack 2, win 507, options [nop,nop,TS val 2537310239 ecr 369829682], length 0
A failed execution of nc -4 -zv smtp.gmail.com 587 produces over time:
15:57:06.808798 IP <PUBLIC IP>.3566 > 74.125.206.108.587: Flags [S], seq 582598544, win 64860, options [mss 1410,sackOK,TS val 2537417341 ecr 0,nop,wscale 7], length 0
15:57:07.871077 IP <PUBLIC IP>.3566 > 74.125.206.108.587: Flags [S], seq 582598544, win 64860, options [mss 1410,sackOK,TS val 2537418404 ecr 0,nop,wscale 7], length 0
15:57:08.895179 IP <PUBLIC IP>.3566 > 74.125.206.108.587: Flags [S], seq 582598544, win 64860, options [mss 1410,sackOK,TS val 2537419428 ecr 0,nop,wscale 7], length 0
15:57:09.919173 IP <PUBLIC IP>.3566 > 74.125.206.108.587: Flags [S], seq 582598544, win 64860, options [mss 1410,sackOK,TS val 2537420452 ecr 0,nop,wscale 7], length 0
15:57:10.943151 IP <PUBLIC IP>.3566 > 74.125.206.108.587: Flags [S], seq 582598544, win 64860, options [mss 1410,sackOK,TS val 2537421476 ecr 0,nop,wscale 7], length 0
15:57:11.967181 IP <PUBLIC IP>.3566 > 74.125.206.108.587: Flags [S], seq 582598544, win 64860, options [mss 1410,sackOK,TS val 2537422500 ecr 0,nop,wscale 7], length 0
15:57:14.015171 IP <PUBLIC IP>.3566 > 74.125.206.108.587: Flags [S], seq 582598544, win 64860, options [mss 1410,sackOK,TS val 2537424548 ecr 0,nop,wscale 7], length 0
15:57:18.047198 IP <PUBLIC IP>.3566 > 74.125.206.108.587: Flags [S], seq 582598544, win 64860, options [mss 1410,sackOK,TS val 2537428580 ecr 0,nop,wscale 7], length 0
Not sure if that is relevant, but the server is a dedicated server hosted by Hetzner.
I don't know about Hetzner, but the traffic seems to be forwarded out of the node correctly by flannel. You aren't getting any reply from the internet. Is there any configuration on the provider's network that could drop the traffic? Perhaps some rules to prevent a DDoS attack.
Hmm... nothing that I'm aware of. What bugs me the most is that it is 100% reliable when running directly from the node; anything on the provider side would also affect that, right? It only starts to fail when running from within a pod. And it only fails from the pod when using IPv4; IPv6 is 100% reliable there as well. 🤔
I did a small test with 100 attempts (a sketch of the test loop is shown after the results):
From Node: nc -4 -zv google.com 443
100% reliable (100/100)
From Node: nc -6 -zv google.com 443
100% reliable (100/100)
From Pod: nc -4 -zv google.com 443
~50% reliable (53/100)
From Pod: nc -6 -zv google.com 443
100% reliable (100/100)
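For reference, a loop like the following runs this kind of test (a sketch; run it directly on the node, or wrap it in kubectl exec to run it from inside a pod — the pod name is a placeholder):
# Count how many of 100 IPv4 connection attempts to google.com:443 succeed
# (swap -4 for -6 to test IPv6)
ok=0
for i in $(seq 1 100); do
  nc -4 -zv -w1 google.com 443 >/dev/null 2>&1 && ok=$((ok + 1))
done
echo "successful: $ok/100"
# From a pod instead (pod name is a placeholder):
# kubectl exec <some-pod> -- sh -c 'ok=0; for i in $(seq 1 100); do nc -4 -zv -w1 google.com 443 >/dev/null 2>&1 && ok=$((ok+1)); done; echo "successful: $ok/100"'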
Could you try this from the node?
ethtool --offload eth0 rx off tx off
ethtool -K eth0 gso off
replacing eth0 with the name of the interface (enp5s0 in your case)
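To verify the change took effect, the current offload settings can be listed as well (enp5s0 assumed as the interface name from above):
# List offload features and their current on/off state
ethtool -k enp5s0 | grep -E 'rx-checksumming|tx-checksumming|generic-segmentation-offload'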
I think I found the reason; however, I can't fully explain it yet. There is a firewall on the provider side that by default does not filter IPv6. This explains why it always works for IPv6, both on the node and from the pod.
In addition to some rules like only allowing incoming 80 and 443, it also contains by default an entry called "TCP established" with version ipv4, protocol TCP, target port 32768-65535, TCP flags ack -> action accept. As soon as this entry is present, I see the behavior as described.
When I temporarily replace it with something like version ipv4, protocol TCP, target port 0-65535, TCP flags ack -> action accept, the issue is resolved.
Is a different ephemeral port range (other than 32768-65535) used when the traffic comes from the pod via flannel? 🤔
For the traffic from the pods, Flannel only configures basic NAT with iptables and MASQUERADE.
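For example, the rule can be seen on the node with something like this (chain names differ between flannel/k3s versions):
# Show the NAT rules that rewrite the source address of outgoing pod traffic
iptables -t nat -S | grep -i MASQUERADE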
In that context I found https://github.com/canonical/microk8s/issues/3909, which describes the same issue with Calico. It looks like, for some reason, a wider ephemeral port range is used. 1024-65535 seems to work, matching https://datatracker.ietf.org/doc/html/rfc6056 🤔.
I currently don't know where to follow up on this, but at least it does not seem to be an issue exclusive to flannel.
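One way to compare the port ranges involved (a sketch; the pod name is a placeholder, and whether the rule carries --random-fully depends on the flannel/k3s version):
# Ephemeral port range used for connections made directly from the node
sysctl net.ipv4.ip_local_port_range
# Ephemeral port range inside a pod's network namespace
kubectl exec <some-pod> -- cat /proc/sys/net/ipv4/ip_local_port_range
# If the MASQUERADE rule includes --random-fully, the post-NAT source port is
# randomized over roughly 1024-65535 instead of keeping the pod's port, which
# would explain node-side source ports below 32768
iptables -t nat -S | grep -i 'MASQUERADE.*random'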
Thanks a lot @rbrtbnfgl for the support! 🙏
Hi,
I installed k3s on top of Ubuntu 24.04 using flannel VXLAN (k3s config below). When connecting to external services over IPv4 from within a pod, the connections sometimes succeed and sometimes time out. Over IPv6, they always work. The same connections made directly from the host also always succeed (both IPv4 and IPv6).
Unfortunately, my knowledge of networking is quite limited. Do you have any idea what could cause this behavior?
Thanks, Fabian
Connecting to google.com from a pod using IPv4 sometimes fails:
Connecting to google.com from a pod using IPv6 always works:
Connecting to google.com from the host using IPv4 and IPv6 always works:
Expected Behavior
Reliable connectivity from cluster to external service
Current Behavior
Frequent timeouts when connecting to external services using IPv4
Steps to Reproduce (for bugs)
nc -4 -zv -w1 google.com 443
from within a pod
Context
Your Environment