aojea / nat64

NAT64 implementation for Kubernetes deployments (mainly)
Apache License 2.0

Pod stuck in "syncing iptables rules..." #5

Closed alexandremahdhaoui closed 1 month ago

alexandremahdhaoui commented 1 month ago

Hi again 👋🏼

The nat64 pod is stuck syncing iptables rules and I was wondering if this has something to do with my setup. I'm also unsure how to proceed from here and how to debug it. Could the issue be related to the CNI I use, i.e. Cilium in native routing mode?

Thoughts

Spec

What I tried

Run a pod and execute the following commands:
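For reference, a minimal version of such a test could look like the following; the pod name and image are only illustrative, not the exact commands used:

# run a disposable pod with basic network tools (image choice is an assumption)
kubectl run test --rm -it --image=nicolaka/netshoot -- bash
# inside the pod: DNS64 should resolve github.com to a 64:ff9b:: address
nslookup github.com
# ping the resolved (translated) address
ping -c 4 github.com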

Tcpdump while pinging github.com from the pod

15:30:10.177360 IP6 test.56362 > fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local.53: 31114+ PTR? 4.0.9.7.2.5.c.8.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.b.9.f.f.4.6.0.0.ip6.arpa. (90)
15:30:10.192801 IP6 fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local.53 > test.56362: 31114 2/0/0 CNAME 4.121.82.140.in-addr.arpa., PTR lb-140-82-121-4-fra.github.com. (270)
15:30:10.193481 IP6 test.47646 > fc00--ffff-8a.kube-dns.kube-system.svc.cluster.local.53: 3252+ PTR? 1.c.0.0.f.f.f.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.c.f.ip6.arpa. (90)
15:30:10.194304 IP6 fc00--ffff-8a.kube-dns.kube-system.svc.cluster.local.53 > test.47646: 3252*- 1/0/0 PTR fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local. (228)
15:30:10.194594 IP6 test.35224 > fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local.53: 44917+ PTR? 1.c.0.0.f.f.f.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.c.f.ip6.arpa. (90)
15:30:10.195077 IP6 fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local.53 > test.35224: 44917*- 1/0/0 PTR fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local. (228)
15:30:10.195629 IP6 test.46175 > fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local.53: 32442+ PTR? a.8.0.0.f.f.f.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.c.f.ip6.arpa. (90)
15:30:10.196226 IP6 fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local.53 > test.46175: 32442*- 1/0/0 PTR fc00--ffff-8a.kube-dns.kube-system.svc.cluster.local. (228)
15:30:10.196701 IP6 test.33040 > fc00--ffff-8a.kube-dns.kube-system.svc.cluster.local.53: 59768+ PTR? a.8.0.0.f.f.f.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.c.f.ip6.arpa. (90)
15:30:10.197153 IP6 fc00--ffff-8a.kube-dns.kube-system.svc.cluster.local.53 > test.33040: 59768*- 1/0/0 PTR fc00--ffff-8a.kube-dns.kube-system.svc.cluster.local. (228)
15:30:11.177096 IP6 test > lb-140-82-121-4-fra.github.com: ICMP6, echo request, seq 4, length 64
15:30:12.171213 IP6 fe80::103e:a2ff:fe1e:990b > fc00::ffff:da: ICMP6, neighbor solicitation, who has fc00::ffff:da, length 32
15:30:12.171313 IP6 fc00::ffff:da > fe80::103e:a2ff:fe1e:990b: ICMP6, neighbor advertisement, tgt is fc00::ffff:da, length 32
15:30:12.171606 IP6 test.46825 > fc00--ffff-8a.kube-dns.kube-system.svc.cluster.local.53: 40367+ PTR? a.d.0.0.f.f.f.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.c.f.ip6.arpa. (90)
15:30:12.177282 IP6 test > lb-140-82-121-4-fra.github.com: ICMP6, echo request, seq 5, length 64
15:30:12.187705 IP6 fc00--ffff-8a.kube-dns.kube-system.svc.cluster.local.53 > test.46825: 40367 NXDomain 0/1/0 (166)
15:30:12.188168 IP6 test.52094 > fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local.53: 64507+ PTR? b.0.9.9.e.1.e.f.f.f.2.a.e.3.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa. (90)
15:30:12.203831 IP6 fc00--ffff-c1.kube-dns.kube-system.svc.cluster.local.53 > test.52094: 64507 NXDomain 0/1/0 (166)
15:30:13.177465 IP6 test > lb-140-82-121-4-fra.github.com: ICMP6, echo request, seq 6, length 64
15:30:14.177657 IP6 test > lb-140-82-121-4-fra.github.com: ICMP6, echo request, seq 7, length 64
15:30:15.177875 IP6 test > lb-140-82-121-4-fra.github.com: ICMP6, echo request, seq 8, length 64
15:30:16.178073 IP6 test > lb-140-82-121-4-fra.github.com: ICMP6, echo request, seq 9, length 64

Pinging from the host

Pinging github.com (64:ff9b::8c52:7903) from the host is also stuck. Running tcpdump on the main interface and on the nat64 one, I can see that all packets to 64:ff9b::8c52:7903 go through the nat64 interface and then get stuck.
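For reference, the host-side captures were along these lines; the interface name enp0s31f6 comes from the daemon logs below, and the exact filters are an assumption:

# on the node: echo requests to the NAT64 prefix should show up on the nat64 device ...
tcpdump -ni nat64 net 64:ff9b::/96
# ... and, if translation works, leave the physical interface as plain IPv4 towards 140.82.121.3
tcpdump -ni enp0s31f6 host 140.82.121.3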

Logs

❯ kubectl logs -nkube-system   nat64-gd956   -f
2024/09/05 14:57:50 detected enp0s31f6 as default gateway interface
2024/09/05 14:57:50 create NAT64 interface nat64 with networks 169.254.64.0/24 and 64:ff9b::/96
2024/09/05 14:57:50 NAT64 interface with name nat64 not found, creating it
2024/09/05 14:57:50 starting metrics server listening in 0.0.0.0:8881
2024/09/05 14:57:50 NAT64 interface with name nat64 down, setting it up
2024/09/05 14:57:50 replacing addresses [] on interface nat64 with 169.254.64.0/24
2024/09/05 14:57:50 replacing addresses [fe80::1c6f:d7ff:feeb:6ea1/64] on interface nat64 with 64:ff9b::/96
2024/09/05 14:57:50 eBPF program spec section tc/nat46 name nat46
2024/09/05 14:57:50 eBPF program spec section tc/nat64 name nat64
2024/09/05 14:57:52 adding eBPF nat64 prog to the interface nat64
2024/09/05 14:57:52 adding eBPF nat46 prog to the interface nat64
2024/09/05 14:57:53 NAT64 initialized
2024/09/05 15:02:52 syncing iptables rules ...
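The interface and eBPF attachments described in these logs can be double-checked on the node; the commands below are illustrative:

# addresses on the nat64 device (should include 169.254.64.0/24 and 64:ff9b::/96)
ip addr show dev nat64
# tc-attached BPF programs on the nat64 device (direction may vary)
tc filter show dev nat64 ingress
tc filter show dev nat64 egress
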
aojea commented 1 month ago

icmp and icmpv6 are ... complicated, hence we didn't implement them (yet) ... @siwiutki had a branch with a prototype

Can you test with tcp or udp? Do curl -v www.github.com, for example.

The iptables sync runs every 5 minutes by default, so that log line may not be the problem; let's validate that TCP and UDP NAT64 work.
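For example, from inside a pod; the resolver address below is only an illustration of an IPv4 host embedded into 64:ff9b::/96:

# TCP through NAT64: DNS64 resolves www.github.com to a 64:ff9b:: address
curl -v http://www.github.com
# UDP through NAT64: query a public IPv4 resolver via its mapped address (8.8.8.8 -> 64:ff9b::808:808)
dig @64:ff9b::808:808 github.com A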

alexandremahdhaoui commented 1 month ago

Thanks for the suggestion. I tried it out and you'll find the results below.

Additionally, I'd be happy to help develop a solution and contribute to the project if possible.

Tcpdump with curl or ssh

bash-5.0# curl -v github.com
*   Trying 64:ff9b::8c52:7904:80...
* connect to 64:ff9b::8c52:7904 port 80 failed: Operation timed out
* Failed to connect to 64:ff9b::8c52:7904 port 80 after 135490 ms: Operation timed out
* Closing connection 0
curl: (28) Failed to connect to 64:ff9b::8c52:7904 port 80 after 135490 ms: Operation timed out

Or

bash-5.0# ssh github.com

Yields the same output

22:37:01.058148 IP6 test.42760 > lb-140-82-121-4-fra.github.com.80: Flags [S], seq 2428981892, win 64800, options [mss 1440,sackOK,TS val 2022456602 ecr 0,nop,wscale 7], length 0
22:37:02.081953 IP6 test.42760 > lb-140-82-121-4-fra.github.com.80: Flags [S], seq 2428981892, win 64800, options [mss 1440,sackOK,TS val 2022457626 ecr 0,nop,wscale 7], length 0
22:37:04.130980 IP6 test.42760 > lb-140-82-121-4-fra.github.com.80: Flags [S], seq 2428981892, win 64800, options [mss 1440,sackOK,TS val 2022459675 ecr 0,nop,wscale 7], length 0
22:37:08.161973 IP6 test.42760 > lb-140-82-121-4-fra.github.com.80: Flags [S], seq 2428981892, win 64800, options [mss 1440,sackOK,TS val 2022463706 ecr 0,nop,wscale 7], length 0
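For reference, 64:ff9b::8c52:7904 is the RFC 6052 well-known prefix with the target's IPv4 address in the low 32 bits, i.e. 140.82.121.4:

# the last 32 bits of 64:ff9b::8c52:7904 spell out the IPv4 address
printf '%d.%d.%d.%d\n' 0x8c 0x52 0x79 0x04   # -> 140.82.121.4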

What I'm trying to achieve

xx@host:~$ argocd repo add git@github.com:OWNER/REPO.git --ssh-private-key-path [...]

127.0.0.1:43459ERRO[0133] finished unary call with code Unknown         error="rpc error: code = Unknown desc = error testing repository connectivity: dial tcp [64:ff9b::8c52:7904]:22: connect: connection timed out" grpc.code=Unknown grpc.method=ValidateAccess grpc.service=repository.RepositoryService grpc.start_time="2024-09-05T22:23:40Z" grpc.time_ms=133537.72 span.kind=server system=grpc
FATA[0133] rpc error: code = Unknown desc = error testing repository connectivity: dial tcp [64:ff9b::8c52:7904]:22: connect: connection timed out 
aojea commented 1 month ago

(Please use the -n flag in tcpdump to show the raw IP addresses in the traces.)

OK, let's do a sanity check first. Execute this directly from the node: curl -v -k https://140.82.121.4
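If the plain IPv4 request works, a follow-up request through the NAT64 prefix from the same node narrows it down to the translation path, for example (illustrative):

# same endpoint via its NAT64-mapped address; this exercises the nat64 device on the node
curl -v -k 'https://[64:ff9b::8c52:7904]'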

aojea commented 1 month ago

Any progress, @alexandremahdhaoui?

alexandremahdhaoui commented 1 month ago

Hi @aojea, thanks a lot for your help, but I decided to change my setup and rebuild my cluster as dual-stack instead.

I will try setting up IPv6-only with the nat64 DaemonSet again in the future.

Until then, I will close this issue.

alexandremahdhaoui commented 1 month ago

Thanks a lot for your support.

aojea commented 1 month ago

my pleasure, thanks