juliusrickert opened 1 year ago
Could you share a sysdump of the cluster and reproduction steps?
Updated initial report to add a Drive link to the sysdump. It is generated on a similarly configured cluster with the same issue.
autoDirectNodeRoutes: true
bandwidthManager:
  bbr: true
  enabled: true
bpf:
  masquerade: true
enableIPv6Masquerade: false
hostFirewall:
  enabled: true
ipam:
  operator:
    clusterPoolIPv4MaskSize: 24
    clusterPoolIPv4PodCIDRList:
      - <IPv4 CIDR>
    clusterPoolIPv6MaskSize: 120
    clusterPoolIPv6PodCIDRList:
      - <IPv6 CIDR>
ipv4:
  enabled: true
ipv4NativeRoutingCIDR: 10.0.0.0/8
ipv6:
  enabled: true
ipv6NativeRoutingCIDR: ::/0
k8sServiceHost: <host>
k8sServicePort: <port>
kubeProxyReplacement: strict
loadBalancer:
  mode: dsr
operator:
  replicas: 1
tunnel: disabled
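The values above can be cross-checked against what the agent actually applied, e.g. (exec target and grep pattern are assumptions):

# show the datapath settings as reported by the running agent
kubectl -n kube-system exec ds/cilium -- cilium status | grep -E 'KubeProxyReplacement|Masquerading|Host Routing'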
Do you still have this issue if you disable BPF masquerading and the Host Firewall?
Disabled both host firewall and BPF masquerading (and BBR for the bandwidth manager because it relies on BPF). The issue persists.
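For reference, the change corresponds roughly to the following Helm update (release name and namespace are assumptions):

# turn off BPF masquerading, the host firewall, and BBR in one upgrade
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set bpf.masquerade=false \
  --set hostFirewall.enabled=false \
  --set bandwidthManager.bbr=false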
IPv6 should be unaffected by the BPF masquerading, no?
IPv6 should be unaffected by the BPF masquerading, no?
Yep, good point given we don't currently support BPF masquerading for IPv6.
Are you doing the traceroute toward the pod IP address or some service with the pod as a backend? Do you see any drops reported by Cilium?
Are you doing the traceroute toward the pod IP address or some service with the pod as a backend?
Towards the Pod IP from a machine outside of the cluster's Nodes and their L2 network. (Nodes are on the same L2 network.)
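For reference, the trace below comes from an invocation along these lines (placeholder address; exact flags assumed):

# traceroute to the Pod's IPv6 address from outside the cluster
traceroute -6 <Pod IPv6>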
9 <router's transport network>::2 33.091 ms 33.012 ms 32.916 ms
10 <network which contains Node>::112 11.002 ms 17.246 ms 10.986 ms
11 * * *
12 <network of the Pods of the Node>::68 11.257 ms 10.743 ms 10.517 ms
I just realised that the Node is indeed responding (hop 10) if BPF masquerading is disabled. There is still another hop between the Node (10) and the Pod (12).
The Node's hop is missing when BPF masquerading is enabled:
9 <router's transport network>::2 49.338 ms 49.278 ms 49.064 ms
10 * * *
11 <network of the Pods of the Node>::68 14.774 ms 10.840 ms 13.030 ms
Do you see any drops reported by Cilium?
I didn't see any drops reported by Hubble, but I may not have been looking in the right place.
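For anyone retracing this, datapath drops involving the Pod should surface with a query along these lines (filter flags are assumptions):

# stream drop verdicts that involve the Pod's address
hubble observe --follow --verdict DROPPED --ip <Pod IPv6>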
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
@pchaigno let me know if you need any additional information.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Next step would probably be to check for packet drops in the Linux stack with pwru.
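Something along these lines should show where in the stack the ICMPv6 reply is lost (filter expression is an assumption):

# trace ICMPv6 packets through the kernel and print the L4 tuple at each function
pwru --output-tuple 'icmp6'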
Apologies for the delay! I wasn't able to make time to debug this issue further.
I have noticed that this issue is reproducible when tracerouting from one Pod to another. When tracerouting to a Pod on the same Node, regardless of IPv6 or IPv4, there is a single hop between the Pods that is not responding. Similarly, when tracerouting to a Pod on a different Node, there are two hops which aren't replying.
Although I haven't been able to dig deeper, I hope this information may help to narrow down the issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.
IPv6 should be unaffected by the BPF masquerading, no?
Even without IPv6 BPF masquerading, we would still track host-originating traffic if it potentially conflicts with NodePort NAT connections to remote backends.
The Node's hop is missing when BPF masquerading is enabled:
This is most likely caused by the BPF SNAT logic dropping the ICMPV6_TIME_EXCEED packet:
https://github.com/cilium/cilium/blob/fbbca8494eaa2417fdfe201cb4dcc262bb514ad6/bpf/lib/nat.h#L1755
https://github.com/cilium/cilium/pull/26674 might help here (it speaks about DSR-eligible traffic, but actually also skips over non-Service traffic such as ICMP).
Note that with BPF masquerading enabled, we could still end up applying SNAT to the traffic (and hit the same issue), depending on what IPV6_MASQUERADE is selected:
https://github.com/cilium/cilium/blob/fbbca8494eaa2417fdfe201cb4dcc262bb514ad6/bpf/lib/nat.h#L1619
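One way to confirm this on a live node would be to watch the agent's drop notifications while re-running the traceroute; a sketch, assuming the standard agent DaemonSet:

# print only drop events (and their reason) from the datapath
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type drop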
Thank you for looking into this issue and pinning it, @julianwiedmann. Appreciate it!
Unfortunately, I don't have time at hand to test fixes at the moment. Have you been able to reproduce the issue?
Is there an existing issue for this?
What happened?
When trying to traceroute towards a Pod running on our cluster, the hop before the Pod, i.e. the Node, is not responding with ICMP6, time exceeded in-transit. Our cluster is using eBPF for everything. See the "Anything else?" section below for more information.
Cilium Version
Client: 1.12.4 6eaecaf 2022-11-16T05:45:01+00:00 go version go1.18.8 linux/amd64
Daemon: 1.12.4 6eaecaf 2022-11-16T05:45:01+00:00 go version go1.18.8 linux/amd64
Kernel Version
Linux k8s0-controlplane0 6.0.0-5-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.10-2 (2022-12-01) x86_64 Linux
Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-08T19:51:43Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-08T19:51:45Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
Sysdump
cilium-sysdump-20221213-140237.zip
Relevant log output
No response
Anything else?
This bug could affect IPv4 too, but IPv4 addresses are probably too expensive for anyone to run IPv4 clusters without masquerading :D
Our hosts have CiliumClusterwideNetworkPolicies configured which allow them to respond to ICMPv4 and ICMPv6 echo requests.
cilium status:
Keywords: hops exceeded, TTL, tracert, mtr