cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Missing ICMPv6 time exceeded in-transit from Node #22611

Open juliusrickert opened 1 year ago

juliusrickert commented 1 year ago

What happened?

When running a traceroute towards a Pod on our cluster, the hop before the Pod, i.e. the Node, does not respond with an ICMPv6 "time exceeded in-transit" message.

Our cluster uses eBPF for everything; see the "Anything else?" section below for more information.

Cilium Version

Client: 1.12.4 6eaecaf 2022-11-16T05:45:01+00:00 go version go1.18.8 linux/amd64
Daemon: 1.12.4 6eaecaf 2022-11-16T05:45:01+00:00 go version go1.18.8 linux/amd64

Kernel Version

Linux k8s0-controlplane0 6.0.0-5-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.10-2 (2022-12-01) x86_64 Linux

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-08T19:51:43Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-08T19:51:45Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

cilium-sysdump-20221213-140237.zip

Relevant log output

No response

Anything else?

This bug could affect IPv4 too, but IPv4 addresses are probably too expensive for anyone to be running IPv4 clusters without masquerading :D

Our hosts have CiliumClusterwideNetworkPolicies configured which allow them to respond to ICMPv4 and ICMPv6 echo requests.

cilium status:

KVStore:                 Ok   Disabled
Kubernetes:              Ok   1.26 (v1.26.0) [linux/amd64]
Kubernetes APIs:         ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:    Strict    [internet 192.0.2.1 2001:db8::1 (Direct Routing)]
Host firewall:           Enabled   [internet]
CNI Chaining:            none
Cilium:                  Ok   1.12.4 (v1.12.4-6eaecaf)
NodeMonitor:             Listening for events on 4 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok   
IPAM:                    IPv4: 4/254 allocated from 10.0.2.0/24, IPv6: 4/65534 allocated from 2001:db8:1f::2:0/112
BandwidthManager:        EDT with BPF [BBR] [internet]
Host Routing:            BPF
Masquerading:            BPF   [internet]   10.0.0.0/8 [IPv4: Enabled, IPv6: Disabled]
Controller Status:       30/30 healthy
Proxy Status:            OK, ip 10.0.2.231, 0 redirects active on ports 10000-20000
Global Identity Range:   min 256, max 65535
Hubble:                  Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 45.90   Metrics: Ok
Encryption:              Disabled
Cluster health:          3/3 reachable   (2022-12-08T00:45:43Z)

Keywords: hops exceeded, TTL, tracert, mtr

pchaigno commented 1 year ago

Could you share a sysdump of the cluster and reproduction steps?

juliusrickert commented 1 year ago

Updated the initial report to add a Drive link to the sysdump. It was generated on a similarly configured cluster that exhibits the same issue.

Reproduction steps

  1. Create a (dual-stack) kubeadm cluster (a single-node cluster is sufficient)
  2. Add Cilium as the CNI with the following Helm values:
autoDirectNodeRoutes: true
bandwidthManager:
  bbr: true
  enabled: true
bpf:
  masquerade: true
enableIPv6Masquerade: false
hostFirewall:
  enabled: true
ipam:
  operator:
    clusterPoolIPv4MaskSize: 24
    clusterPoolIPv4PodCIDRList:
    - <IPv4 CIDR>
    clusterPoolIPv6MaskSize: 120
    clusterPoolIPv6PodCIDRList:
    - <IPv6 CIDR>
ipv4:
  enabled: true
ipv4NativeRoutingCIDR: 10.0.0.0/8
ipv6:
  enabled: true
ipv6NativeRoutingCIDR: ::/0
k8sServiceHost: <host>
k8sServicePort: <port>
kubeProxyReplacement: strict
loadBalancer:
  mode: dsr
operator:
  replicas: 1
tunnel: disabled
  3. Traceroute to a Pod, e.g. the CoreDNS Pod, from outside of the cluster (see the example below)
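An example of step 3, assuming a hypothetical CoreDNS Pod address (look up the actual Pod IP first):

  kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
  traceroute -6 <CoreDNS Pod IPv6>
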
pchaigno commented 1 year ago

Do you still have this issue if you disable BPF masquerading and the Host Firewall?

juliusrickert commented 1 year ago

I disabled both the host firewall and BPF masquerading (and BBR for the bandwidth manager, because it relies on BPF). The issue persists.
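
For reference, roughly how I toggled these via Helm (a sketch; the release name and namespace are assumptions):

  helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
    --set hostFirewall.enabled=false \
    --set bpf.masquerade=false \
    --set bandwidthManager.bbr=false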

IPv6 should be unaffected by the BPF masquerading, no?

pchaigno commented 1 year ago

IPv6 should be unaffected by the BPF masquerading, no?

Yep, good point given we don't currently support BPF masquerading for IPv6.

Are you doing the traceroute toward the pod IP address or some service with the pod as a backend? Do you see any drops reported by Cilium?

juliusrickert commented 1 year ago

Are you doing the traceroute toward the pod IP address or some service with the pod as a backend?

Towards the Pod IP from a machine outside of the cluster's Nodes and their L2 network. (Nodes are on the same L2 network.)

 9  <router's transport network>::2  33.091 ms  33.012 ms  32.916 ms
10  <network which contains Node>::112  11.002 ms  17.246 ms  10.986 ms
11  * * *
12  <network of the Pods of the Node>::68  11.257 ms  10.743 ms  10.517 ms

I just realised that the Node is indeed responding (hop 10) if BPF masquerading is disabled. There is still another non-responding hop (11) between the Node (10) and the Pod (12).

The Node's hop is missing when BPF masquerading is enabled:

 9  <router's transport network>::2  49.338 ms  49.278 ms  49.064 ms
10  * * *
11  <network of the Pods of the Node>::68  14.774 ms  10.840 ms  13.030 ms

Do you see any drops reported by Cilium?

I didn't see any drops reported by Hubble, but I may not have been looking in the right place.
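
For reference, commands along these lines should surface such drops (assuming a recent Hubble CLI and the Cilium agent DaemonSet being named cilium):

  hubble observe --verdict DROPPED
  kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop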

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

juliusrickert commented 1 year ago

@pchaigno let me know if you need any additional information.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

pchaigno commented 1 year ago

Next step would probably be to check for packet drops in the Linux stack with pwru.
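
For example, something along these lines on the affected Node (a sketch; pwru takes pcap-style filters and the exact flags vary by version):

  pwru --output-tuple 'icmp6 and host <Pod IPv6>'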

juliusrickert commented 1 year ago

Apologies for the delay! I wasn't able to make time to debug this issue further.

I have noticed that this issue is also reproducible when tracerouting from one Pod to another. When tracerouting to a Pod on the same Node, regardless of IPv6 or IPv4, there is a single hop between the Pods that does not respond. Similarly, when tracerouting to a Pod on a different Node, there are two hops which don't reply.
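
A quick way to reproduce the Pod-to-Pod case (a sketch; the debug image and the availability of traceroute inside it are assumptions):

  kubectl run tracer --rm -it --image=nicolaka/netshoot -- traceroute -6 <other Pod IPv6>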

Although I haven't been able to dig deeper, I hope this information may help to narrow down the issue.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 1 year ago

This issue has not seen any activity since it was marked stale. Closing.

julianwiedmann commented 1 year ago

IPv6 should be unaffected by the BPF masquerading, no?

Even without IPv6 BPF masquerading, we would still track host-originating traffic if it potentially conflicts with NodePort NAT connections to remote backends.

The Node's hop is missing when BPF masquerading is enabled:

This is most likely caused by the BPF SNAT logic dropping the ICMPV6_TIME_EXCEED packet: https://github.com/cilium/cilium/blob/fbbca8494eaa2417fdfe201cb4dcc262bb514ad6/bpf/lib/nat.h#L1755

julianwiedmann commented 1 year ago

https://github.com/cilium/cilium/pull/26674 might help here (it speaks about DSR-eligible traffic, but actually also skips over non-Service traffic such as ICMP).

Note that with BPF Masquerading enabled, we could still end up applying SNAT to the traffic (and hit the same issue), depending on what IPV6_MASQUERADE is selected: https://github.com/cilium/cilium/blob/fbbca8494eaa2417fdfe201cb4dcc262bb514ad6/bpf/lib/nat.h#L1619

juliusrickert commented 1 year ago

Thank you for looking into this issue and pinning it down, @julianwiedmann. Appreciate it!

Unfortunately, I don't have time at hand to test fixes at the moment. Have you been able to reproduce the issue?