Preface: I have migrated a cluster from Calico to Cilium, switched completely to eBPF mode, and removed all Calico and kube-proxy components. Policy enforcement is set to 'default', meaning communications are allowed by default. The host firewall is also enabled, but no cluster-wide / host policies are present so far. Routing mode is left at the default, encapsulation. The full final values file is attached at the end.
Issue: After the move, communication to the kubelet on port 10250 (metrics scraping) broke. Neither metrics-server nor Prometheus pods can connect to this endpoint to collect data.
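A quick way to reproduce the symptom from inside the cluster (the node IP, pod names, and image are placeholders, not taken from the report):

```shell
# From a regular (non-hostNetwork) pod, try the kubelet metrics endpoint.
# On a healthy cluster this returns an HTTP code (typically 401/403 without
# credentials); here it should time out instead.
kubectl -n kube-system run curl-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sk --max-time 5 "https://<NODE_IP>:10250/metrics" -o /dev/null -w '%{http_code}\n'

# For comparison, the kube-apiserver on the same node (hostNetwork) responds:
kubectl -n kube-system run curl-test2 --rm -it --image=curlimages/curl --restart=Never -- \
  curl -sk --max-time 5 "https://<NODE_IP>:6443/version" -o /dev/null -w '%{http_code}\n'
```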
The error that metrics-server prints during the failure:
Things tried:
1) Removed all NetworkPolicy and CiliumNetworkPolicy resources in the kube-system namespace to make sure metrics-server is unrestricted.
2) Stopped UFW on the nodes to rule it out. In this setup UFW doesn't restrict backnet communications at all; all k8s nodes communicate with no blocking policies at the OS level.
3) `hubble observe` for traffic to port 10250 shows it is forwarded, not dropped.
4) Tried adding an allowing CNP, which didn't change anything:
5) Ran `tcpdump` on the node: the packets reach the `cilium_net` interface (on the node where both metrics-server and the target kubelet reside), but `strace` confirms the kubelet process never receives the connection. So it gets lost somewhere between `cilium_net` and host routing.
6) Important note: communication to the kube-apiserver (node IP, port 6443) works fine, but that is a pod running in hostNetwork mode on the same node.
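For reference, the troubleshooting steps above roughly correspond to these commands (filters and process-lookup details are illustrative, not copied from the cluster):

```shell
# 1) Clear any policies in kube-system:
kubectl -n kube-system delete networkpolicy,ciliumnetworkpolicy --all

# 3) Watch Hubble verdicts for traffic towards the kubelet port:
hubble observe --to-port 10250 --follow

# 5) Capture on the Cilium host-side interface, and check whether the
#    kubelet process ever sees the connection (run as root on the node):
tcpdump -ni cilium_net 'tcp port 10250'
strace -f -e trace=network -p "$(pgrep -o kubelet)"
```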
Can I ask for any tips/hints on what could cause such behavior? Or any troubleshooting items to try?
Cilium Version
cilium-cli: v0.15.3 compiled with go1.20.4 on darwin/arm64
cilium image (default): v1.13.4
cilium image (stable): v1.14.0
cilium image (running): 1.13.4
Root cause: an `IPAddressAllow=` directive in /etc/systemd/system/kubelet.service that limited which source addresses could connect to the kubelet. This was a result of cluster security hardening; presumably the Cilium pod CIDR was not covered by the allow list, so systemd's cgroup-level IP filtering dropped connections from pod IPs before the kubelet ever saw them.
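To check for this filter and loosen it, something like the following should work (the drop-in file name and the CIDR are examples, not values from this cluster; `IPAddressAllow=` is a list setting, so a drop-in appends to it rather than replacing it):

```shell
# Inspect the effective unit, including drop-ins, for systemd IP filtering:
systemctl cat kubelet | grep -i IPAddress

# Possible fix: extend the allow list with the pod CIDR via a drop-in,
# then reload and restart (run as root):
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-pod-cidr.conf <<'EOF'
[Service]
IPAddressAllow=10.0.0.0/8
EOF
systemctl daemon-reload && systemctl restart kubelet
```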
Kernel Version
Host OS is Ubuntu 22.04
Kubernetes Version
Sysdump
cilium-sysdump-20230731-161450.zip
Relevant log output
tcpdump on cilium_net interface
hubble observe:
Anything else?
cilium installation values: