Closed: mtb-xt closed this issue 7 months ago
Thanks for reporting! Like the message said, I was sure that this was going to come around to bite us sooner or later. 🤣
I'll try to take a look at it this week. 🙂
Sorry it took so long to wrap back around to this. Thanks for posting your output along with your daemonset configuration!
Looking at the log, it looks like you're having two distinct errors.
W0401 00:45:56.421160 20878 linux_networking.go:637] Able to see the following interfaces: enp1s0 kube-bridge kube-dummy-if lo veth0353ae9f veth454d77b5 veth47bce93a veth9da5270f vethc27ffbcc vethc557e15c vethcdf11df4
W0401 00:45:56.421190 20878 linux_networking.go:638] If one of the above is not eth0 it is likely, that the assumption that we've hardcoded in kube-router is wrong, please report this as a bug along with this output
E0401 00:45:56.421271 20878 hairpin_controller.go:46] unable to set hairpin mode for endpoint 172.21.100.21, its possible that hairpinning will not work as expected. Error was: failed to find the interface ID inside the container NS for endpoint IP: 172.21.100.21, due to: unable to read the ifaceID inside the container from /proc/4755/cwd/sys/class/net/eth0/iflink, output was: , error was: open /proc/4755/cwd/sys/class/net/eth0/iflink: no such file or directory
This one looks like it might be because the pod in question is in the HostNetwork. Are you able to confirm that 172.21.100.21 is a node's IP address?
If so, then we probably need to do some better work to exclude HostNetwork'd pods as I don't think hairpinning is needed for HostNetworked pods.
W0401 00:45:56.462925 20878 linux_networking.go:628] Could not list: /proc/5985/cwd/sys/class/net due to: open /proc/5985/cwd/sys/class/net: no such file or directory
W0401 00:45:56.463032 20878 linux_networking.go:629] If above error was 'no such file or directory' it may be that you haven't enabled 'hostPID=true' in your kube-router deployment
E0401 00:45:56.463135 20878 hairpin_controller.go:46] unable to set hairpin mode for endpoint 10.244.3.175, its possible that hairpinning will not work as expected. Error was: failed to find the interface ID inside the container NS for endpoint IP: 10.244.3.175, due to: unable to read the ifaceID inside the container from /proc/5985/cwd/sys/class/net/eth0/iflink, output was: , error was: open /proc/5985/cwd/sys/class/net/eth0/iflink: no such file or directory
This one is harder. I'm not sure why it wouldn't be able to access /proc/5985/cwd/sys/class/net/eth0/iflink since you have hostPID: true in your deployment. If you're still getting this error, would you be willing to look at the current pid you're getting and try listing the directories in order both from within the kube-router container and from the node directly as root?
Something like this, from the node directly as root:
ls /proc/<current_pid_here>
ls /proc/<current_pid_here>/cwd
ls /proc/<current_pid_here>/cwd/sys
ls /proc/<current_pid_here>/cwd/sys/class/net
...
And from within the kube-router container:
kubectl exec -ti -n kube-system <kube-router-pod> -- /bin/bash
ls /proc/<current_pid_here>
ls /proc/<current_pid_here>/cwd
ls /proc/<current_pid_here>/cwd/sys
ls /proc/<current_pid_here>/cwd/sys/class/net
It would also be helpful to know a bit more about what the process behind that PID is.
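If it helps, the manual walk above can be sketched as a small Go program that reports the first missing component of the path (illustrative only; "self" stands in for the PID from the error message so the example runs anywhere):

```go
// Diagnostic sketch: check each component of /proc/<pid>/cwd/sys/class/net
// in order and report the first one that cannot be stat'd. Run both on the
// node and inside the kube-router container to compare mount namespaces.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// firstMissing returns the first path element under base that does not
// exist, or "" if the whole chain is present.
func firstMissing(base string, components []string) string {
	p := base
	for _, c := range components {
		p = filepath.Join(p, c)
		if _, err := os.Stat(p); err != nil {
			return p
		}
	}
	return ""
}

func main() {
	pid := "self" // substitute the PID from the error message
	missing := firstMissing(filepath.Join("/proc", pid),
		[]string{"cwd", "sys", "class", "net"})
	if missing == "" {
		fmt.Println("full path is visible from this mount namespace")
	} else {
		fmt.Printf("walk stops at: %s\n", missing)
	}
}
```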
For the record: I have experienced it too on my v2 trial (which failed), but I didn't investigate at all; after finding this bug report I just subscribed and decided to wait and see how it goes :-)
When I say "experienced it too", I mean both symptoms: 1) interface != eth0, 2) not being able to find the pid while hostPID: true.
If no one else provides more details, I will, possibly in 1-2 weeks from now. But given that it's now at least 2 people with obviously very different cluster configurations, it should be reproducible for almost anybody.
> Sorry it took so long to wrap back around to this. Thanks for posting your output along with your daemonset configuration!
No worries, let's see if we can figure it out.
> Looking at the log, it looks like you're having two distinct errors.
The first one is:
> This one looks like it might be because the pod in question is in the HostNetwork. Are you able to confirm that 172.21.100.21 is a node's IP address?
Yes, that's correct, 172.21.100.0/24 is the host network. I got thrown off by the message about eth0 because I do not have such an interface.
> If so, then we probably need to do some better work to exclude HostNetwork'd pods as I don't think hairpinning is needed for HostNetworked pods.
The second one is:
> This one is harder. I'm not sure why it wouldn't be able to access /proc/5985/cwd/sys/class/net/eth0/iflink since you have hostPID: true in your deployment. If you're still getting this error, would you be willing to look at the current pid you're getting and try listing the directories in order both from within the kube-router container and from the node directly as root?
Fresh errors from the log:
E0418 05:30:59.539044 2501977 hairpin_controller.go:46] unable to set hairpin mode for endpoint 10.244.3.12, its possible that hairpinning will not work as expected. Error was: failed to find the interface ID inside the container NS for endpoint IP: 10.244.3.12, due to: unable to read the ifaceID inside the container from /proc/2395058/cwd/sys/class/net/eth0/iflink, output was: , error was: open /proc/2395058/cwd/sys/class/net/eth0/iflink: no such file or directory
W0418 05:30:59.550023 2501977 linux_networking.go:628] Could not list: /proc/2397329/cwd/sys/class/net due to: open /proc/2397329/cwd/sys/class/net: no such file or directory
W0418 05:30:59.550052 2501977 linux_networking.go:629] If above error was 'no such file or directory' it may be that you haven't enabled 'hostPID=true' in your kube-router deployment
E0418 05:30:59.550071 2501977 hairpin_controller.go:46] unable to set hairpin mode for endpoint 10.244.3.18, its possible that hairpinning will not work as expected. Error was: failed to find the interface ID inside the container NS for endpoint IP: 10.244.3.18, due to: unable to read the ifaceID inside the container from /proc/2397329/cwd/sys/class/net/eth0/iflink, output was: , error was: open /proc/2397329/cwd/sys/class/net/eth0/iflink: no such file or directory
From the host:
root@sentinel-worker:/proc/2397329# ls /proc/2397329/cwd
lost+found store
And from the pod:
root@sentinel-worker:~# ls /proc/2397329/cwd/
lost+found store
There's no sysfs in there?..
> It would also be helpful to know a bit more about what the process behind that PID is.
It's ocisstoreserver - part of OwnCloud Infinite Scale (https://github.com/owncloud/ocis), deployed with their helm chart: https://github.com/owncloud/ocis-charts/blob/v0.5.0/charts/ocis/templates/store/deployment.yaml
I've checked several other PIDs with errors, and none of them have sysfs inside their cwd.
Thanks for doing the legwork on this @mtb-xt. That really helps.
For the hostnetwork'd pods, we'll need to change the logic to ignore those.
For the containers without a sysfs, I wasn't really expecting a container not to have that, so I'll have to dig in and see if I can figure out what causes a container to have or not have it. Maybe FROM scratch containers don't mount it in? Likely this will end up changing the error to a warning, with maybe some extra docs. Unfortunately, if a container doesn't have a sysfs inside, then there isn't really a way that hairpinning in its current incarnation can work.
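A rough sketch of what demoting that error to a warning could look like (an assumed shape only, not the actual fix): treat a missing sysfs path inside the container as a skip-with-warning condition rather than a failure:

```go
// Sketch: read eth0's iflink via /proc; if the sysfs path simply isn't
// there (e.g. a FROM scratch image), warn and skip hairpin setup instead
// of reporting an error. Illustrative only, not kube-router's real code.
package main

import (
	"fmt"
	"os"
	"strings"
)

// readIfLink returns the iflink index for eth0 in the container whose
// init process is pid. An empty string with a nil error means "no sysfs
// visible, hairpin setup skipped for this endpoint".
func readIfLink(pid int) (string, error) {
	path := fmt.Sprintf("/proc/%d/cwd/sys/class/net/eth0/iflink", pid)
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		// Warning, not error: hairpinning can't be configured here.
		fmt.Printf("W: %s not found, skipping hairpin for pid %d\n", path, pid)
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	if id, err := readIfLink(os.Getpid()); err == nil && id == "" {
		fmt.Println("hairpin setup skipped")
	}
}
```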
This should be fixed via #1657 which was added to kube-router release v2.1.1
Can you confirm if this fixed the errors that you were seeing? Additionally, please make sure that you have "hairpinMode":true in your CNI configuration for kube-router. If you need to add this attribute, you'll need to restart kubelet after the change and recreate any pods that require hairpinning for it to function correctly.
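For reference, a kube-router CNI config (e.g. /etc/cni/net.d/10-kuberouter.conflist) with the attribute set might look roughly like this; everything other than "hairpinMode" is illustrative, based on kube-router's documented bridge-plugin setup:

```json
{
  "cniVersion": "0.3.0",
  "name": "mynet",
  "plugins": [
    {
      "name": "kubernetes",
      "type": "bridge",
      "bridge": "kube-bridge",
      "isDefaultGateway": true,
      "hairpinMode": true,
      "ipam": { "type": "host-local" }
    }
  ]
}
```

hairpinMode here is an option of the standard CNI bridge plugin, which sets hairpin mode on the veth's bridge port.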
> This should be fixed via #1657 which was added to kube-router release v2.1.1
> Can you confirm if this fixed the errors that you were seeing? Additionally, please make sure that you have "hairpinMode":true in your CNI configuration for kube-router. If you need to add this attribute, you'll need to restart kubelet after the change and recreate any pods that require hairpinning for it to function correctly.
It looks like it worked, I can't see the errors in the logs anymore.
What happened? I can see this output in my kube-router logs. I'm trying to get DSR to work on my cluster.
What did you expect to happen? None of my interfaces are called 'eth0', unfortunately, so I'd expect kube-router to detect the interface name in some other way.
How can we reproduce the behavior you experienced? Steps to reproduce the behavior:
Screenshots / Architecture Diagrams / Network Topologies Standard BGP full mesh in kube-router.
System Information (please complete the following information):
Kube-Router Version (kube-router --version): v2.1.0, built on 2024-03-02T21:50:00+0000, go1.21.7
Kubernetes Version (kubectl version): 1.28.2
Logs, other output, metrics: Full pod logs and daemonset manifest: https://gist.github.com/mtb-xt/bbd297e0eb85b9c08c23cec1f21541f3
Additional context: Note that I'm not using the k0s built-in kube-router, but a generic kube-router installation as a daemonset.