Closed: mtb-xt closed this issue 7 months ago
Thanks for reporting! Like the message said, I was sure that this was going to come around to bite us sooner or later. 🤣
I'll try to take a look at it this week. 🙂
Sorry it took so long to wrap back around to this. Thanks for posting your output along with your daemonset configuration!
Looking at the log, it looks like you're having two distinct errors.
W0401 00:45:56.421160 20878 linux_networking.go:637] Able to see the following interfaces: enp1s0 kube-bridge kube-dummy-if lo veth0353ae9f veth454d77b5 veth47bce93a veth9da5270f vethc27ffbcc vethc557e15c vethcdf11df4
W0401 00:45:56.421190 20878 linux_networking.go:638] If one of the above is not eth0 it is likely, that the assumption that we've hardcoded in kube-router is wrong, please report this as a bug along with this output
E0401 00:45:56.421271 20878 hairpin_controller.go:46] unable to set hairpin mode for endpoint 172.21.100.21, its possible that hairpinning will not work as expected. Error was: failed to find the interface ID inside the container NS for endpoint IP: 172.21.100.21, due to: unable to read the ifaceID inside the container from /proc/4755/cwd/sys/class/net/eth0/iflink, output was: , error was: open /proc/4755/cwd/sys/class/net/eth0/iflink: no such file or directory
This one looks like it might be because the pod in question is in the HostNetwork. Are you able to confirm that 172.21.100.21 is a node's IP address?
If so, then we probably need to do some better work to exclude HostNetwork'd pods as I don't think hairpinning is needed for HostNetworked pods.
W0401 00:45:56.462925 20878 linux_networking.go:628] Could not list: /proc/5985/cwd/sys/class/net due to: open /proc/5985/cwd/sys/class/net: no such file or directory
W0401 00:45:56.463032 20878 linux_networking.go:629] If above error was 'no such file or directory' it may be that you haven't enabled 'hostPID=true' in your kube-router deployment
E0401 00:45:56.463135 20878 hairpin_controller.go:46] unable to set hairpin mode for endpoint 10.244.3.175, its possible that hairpinning will not work as expected. Error was: failed to find the interface ID inside the container NS for endpoint IP: 10.244.3.175, due to: unable to read the ifaceID inside the container from /proc/5985/cwd/sys/class/net/eth0/iflink, output was: , error was: open /proc/5985/cwd/sys/class/net/eth0/iflink: no such file or directory
This one is harder. I'm not sure why it wouldn't be able to access /proc/5985/cwd/sys/class/net/eth0/iflink since you have hostPID: true in your deployment. If you're still getting this error, would you be willing to look at the current pid you're getting and try listing the directories in order both from within the kube-router container and from the node directly as root?
Something like this, from the node directly as root:
ls /proc/<current_pid_here>
ls /proc/<current_pid_here>/cwd
ls /proc/<current_pid_here>/cwd/sys
ls /proc/<current_pid_here>/cwd/sys/class/net
...
And from within the kube-router container:
kubectl exec -ti -n kube-system <kube-router-pod> -- /bin/bash
ls /proc/<current_pid_here>
ls /proc/<current_pid_here>/cwd
ls /proc/<current_pid_here>/cwd/sys
ls /proc/<current_pid_here>/cwd/sys/class/net
It would also be helpful to know a bit more about what the process behind that PID is.
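If it helps, the manual walk above can be sketched as a small Go program that reports the first missing component of the path (illustrative only; "self" stands in for the PID from the error message so the example runs anywhere):

```go
// Diagnostic sketch: check each component of /proc/<pid>/cwd/sys/class/net
// in order and report the first one that cannot be stat'd. Run both on the
// node and inside the kube-router container to compare mount namespaces.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// firstMissing returns the first path element under base that does not
// exist, or "" if the whole chain is present.
func firstMissing(base string, components []string) string {
	p := base
	for _, c := range components {
		p = filepath.Join(p, c)
		if _, err := os.Stat(p); err != nil {
			return p
		}
	}
	return ""
}

func main() {
	pid := "self" // substitute the PID from the error message
	missing := firstMissing(filepath.Join("/proc", pid),
		[]string{"cwd", "sys", "class", "net"})
	if missing == "" {
		fmt.Println("full path is visible from this mount namespace")
	} else {
		fmt.Printf("walk stops at: %s\n", missing)
	}
}
```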
For the record: I have experienced it too on my v2 trial (which failed), but I didn't investigate at all; after finding this bug report I just subscribed and decided to wait and see how it goes :-)
When I say "experienced it too", I mean both symptoms: 1) interface != eth0, 2) not being able to find the pid while hostPID: true.
If no one else provides more details, I will, possibly in 1-2 weeks from now. But given that it's now at least 2 people with obviously very different cluster configurations, it should be reproducible for almost anybody.
> Sorry it took so long to wrap back around to this. Thanks for posting your output along with your daemonset configuration!
No worries, let's see if we can figure it out.
> Looking at the log, it looks like you're having two distinct errors.
The first one is:
> This one looks like it might be because the pod in question is in the HostNetwork. Are you able to confirm that 172.21.100.21 is a node's IP address?
Yes, that's correct, 172.21.100.0/24 is the host network. I got thrown off by the message about eth0 because I do not have such an interface.
> If so, then we probably need to do some better work to exclude HostNetwork'd pods as I don't think hairpinning is needed for HostNetworked pods.
The second one is:
> This one is harder. I'm not sure why it wouldn't be able to access /proc/5985/cwd/sys/class/net/eth0/iflink since you have hostPID: true in your deployment. If you're still getting this error, would you be willing to look at the current pid you're getting and try listing the directories in order both from within the kube-router container and from the node directly as root?
Fresh errors from the log:
E0418 05:30:59.539044 2501977 hairpin_controller.go:46] unable to set hairpin mode for endpoint 10.244.3.12, its possible that hairpinning will not work as expected. Error was: failed to find the interface ID inside the container NS for endpoint IP: 10.244.3.12, due to: unable to read the ifaceID inside the container from /proc/2395058/cwd/sys/class/net/eth0/iflink, output was: , error was: open /proc/2395058/cwd/sys/class/net/eth0/iflink: no such file or directory
W0418 05:30:59.550023 2501977 linux_networking.go:628] Could not list: /proc/2397329/cwd/sys/class/net due to: open /proc/2397329/cwd/sys/class/net: no such file or directory
W0418 05:30:59.550052 2501977 linux_networking.go:629] If above error was 'no such file or directory' it may be that you haven't enabled 'hostPID=true' in your kube-router deployment
E0418 05:30:59.550071 2501977 hairpin_controller.go:46] unable to set hairpin mode for endpoint 10.244.3.18, its possible that hairpinning will not work as expected. Error was: failed to find the interface ID inside the container NS for endpoint IP: 10.244.3.18, due to: unable to read the ifaceID inside the container from /proc/2397329/cwd/sys/class/net/eth0/iflink, output was: , error was: open /proc/2397329/cwd/sys/class/net/eth0/iflink: no such file or directory
From the host:
root@sentinel-worker:/proc/2397329# ls /proc/2397329/cwd
lost+found store
And from the pod:
root@sentinel-worker:~# ls /proc/2397329/cwd/
lost+found store
There's no sysfs in there?..
> It would also be helpful to know a bit more about what the process behind that PID is.
It's ocisstoreserver - part of OwnCloud Infinite Scale (https://github.com/owncloud/ocis), deployed with their helm chart: https://github.com/owncloud/ocis-charts/blob/v0.5.0/charts/ocis/templates/store/deployment.yaml
I've checked several other PIDs with errors, and none of them have sysfs inside their cwd.
Thanks for doing the legwork on this @mtb-xt. That really helps.
For the hostnetwork'd pods, we'll need to change the logic to ignore those.
For the containers without a sysfs, I wasn't really expecting a container not to have that, so I'll have to dig in and see if I can figure out what causes a container to have or not have it. Maybe FROM scratch containers don't mount it in? Likely this will end up changing the error to a warning, with maybe some extra docs. Unfortunately, if a container doesn't have a sysfs inside, then there isn't really a way that hairpinning in its current incarnation can work.
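A rough sketch of what demoting that error to a warning could look like (an assumed shape only, not the actual fix): treat a missing sysfs path inside the container as a skip-with-warning condition rather than a failure:

```go
// Sketch: read eth0's iflink via /proc; if the sysfs path simply isn't
// there (e.g. a FROM scratch image), warn and skip hairpin setup instead
// of reporting an error. Illustrative only, not kube-router's real code.
package main

import (
	"fmt"
	"os"
	"strings"
)

// readIfLink returns the iflink index for eth0 in the container whose
// init process is pid. An empty string with a nil error means "no sysfs
// visible, hairpin setup skipped for this endpoint".
func readIfLink(pid int) (string, error) {
	path := fmt.Sprintf("/proc/%d/cwd/sys/class/net/eth0/iflink", pid)
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		// Warning, not error: hairpinning can't be configured here.
		fmt.Printf("W: %s not found, skipping hairpin for pid %d\n", path, pid)
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	if id, err := readIfLink(os.Getpid()); err == nil && id == "" {
		fmt.Println("hairpin setup skipped")
	}
}
```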
This should be fixed via #1657 which was added to kube-router release v2.1.1
Can you confirm if this fixed the errors that you were seeing? Additionally, please make sure that you have "hairpinMode":true in your CNI configuration for kube-router. If you need to add this attribute, you'll need to restart kubelet after the change and recreate any pods that require hairpinning for it to function correctly.
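For reference, a kube-router CNI config (e.g. /etc/cni/net.d/10-kuberouter.conflist) with the attribute set might look roughly like this; everything other than "hairpinMode" is illustrative, based on kube-router's documented bridge-plugin setup:

```json
{
  "cniVersion": "0.3.0",
  "name": "mynet",
  "plugins": [
    {
      "name": "kubernetes",
      "type": "bridge",
      "bridge": "kube-bridge",
      "isDefaultGateway": true,
      "hairpinMode": true,
      "ipam": { "type": "host-local" }
    }
  ]
}
```

hairpinMode here is an option of the standard CNI bridge plugin, which sets hairpin mode on the veth's bridge port.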
> This should be fixed via #1657 which was added to kube-router release v2.1.1
> Can you confirm if this fixed the errors that you were seeing? Additionally, please make sure that you have "hairpinMode":true in your CNI configuration for kube-router. If you need to add this attribute, you'll need to restart kubelet after the change and recreate any pods that require hairpinning for it to function correctly.
It looks like it worked, I can't see the errors in the logs anymore.
What happened? I can see this output in my kube-router logs. I'm trying to get DSR to work on my cluster.
What did you expect to happen? None of my interfaces are called 'eth0', unfortunately, so I'd expect kube-router to detect the interface name in some other way.
How can we reproduce the behavior you experienced? Steps to reproduce the behavior:
Screenshots / Architecture Diagrams / Network Topologies Standard BGP full mesh in kube-router.
System Information (please complete the following information):
Kube-Router Version (kube-router --version): v2.1.0, built on 2024-03-02T21:50:00+0000, go1.21.7
Kubernetes Version (kubectl version): 1.28.2
Logs, other output, metrics: Full pod logs and daemonset manifest: https://gist.github.com/mtb-xt/bbd297e0eb85b9c08c23cec1f21541f3
Additional context: Note that I'm not using the k0s built-in kube-router, but a generic kube-router installation as a daemonset.