cloudius-systems / osv

OSv, a new operating system for the cloud.
osv.io

Rerouting localhost traffic for k8s pod functionality #1311

Open togoetha opened 2 weeks ago

togoetha commented 2 weeks ago

Short situation sketch: I'm currently working on mixed-runtime pod networking (Kubernetes compatible) as continued work on Feather, which runs OSv on KVM with default bridge networking.

To do full pod networking I have to mess with localhost traffic so it gets forwarded to other workloads outside the VM. The solution is eBPF based, but obviously I can't go running an eBPF program in a unikernel, so I have to intercept on the host side. At first glance it looks like I can simply customize the kernel to not start the loopback device, so that traffic to 127.0.0.1 gets sent to the default route, at which point I can TUN in on it (pun intended).
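The routing idea above can be illustrated with a toy longest-prefix-match lookup. This is only a sketch under stated assumptions: the route tables, prefixes and interface names (`lo`, `eth0`) are hypothetical, and a real network stack may apply extra special handling to loopback addresses that this model ignores.

```python
import ipaddress

# Hypothetical route tables: (destination prefix, outgoing interface).
WITH_LO = [
    ("127.0.0.0/8", "lo"),
    ("192.0.2.0/24", "eth0"),
    ("0.0.0.0/0", "eth0"),  # default route
]
# Same table with the loopback route removed.
WITHOUT_LO = [r for r in WITH_LO if r[1] != "lo"]


def pick_interface(routes, destination):
    """Return the interface of the most specific route matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, interface in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, interface))
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches)[1]


print(pick_interface(WITH_LO, "127.0.0.1"))
print(pick_interface(WITHOUT_LO, "127.0.0.1"))
```

In this model, once no loopback route exists, 127.0.0.1 only matches the default route and would leave through the regular interface, which is the point where host-side interception could happen.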

So my two issues:

Thanks for any info on this!

togoetha commented 2 weeks ago

Went over the options w.r.t. other runtimes; it seems the easiest kludge is to modify the hosts file so localhost refers to the bridge interface (which always has the same IP), where an OSv-specific eBPF program picks it up. And hope devs use "localhost" instead of 127.0.0.1. Removing the lo interface altogether would probably have resulted in a huge ARP mess.
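A minimal sketch of that hosts-file kludge, assuming a conventional hosts-file layout (address followed by names). The function names and the bridge address `192.0.2.1` are hypothetical placeholders, not taken from OSv or Feather. The second helper shows why the approach depends on applications using the name: a literal loopback address never goes through name lookup.

```python
import ipaddress


def remap_localhost(hosts_text, bridge_ip):
    """Return hosts-file text in which the name 'localhost' maps to bridge_ip."""
    ipaddress.ip_address(bridge_ip)  # raises ValueError on a malformed address
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments when parsing the entry.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and "localhost" in fields[1:]:
            # Keep other names on the entry, drop only 'localhost'.
            # Any trailing comment on a rewritten line is dropped too.
            other_names = [n for n in fields[1:] if n != "localhost"]
            if other_names:
                kept.append(" ".join([fields[0]] + other_names))
            continue
        kept.append(line)
    kept.append(f"{bridge_ip} localhost")
    return "\n".join(kept) + "\n"


def is_literal_loopback(target):
    """True if target is a literal loopback IP, which bypasses the hosts file."""
    try:
        return ipaddress.ip_address(target).is_loopback
    except ValueError:
        return False  # a host name, resolved through the hosts file


sample = "127.0.0.1 localhost localhost.localdomain\n::1 localhost\n192.0.2.10 db\n"
print(remap_localhost(sample, "192.0.2.1"))
```

Workloads that connect to `localhost` would then resolve to the bridge address, while anything for which `is_literal_loopback` returns true (such as `127.0.0.1` or `::1`) would still stay inside the unikernel.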

Feel free to close if there are no other solutions.

wkozaczuk commented 2 weeks ago

Hi,

I am not sure I fully understand your setup. I think your best bet is to experiment with the options you listed and see if they work. Feel free to send patches to illustrate what exactly you are trying to achieve.

BTW I will try to find time to look at Feather.

togoetha commented 2 weeks ago

Hopefully the diagram will clear it up a bit (IPv6 addressing, so I hope the IPv6 branch works out with my setup). Imagine there's a Kubernetes pod that contains a mix of container, unikernel and possibly wasm workloads. There are various possible reasons to try this, like security or performance. One property of being in a pod is that all workloads can reach each other at "localhost", so the main workload doesn't have to dial out of its environment for basic services like logging or database access. For containers, that's trivial because they can share a network namespace and the same loopback device.

Now the hard part is getting the "localhost" traffic from a microVM to a container network namespace and vice versa, since they don't share interfaces. For containers (and, depending on the runtime, wasm) it's easy to hijack the loopback adapter ingress with an eBPF program, duplicate packets, and send them to the host network namespace to reroute them to the qemu-kvm bridge. But between the bridge and a unikernel is where it kind of breaks down at the moment, since I don't really see a way to translate the packets to "localhost" when they hit the network interface inside the unikernel (apart from modifying the hosts file); eBPF programs can operate on the tun interface though.

image

I'll try out some options and update when something works with minimal code changes. It's more like a "nice to have" for IoT edge computing anyways, so not an issue if it doesn't work out.