majek commented 1 year ago

Hi,

Firecracker doesn't have fast enough networking for many users. Firecracker networking benchmarks claiming tens of Gbps use TCP and use network offloads (TSO). However, often for real users these assumptions can't be met. For example, QUIC uses UDP, and when using XDP in the guest, the offloads must be disabled.

Under such conditions - UDP and no offloads - firecracker network performance is dismal. We can get only 1-2Gbps. This is really understandable once you realize how the tap code works.

Other kvm-based virtualization systems can do much better, they typically do this by utilizing vhost-net host kernel acceleration.

I'm considering working on adding vhost-net support to firecracker. Vhost-net is a host kernel API, exposing tap device in a format compatible with virtio-net guest driver. With shared memory and ioctl's dance it's possible to greatly speed up the data path for networking. Host vhost-net kernel thread would copy network data directly into the kvm guest memory. It's possible to set it up in such a way that firecracker process is not touching neither the data nor interrupts when doing networking. This saves loads of context switches, greatly improves latency and reduces CPU usage.

It doesn't solve the scalability issue - a naive vhost-net implementation would still only be single-queue and use one vCPU in the guest. This is a separate issue.

Vhost-net is a well established API. From the previous discussions it feels the firecracker community was not in favor of vhost-net. Vhost-net exposes the host kernel, so it extends the attack surface against the host. However, over the years vhost-net matured, the implementation strengthened and given its benefits it might be a good time to reconsider it.

I think with vhost-net we should be able to increase the network performance for UDP + disabled offloads from 1-2Gbps to 7-8Gbps. Users of vhost-net would see a reduced CPU usage for networking-heavy virtual machines and improved latency.

When looking at adding vhost-net to firecracker I had some concerns:

(1) It's unclear how to support MMDS with vhost-net. Is that a problem?

(2) It's unclear how to deal with networking rate limits.

I think both of these features could be supported by doing ebpf on the host side. I'm actually surprised by software rate-limit implementation done in firecracker since it could be done much more efficiently in bpf. I'm not looking at vsock at this stage.

I would like to understand if:

the maintainers are generally open for adding optional vhost-net support
there are any extra conditions that must be met to consider the vhost-net code

In a perfect world I would like to have a vhost-net networking option in firecracker. Users prioritizing perceived security could use traditional data path, users prioritizing lower cpu usage and better network scalability could opt-in for vhost-net.

Marek

acatangiu commented 1 year ago

I guess the answer to whether vhost-net or vhost-block will be supported by firecracker comes down to this past discussion, and in particular this:

The main focus is on maintaining the Firecracker security barrier. I.e. Firecracker must control all data exchanged between the guest and the host.

and

Vhost-net exposes the host kernel, so it extends the attack surface against the host. However, over the years vhost-net matured, the implementation strengthened and given its benefits it might be a good time to reconsider it.

Doing a shallow search of vhost-related CVEs does show a downward trend over the years, so it has indeed matured - but the crux of the matter of having considerably more attack surface is still a pain point - also this "extra" attack surface is directly to host kernel so it's bypassing most other defense-in-depth layers of firecracker (de-privileged process, cgroups, namespaces, seccomp, etc) - so the risk while small has higher potential of big impact.

This is not to say I am personally against vhost-* emulation - it is indeed faster and more efficient. I am simply offering some background and context for the decision-makers.

bchalios commented 1 year ago

Hi Marek,

Thanks for getting in touch and the input, I think you are raising very valid points. On our side we know that IO performance in Firecracker is not what it could be and we have recently started internal discussions on defining the problems and sketching potential solutions. We plan to actively work on it during the second part of the year.

vhost is definitely one of the ways to go. While vhost-net is the easiest option, we have also been considering vhost-user. The latter, apart from promising performance advantages would give us a lot more flexibility in selecting networking back-ends more easily (no changes would be needed in Firecracker itself).

We will keep the community posted on updates on our side, but feedback at this point from you and others that are interested in this is valuable.

Some remarks on your comments:

Vhost-net is a well established API. From the previous discussions it feels the firecracker community was not in favor of vhost-net. Vhost-net exposes the host kernel, so it extends the attack surface against the host. However, over the years vhost-net matured, the implementation strengthened and given its benefits it might be a good time to reconsider it.

The security aspect of vhost-net is still important for us. vhost-net exposes the host kernel to guests, without Firecracker intermediation, so we need to think about it very carefully.

Under such conditions - UDP and no offloads - firecracker network performance is dismal. We can get only 1-2Gbps. This is really understandable once you realize how the tap code works.

It sounds that this might be orthogonal to vhost-net or no vhost-net discussion. Do you maybe have throughput numbers for UDP workloads where you just disable offloading in the Firecracker emulation code?

(1) It’s unclear how to support MMDS with vhost-net. Is that a problem?

At the moment MMDS is supported from Firecracker, so any re-architecture should take it into account. BPF is one way, another would be to decouple MMDS from the emulation of the Network device all together.

I would like to understand if:

the maintainers are generally open for adding optional vhost-net support

We are definitely open to discussion at this point for finding a proper solution for this.

there are any extra conditions that must be met to consider the vhost-net code

Rate limiting is a must have. MMDS too, at the time being. More requirements might occur once we dive deeper in the problem.

DemiMarie commented 1 year ago

One option would be vhost-user + DPDK + Vector Packet Processing.

kalyazin commented 1 year ago

Hi @majek. While considering support for vhost-net in Firecracker, we would like to gain confidence in the following areas.

over the years vhost-net matured, the implementation strengthened

One of the major concerns is vhost-net involves kernel in the virtqueue processing. If something goes wrong, the blast radius may be larger (potentially affecting other microVMs/tenants) compared to processing in userspace where only a single microVM/tenant would be affected. Do you have specific data that convinces you that vhost-net is sufficiently safe to use in a multi-tenant environment?

It's unclear how to deal with networking rate limits.

We would certainly need to keep the capability of rate limiting in the network device. Do you have some concrete ideas or references of how this can be implemented via eBPF or is it just an assumption?

It's unclear how to support MMDS with vhost-net. Is that a problem?

We would like to decouple MMDS from the networking stack. It probably does not make much sense to bind it to vhost-net, but we would need to have an alternative way of supporting it.

there are any extra conditions that must be met to consider the vhost-net code

It would also be very helpful to estimate the potential performance gain for various workloads. Have you been doing some analysis you could share that let you obtain the numbers you provided?

the maintainers are generally open for adding optional vhost-net support

If the concerns above are resolved, we would certainly be looking into supporting vhost-net.

Thanks for opening this discussion!

xmarcalx commented 1 year ago

Hi guys,

do you have any follow up on the previous questions? Are you still interested in this feature?

majek commented 1 year ago

Hi, a quick update. We have a hacky local firecracker fork with vhost networking. We're working on making the code upstreamable, it's harder than anticipated. We intend to: provide convincing benchmarks, provide code that could be used to bootstrap discussion. Long term we are likely to add multiqueue on top of that (which is non trivial considering mmio limitations).

Noah-Kennedy commented 1 year ago

A question for maintainers: would you prefer that this be implemented via a new device, or by putting multiple backends in the net device and selecting which one to use based on a setting in the vm config?

kalyazin commented 1 year ago

Hi @Noah-Kennedy . The path we would like to follow at the moment for the cases similar to this one is:

keep a single API endpoint for all devices of the kind (eg /network-interfaces).
use backend-specific parameters to differentiate devices within a kind (eg vhost object in the network interface properties to contain vhost-net-specific arguments)
use a separate structure for a device within a kind (eg src/vmm/src/devices/virtio/net/vhost/*). This may require moving the existing device one level down in the directory structure.

We are also thinking about introducing device subtypes similar to types to facilitate differentiation between devices within the kind.

We are hoping to publish a PR that would illustrate these points in a week or two.

kalyazin commented 1 year ago

Hi @Noah-Kennedy . There is a draft PR for adding a vhost-user block device: https://github.com/firecracker-microvm/firecracker/pull/4069 . Please let us know if a similar approach is not easily applicable to the vhost-net device for some reason.

DemiMarie commented 11 months ago

One option would be vhost-user + DPDK + Vector Packet Processing.

To elaborate on this:

An alternative approach to getting rid of context switches is to not switch contexts at all. Instead of having the networking stack on the same core as the user workload, run it on a different core, and rely on either asynchronous cross-core notifications or polling. DPUs take this to the extreme by running the networking code on completely different hardware.

kalyazin commented 10 months ago

Hi @majek . While our team are looking at the prerequisite PR, would you mind sharing your thoughts on my question earlier in the thread:

One of the major concerns is vhost-net involves kernel in the virtqueue processing. If something goes wrong, the blast radius may be larger (potentially affecting other microVMs/tenants) compared to processing in userspace where only a single microVM/tenant would be affected. Do you have specific data that convinces you that vhost-net is sufficiently safe to use in a multi-tenant environment?

While discussing internally, there have multiple opinions been voiced favouring the existing virtio net device or a potential vhost-user-net implementation over vhost-net for that exact reason.

majek commented 10 months ago

@kalyazin I am arguing here for a "less secure" design, so of course this is not going to be an easy sell.

The fundamental assumption is that firecracker without fast network is close to useless.

problem statement

With current design, network performance is a problem for me. It's not a problem for you, because my workload is different than yours - I need UDP. For offloaded TCP I agree, there is no performance issue. You can argue that firecracker is designed only for offloaded TCP workloads, and that would finish the discussion. However, that would be quite sad and disappointing.

vhost-net vs existing virtio net device

In current design all packets need to passes between tap->firecracker process->kvm, that severely limits pps. With offloads there are at most 13 packets in-flight, which might limit throughput further in multiple-concurrent-flows case.

I can supply benchmarks, but the vhost-net is dramatically faster than current design. This is especially for UDP and non-offloaded TCP.

Is the performance gap big enough to "compromise" on security? I would say yes. I'm not arguing for removal of classic code. I'm arguing that apart from default classic code, firecracker should have a vhost-net option. This is for for users that require this extra networking oomph. I believe it's totally possible to just support two types of network devices - classic, and vhost-net.

Is vhost-net insecure? Historically, it had major bugs, Some CVE searches vhost_net vhost-net vhost. But following typical software maturity, over last 3-4 years, the number of bugs dropped and the severity of issues reduced. I would argue that at this stage so many security professionals attempted to break vhost-net, that it's in the "reasonable for limited third-party users" category. Multiple hosting providers offer vhost-net accelerated networking to untrusted users. This is quite standard really, and it's surprising that firecracker doesnt do that yet. Surely, we don't need to enable all the possible feature flags on vhost-net. With vhost-net can live without advanced features like VIRTIO_F_RING_INDIRECT_DESC or VIRTIO_NET_F_MRG_RXBUF, in the name of lowering the complexity and attack surface on the kernel side.

From my point of view, the performance benefit is substantial enough to advocate for vhost-net. In our deployment we start off with classic net device, and upgrade specific users to faster vhost-net when the container is used enough. The alternative we have is qemu + vhost-net, which is less secure than firecracker + vhost-net.

In other words: firecracker + vhost-net would occupy a place in the landscape that is currently unoccupied. You can have slow firecracker when you have all the security but poor speed. You can have qemu, with all the speed but low security. There is little in-between. firecracker + vhost-net is a very reasonable compromise. You got all the security from traditional firecracker, compromise a bit on exposing quite mature and pentested vhost-net interface and get reasoanble speed. The best speed would be with vhost-net + multiqueue, but this is discussion for another time. (we have working MQ code internally, fyi)

vhost-net vs vhost-user-net

Using the qemu nomenclature, vhost-user requres 'front-end' software, usually dpdk or snabbswitch. They take over a network card and expose virtio devicess of it to containers.

This design, from security point of view is identical to vhost-net. In case of major problem, vhost-net leaks/exposes kernel. In case of vhost-user-net problem, the leak/rce is from front-end process. This is identical security model from our point of view. The disadvantage is that in my opinion, while front-end is a userspace process, it's less mature than linux kernel see cve. In other words:

vhost-user advantage -> front-end is a userspace process (this is not important to us, data leak from kernel is as bad as data leak from another tenant in our case)
vhost-user disadvantage -> front-end is likely way more buggy than kernel

Additionally, allocating a dedicated network card (even virtual) and a CPU for a cpu-spinning userspace front-end process is a no-go for us.

I would argue these resource requirements severely limits the pracitcal deployments of any vhost-user stack. We just don't have the spare NIC and CPU, and I would argue most firecracker users don't have this resources as well.

DemiMarie commented 10 months ago

@majek: Would virtio-net + multiqueue be fast enough for your worklaods? Could io_uring be used to accelerate the tap device I/O, or would that not be sufficient?

@kalyazin: Could the security concerns of vhost-net be alleviated by rewriting the kernel module in Rust? That would cause the parsing code to be in a memory-safe language and likely significantly reduce attack surface.

majek commented 10 months ago

@DemiMarie I would need to spend some time to reproduce the benchmarks, but from what I remember, roughly:

vanilla firecracker: 500Mbps of UDP traffic max
firecracker + vhost-net: 3-6gbps UDP (single queue)
firecracker + vhost-net+ multiqueue: > 6gbps, depending on specifics.

The benchmark we cared about was XDP + bpf_redirect inside the firecracker guest, so pretty much as simple as it gets for mirroring traffic. Surprisingly, this is actually quite close to the production use case we have. (not a theoretical benchmark)

Here, in this ticket, we are only discussing firecracker + vhost-net. Discussion about multiqueue is for another time. I would say that vanilla firecracker is good for simple TCP applications and lightweight workloads. firecracker + vhost-net is needed for UDP or heavier workloads. vhost-net + multiqueue is getting into "dedicated container for a machine" territory, (doing 10gbps+ on a single container). Let's not disucss MQ here. Let's focus on building a strong case for firecracker + vhost-net. I think the 3-6Gbps UDP range is really a solid argument for considering it.

DemiMarie commented 10 months ago

@majek The reason I mention multiqueue is that it doesn’t have the same security concerns that vhost-net does. Therefore, unless I am missing something, multiqueue should be the first step, with vhost-net only being used if virtio-net + multiqueue is not fast enough. Part of the case for vhost-net is “the needed performance cannot be achieved any other way” — if it can be achieved with virtio-net + mutliqueue, then vhost-net would not be needed.

majek commented 10 months ago

Apologies, I didn't answer the io_uring question: io_uring on tap is not faster than tranditional read/write on tap. See tap blog post.

Multiqueue without vhost-net makes no sense to me. Getting multiqueue into vanilla/classic firecracker read/write tap mode would introduce loads of complexity (for example event loop would need to be threaded) and bring too small benefits. Furthermore, virtio-mmio doesn't do multi-irq, and multitqueue without multi-irq is not worth it. We managed to get MQ vhost-net working and hacked together multi-irq thing. We attempted to start virtio-spec discussion about multi-interrupt for mmio. Again - MQ is another big subject for another ticket.

If there is a takeaway here it's that classic firecracker network is slow for UDP or non-offloaded TCP. The solution is vhost-net which is blazing fast even on single CPU. For super extra 10gbps + networking it's possible to get MQ vhost-net and basically forget about networking speed in firecracker, but this is harder and bigger job. The first step is to get vhost-net into firecracker, even if it's only a third-party firecracker fork. However, such a fork is hard to maintain, since it requires some important changes to firecracker core which change super often, therefore we need to get this PR first https://github.com/firecracker-microvm/firecracker/pull/4312

DemiMarie commented 10 months ago

Why is vhost-net a better choice than XDP?

majek commented 8 months ago

cloud-hypervisor uses vhost-user-net https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/device_model.md#vhost-user-net https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/vhost-user-net-testing.md

DemiMarie commented 8 months ago

@majek: Would virtio-net + multiqueue + XDP be fast enough? Or would it require too many CPU cores to be spent on busy polling?

DemiMarie commented 8 months ago

@kalyazin: What if the kernel virtqueue code was rewritten in Rust?

kalyazin commented 8 months ago

@kalyazin: What if the kernel virtqueue code was rewritten in Rust?

The major problem we have with vhost-net is Firecracker will increase its reliance on Linux kernel code in a multitenant environment. While rewriting the kernel virtqueue code in Rust would make it safer, in the case it fails the blast radius will still span multiple tenants, which is not acceptable for us.

DemiMarie commented 8 months ago

@kalyazin: What if the kernel virtqueue code was rewritten in Rust?

The major problem we have with vhost-net is Firecracker will increase its reliance on Linux kernel code in a multitenant environment. While rewriting the kernel virtqueue code in Rust would make it safer, in the case it fails the blast radius will still span multiple tenants, which is not acceptable for us.

What makes the virtqueue code unacceptable, while KVM and Linux’s packet routing code is acceptable? Those are already exposed. How large is the virtqueue code compared to e.g. KVM’s irqchip and instruction emulation?

For what it’s worth, Qubes OS makes essentially the opposite tradeoff: Qubes OS runs the Xen block device backend in dom0, but does not use dom0 for packet processing at all.

kalyazin commented 8 months ago

@kalyazin: What if the kernel virtqueue code was rewritten in Rust?

The major problem we have with vhost-net is Firecracker will increase its reliance on Linux kernel code in a multitenant environment. While rewriting the kernel virtqueue code in Rust would make it safer, in the case it fails the blast radius will still span multiple tenants, which is not acceptable for us.

What makes the virtqueue code unacceptable, while KVM and Linux’s packet routing code is acceptable? Those are already exposed. How large is the virtqueue code compared to e.g. KVM’s irqchip and instruction emulation?

For what it’s worth, Qubes OS makes essentially the opposite tradeoff: Qubes OS runs the Xen block device backend in dom0, but does not use dom0 for packet processing at all.

While KVM and Linux routing code are dependencies, they are already there. We would not like to add another one (even though a small one), especially where the kernel code would be interacting with an untrusted actor (guest) via a virtqueue directly, without Firecracker mediating it. That is the guidance we received from our security experts.

DemiMarie commented 4 months ago

@kalyazin: What if the kernel virtqueue code was rewritten in Rust?

The major problem we have with vhost-net is Firecracker will increase its reliance on Linux kernel code in a multitenant environment. While rewriting the kernel virtqueue code in Rust would make it safer, in the case it fails the blast radius will still span multiple tenants, which is not acceptable for us.

What makes the virtqueue code unacceptable, while KVM and Linux’s packet routing code is acceptable? Those are already exposed. How large is the virtqueue code compared to e.g. KVM’s irqchip and instruction emulation? For what it’s worth, Qubes OS makes essentially the opposite tradeoff: Qubes OS runs the Xen block device backend in dom0, but does not use dom0 for packet processing at all.

While KVM and Linux routing code are dependencies, they are already there. We would not like to add another one (even though a small one), especially where the kernel code would be interacting with an untrusted actor (guest) via a virtqueue directly, without Firecracker mediating it. That is the guidance we received from our security experts.

Has using AF_XDP on the host been considered? That would be very fast, but requires that Firecracker be privileged on the host. Alternatively, vhost_net could be used by Firecracker for its own networking.

bchalios commented 4 months ago

Hi @DemiMarie, thanks for bringing this up. We haven't considered this (I would need to do some reading on it). One first thing to notice is that running privileged Firecracker processes is not something we consider safe in production environments. In fact, we are strongly suggesting that users launch VMMs via the jailer process to further isolate them in host user-space

vhost_net could be used by Firecracker for its own networking.

I assume by "its own networking" you mean networking that is not done on behalf of the guest, but handles VMM specific traffic(?) If that is the case, at the moment we don't have such traffic.

In any case, I will go on and close this issue as we don't plan to support vhost-net currently. However, we continue investigating solutions to increase network performance for Firecracker and we are extremely happy to discuss ideas with the community. We are tracking this effort at #4364

firecracker-microvm / firecracker

[Feature Request] Adding vhost-net support to firecracker #3707

problem statement

vhost-net vs existing virtio net device

vhost-net vs vhost-user-net