firecracker-microvm / firecracker

Secure and fast microVMs for serverless computing.
http://firecracker-microvm.io
Apache License 2.0

[Bug] TCP congestion control and rate limiting #3402

Closed dtaht closed 1 year ago

dtaht commented 1 year ago

Describe the bug

Do you have plans to improve your TCP congestion controls, or am I missing something in your rate limiter?

```rust
/// The current implementation does not do any kind of congestion control, expects segments to
/// arrive in order, triggers a retransmission after the first duplicate ACK, and relies on the
/// user to supply an opaque u64 timestamp value when invoking send or receive functionality. The
/// timestamps must be non-decreasing, and are mainly used for retransmission timeouts.
```

https://github.com/firecracker-microvm/firecracker/blob/main/src/dumbo/src/tcp/connection.rs#L148
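
For a concrete feel of the simplification being described, here is an illustrative check of that retransmission rule (hypothetical names, not dumbo's actual code), where the very first duplicate ACK already triggers a retransmission rather than the classic three-duplicate-ACK threshold:

```rust
// Illustrative only (hypothetical names, not dumbo's actual code): the
// behavior quoted above boils down to a check like this in the ACK
// processing path, retransmitting on the first duplicate ACK instead of
// waiting for three duplicates as classic fast retransmit does.
struct SendState {
    highest_ack: u32,
    dup_ack_seen: bool,
}

impl SendState {
    /// Returns true if the incoming ACK should trigger a retransmission.
    fn on_ack(&mut self, ack: u32) -> bool {
        if ack == self.highest_ack {
            // Duplicate ACK: retransmit on the first one only.
            let retransmit = !self.dup_ack_seen;
            self.dup_ack_seen = true;
            retransmit
        } else {
            // New data acknowledged: reset the duplicate-ACK tracking.
            self.highest_ack = ack;
            self.dup_ack_seen = false;
            false
        }
    }
}

fn main() {
    let mut s = SendState { highest_ack: 100, dup_ack_seen: false };
    assert!(!s.on_ack(200)); // new data acknowledged
    assert!(s.on_ack(200));  // first duplicate ACK -> retransmit immediately
    assert!(!s.on_ack(200)); // further duplicates do not retransmit again here
}
```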

To Reproduce

Use the internet outside of the data center, take some packet captures.

Expected behaviour

I expected at the very least something compliant with TCP Reno.

Environment

Anywhere.

Additional context

[Author TODO: How has this bug affected you?]

I have spent the last decade of my life trying to keep the internet from collapsing, with algorithms like fq_codel, cake, etc., to get things back on a more even keel, as well as on improving the Linux TCP stack's congestion control mechanisms.

[Author TODO: What are you trying to achieve?]

Not having to worry about that anymore.

[Author TODO: Do you have any idea of what the solution might be?]

Use some form of congestion control (Reno, CUBIC, BBR, pick one?)

roypat commented 1 year ago

Hi Dave,

thanks for opening this issue!

Could you elaborate a bit more on why you think dumbo needs to implement TCP congestion controls? None of dumbo's TCP frames ever leave the firecracker process, as it is only used for communication between the guest and the MMDS store. This means that traffic sent from the guest through the virtio net device is inspected to see if its destination IP address is the configured MMDS IP, in which case the IP frames are processed by dumbo's TCP/IP stack. Dumbo's processing then assembles them into HTTP requests, which are passed to the in-process MMDS HTTP server. Since we control source, destination, and everything in between, we are able to make significant simplifications to our TCP/IP stack that would not work if it were used to actually interact with a network.
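
To make that dispatch concrete, here is a rough sketch of the decision described above (hypothetical types and names, not Firecracker's actual code): frames coming out of the guest's virtio-net queue are either handed to the in-process MMDS stack or passed through to the regular network backend, purely based on their destination IPv4 address.

```rust
// Rough sketch of the dispatch described above (hypothetical types and names,
// not Firecracker's actual code).
use std::net::Ipv4Addr;

enum Destination {
    Mmds,    // processed in-process by dumbo and the MMDS HTTP server
    Network, // written out to the tap device / host network as usual
}

fn classify_frame(dst_ip: Ipv4Addr, mmds_ip: Ipv4Addr) -> Destination {
    if dst_ip == mmds_ip {
        Destination::Mmds
    } else {
        Destination::Network
    }
}

fn main() {
    // 169.254.169.254 is just the conventional link-local metadata address;
    // the actual MMDS IP is whatever the VMM was configured with.
    let mmds_ip = Ipv4Addr::new(169, 254, 169, 254);
    assert!(matches!(classify_frame(mmds_ip, mmds_ip), Destination::Mmds));
    assert!(matches!(
        classify_frame(Ipv4Addr::new(8, 8, 8, 8), mmds_ip),
        Destination::Network
    ));
}
```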

dtaht commented 1 year ago

I am comforted. :) (seriously) It would have helped had the comment explained that.

So the in-process MMDS HTTP server is relying on the host TCP (or QUIC) stack for actual network communications?

I suppose allowing better burstiness (e.g., IWXXX) and avoiding a TCP-like negotiation entirely (shared memory?) would be a good thing as an IPC mechanism in this case.

It looks like there is a possibility of loss, but you just retransmit at the full rate?

There is no need to have the sender's rate slow down to match the actual capacity of the outside container? (or how does it do that? flow control? RFC 3168 or L4S-style ECN? The rate limiter?)

No need for a TCP_NOTSENT_LOWAT-like mechanism in-process?

I had wanted to look over your "rate limiter" next... but my Rust is close to non-existent, as you can tell. thx for replying to grumpy ole me...

roypat commented 1 year ago

I am comforted. :) (seriously) It would have helped had the comment explained that.

Mh, yeah, you are right, that comment should probably link to our MMDS documentation, I'll make a note to add that!

So the in-process MMDS HTTP server is relying on the host TCP (or QUIC) stack for actual network communications?

Yes, configuring MMDS (which is done from outside the microVM) happens through the api_server crate, which uses standard UNIX facilities for networking.

I suppose allowing better burstiness (e.g., IWXXX) and avoiding a TCP-like negotiation entirely (shared memory?) would be a good thing as an IPC mechanism in this case.

It looks like there is a possibility of loss, but you just retransmit at the full rate?

There is no need to have the sender's rate slow down to match the actual capacity of the outside container? (or how does it do that? flow control? RFC 3168 or L4S-style ECN? The rate limiter?)

No need for a TCP_NOTSENT_LOWAT-like mechanism in-process?

MMDS is merely supposed to allow the guest to request some metadata about the microVM it is running in. Having it be an HTTP server is for historical reasons, I think. The guest can only ever send GET requests, which are handled completely by the MMDS server. We don't need to match capacity to anything outside the microVM. And since client and server are incredibly tightly coupled here, we don't need much of the sophistication TCP offers.
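
For illustration, a minimal sketch of what such a guest-side request could look like: a plain HTTP GET over TCP to the MMDS address, answered entirely inside the Firecracker process. The address 169.254.169.254 and the path used here are assumptions; the real values depend on how MMDS was configured for the particular microVM.

```rust
// Minimal, illustrative guest-side sketch: a plain HTTP GET over TCP to the
// MMDS address. The address and path below are assumptions, not fixed values.
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    let mut stream = TcpStream::connect("169.254.169.254:80")?;
    stream.write_all(
        b"GET /latest/meta-data HTTP/1.1\r\n\
          Host: 169.254.169.254\r\n\
          Connection: close\r\n\r\n",
    )?;

    // Read until the server closes the connection (hence `Connection: close`).
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    println!("{response}");
    Ok(())
}
```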

I had wanted to look over your "rate limiter" next... but my Rust is close to non-existent, as you can tell. thx for replying to grumpy ole me...

The rate limiter is only there to prevent the "noisy neighbor" effect when running multiple microVMs on the same physical host. So whenever something "leaves" a microVM, e.g. in the form of disk I/O, we rate limit it. MMDS traffic is actually exempted from it (since that traffic never leaves the VM, and we have other mechanisms for ensuring the VM stays within reasonable CPU/RAM limits).
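
As a general illustration of the technique (a generic token-bucket sketch, not Firecracker's actual RateLimiter implementation or API): each I/O operation has to take tokens from a bucket that refills at a fixed rate, which caps a single microVM's sustained throughput while still allowing short bursts.

```rust
// Generic token-bucket sketch of the rate-limiting idea described above
// (illustrative only, not Firecracker's actual implementation).
use std::time::Instant;

struct TokenBucket {
    capacity: u64,       // maximum burst size, in tokens (e.g. bytes)
    tokens: f64,         // currently available tokens
    refill_per_sec: f64, // sustained rate, in tokens per second
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: u64, refill_per_sec: f64) -> Self {
        TokenBucket {
            capacity,
            tokens: capacity as f64,
            refill_per_sec,
            last_refill: Instant::now(),
        }
    }

    /// Try to consume `amount` tokens; returns false if the operation should
    /// be throttled until the bucket has refilled enough.
    fn try_consume(&mut self, amount: u64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens =
            (self.tokens + elapsed * self.refill_per_sec).min(self.capacity as f64);
        self.last_refill = now;

        if self.tokens >= amount as f64 {
            self.tokens -= amount as f64;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Allow bursts of up to 1 MiB and a sustained rate of 256 KiB/s.
    let mut bucket = TokenBucket::new(1 << 20, 256.0 * 1024.0);
    assert!(bucket.try_consume(512 * 1024));  // within the burst allowance
    assert!(bucket.try_consume(512 * 1024));  // drains the rest of the burst
    assert!(!bucket.try_consume(512 * 1024)); // throttled until refill
}
```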

dtaht commented 1 year ago

Thank you very much for clearing much of that up. I ALSO dream of all these microservices adding features to their actual -> proxy connections, and the proxies themselves, starting to monitor TCP_INFO better all the way to the end users on the other sides of the internet. There be dragons there, and finally good tools for tracking that are appearing... and the news from SamKnows, etc., is all bad.

Random plug on sampling TCP to the net better, leveraging Kathie Nichols' pping tool:

https://github.com/thebracket/cpumap-pping#cpumap-pping