amzn / amzn-drivers

Official AWS drivers repository for Elastic Network Adapter (ENA) and Elastic Fabric Adapter (EFA)

BQL off by default #262

Closed vladum closed 1 year ago

vladum commented 1 year ago

Hi!

This is not necessarily an issue, but a question. Apologies if I should post in a different place.

Is there a particular reason for keeping BQL off by default (enable_bql = 0) in this Linux driver? Most Linux distributions (the systemd-based ones?) ship with fq_codel as the default qdisc, which relies on BQL [1].

Are there any trade-offs to be considered when enabling it?

Thanks!

davidarinzon commented 1 year ago

Hi @vladum,

The ENA device has its own proprietary network flow pacing, which can delay completions to the host for optimal performance. BQL might cause head-of-line blocking, which could add unnecessary delays in some cases. Therefore, BQL is currently disabled by default. You may still enable it via the module parameter.
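For anyone landing here later, a minimal sketch of toggling it, assuming enable_bql is exported readable under /sys/module/ena/parameters/ and eth0 is the ENA interface (reloading the module briefly drops connectivity, so don't do this over the interface you're logged in on):

```sh
# Check the current setting (0 = BQL disabled, the default discussed above)
cat /sys/module/ena/parameters/enable_bql

# Reload the driver with BQL enabled (drops the interface momentarily)
sudo rmmod ena
sudo modprobe ena enable_bql=1

# Persist the choice across reboots
echo "options ena enable_bql=1" | sudo tee /etc/modprobe.d/ena-bql.conf
```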

dtaht commented 1 year ago

I am always big on benchmarks. The very minimal amount of blocking BQL does punts smarter decision-making up the stack, where fq_codel/cake/etc. can make different decisions, such as dropping packets, to hold queuing latencies short. It remains entirely possible that whatever your offload does is better than this, but I am a big believer in benchmarks to prove it, or not. A pretty good test series is the flent rrul test, with BQL on or off and fq_codel on or off, with the --step-size=.05 --socket-stats options enabled, to see into the status of the TCP flow itself.
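A sketch of such a run, under the assumption that netserver is running on $SERVER and that a 60-second test is long enough to fill the queues (both are placeholders):

```sh
# Repeat once per combination of BQL on/off and fq_codel on/off,
# changing the title so the result files stay distinguishable.
flent rrul --step-size=.05 --socket-stats -l 60 \
      -H "$SERVER" -t "bql-on_fq_codel-on" -o bql-on_fq_codel-on.png
```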

Another test is to observe whether fq_codel is having any effect at all (tc -s qdisc show) after some heavy set of loads, say 32 or more flows; TSQ keeps things under control until about 15. No reschedules, drops, or ECN marks would indicate you had bypassed that portion of the stack entirely.
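For example (eth0 is a placeholder):

```sh
# After a heavy load (32+ concurrent flows), inspect the qdisc counters.
tc -s qdisc show dev eth0
# On the fq_codel line, new_flow_count, ecn_mark and drop counters all
# sitting at zero after such a load suggests the qdisc never saw a
# backlog, i.e. that portion of the stack was bypassed.
```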

Does the ENA device have any FQ or AQM queue management of its own? I know some are using cake down there...

vladum commented 1 year ago

Given this, would mq + noqueue make sense in the guest, instead of keeping the default fq_codel?
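For reference, the setup I have in mind would be something like the following, assuming eth0 and a kernel new enough (roughly 4.3+) to accept noqueue as the default qdisc:

```sh
# Make noqueue the default child qdisc, then re-install mq so each
# hardware tx queue picks it up.
sudo sysctl -w net.core.default_qdisc=noqueue
sudo tc qdisc replace dev eth0 root handle 1: mq
tc qdisc show dev eth0   # each tx queue should now report noqueue
```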

I was going to ask about FQ as well, especially since I have to bundle flows into an encrypted tunnel. Some tc filter magic would probably have allowed me to work around that if the queue were controlled by the qdisc, but I'm not sure how to handle this situation with the device controlling the queue.

dtaht commented 1 year ago

I don't really know. Please measure? Some techniques outlined here: https://blog.cerowrt.org/post/flaws_in_flent/

I have, in general, made the assumption that the underlying substrates the cloudy providers were building for multi-tenant VMs took into account sane means of providing sufficient backpressure, so the overlying host can do smart things about multiplexing flows better. We put fq_codel into OpenStack pretty early. I know Google did the right things here. Apparently, Microsoft didn't, and finally noticed recently (https://github.com/dankamongmen/omphalos/issues/69#issuecomment-1411432454).

The container revolution was contained by fq_codel running on the egress interface + BQL. The virtio device, however, now so commonly used, does not have BQL, and I can demonstrate that misbehaving pretty easily. All in all, far, far too few people take packet captures and plot RTTs on the path.
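One way to see whether a given driver feeds BQL at all is to watch the per-queue byte_queue_limits counters in sysfs (eth0 is a placeholder; the directory exists even when the driver never reports completions):

```sh
grep . /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
# Under sustained load, "inflight" stuck at 0 means the driver never
# calls the BQL completion hooks, i.e. BQL is effectively off here.
```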

PS: an irony for me is that the default codel target (5 ms) and interval (100 ms) are tuned for internet RTTs. In the datacenter, on offloaded hardware, I have been successfully running cake at a 50 µs target with a 1 ms interval, using RFC3168-style ECN, at no cost in throughput, for ages.
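Cake derives its target from an rtt parameter, but the equivalent tuning is easy to express with fq_codel's explicit knobs (eth0 is a placeholder):

```sh
sudo tc qdisc replace dev eth0 root fq_codel target 50us interval 1ms ecn
```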

To TRY to answer your question better: fq_codel, IF it is indeed the bottleneck link, automatically keeps the hash from an encrypted tunnel and FQs the result. Let me go give an example of how wonderful that is...

dtaht commented 1 year ago

This is what happens to voip traffic in your typical ipsec tunnel, where it "rides the sawtooth".

[image: VoIP latency riding the sawtooth inside an IPsec tunnel]

If, on the other hand, fq_codel or cake is the bottleneck, and the VPN is terminated there:

[image: the same VoIP flow with fq_codel/cake at the bottleneck, VPN terminated there]

I have been known to just slap cake bandwidth X onto a virtual machine just to make sure any additional silliness doesn't come from the underlying substrate.
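Something like this, where the interface name and the bandwidth figure are placeholders to be set just below what the substrate actually delivers:

```sh
sudo tc qdisc replace dev eth0 root cake bandwidth 5gbit
tc -s qdisc show dev eth0   # watch the drop/mark counters move
```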

davidarinzon commented 1 year ago

Discussed with AWS support