LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0

top level shapers and CPU affinity? #127

Closed syadnom closed 1 year ago

syadnom commented 2 years ago

How does CPU affinity work with a top level shaper?

1Gbps fiber, a handful of backhauls going out to 16 sites. A hundred or so APs, and a few hundred customers.

4 core i5-7500 @3.4Ghz.

Is a top level shaper going to get 'stuck' on one of the cores on the CPU?

I would be using a 'complex' tree, ie 1Gbps fiber -> backhaul -> site+APs+Customers-> backhaul -> site+APs+Customers

rchac commented 2 years ago

Top level nodes are bound to a single CPU core. So for each top level node you can expect this much throughput:

    CAKE: 2.5 to 3 Gbps per top level parent node
    fq_codel: ~4 Gbps per top level parent node

For your example it will scale up to at least 2.5 Gbps before you need to break the top level node out into multiple top level nodes. At that point, let's say you're at 3 Gbps fiber and you want to be certain you never exceed that limit, you could do 1.5 Gbps for each of two top level nodes, or a different split depending on which top level nodes need more load.
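(For illustration, a minimal sketch of what that kind of split might look like in LibreQoS's network.json, generated from Python. The node names are made up, and the field names are assumptions based on current LibreQoS examples rather than anything stated in this thread.)

```python
import json

# Hypothetical split of a 3 Gbps uplink into two top level nodes, each
# capped at 1.5 Gbps so their sum can never exceed the fiber rate.
# LibreQoS binds each top level node to its own CPU core.
network = {
    "Backhauls_A": {
        "downloadBandwidthMbps": 1500,
        "uploadBandwidthMbps": 1500,
        "children": {},  # sites, APs, and customer nodes hang off each half
    },
    "Backhauls_B": {
        "downloadBandwidthMbps": 1500,
        "uploadBandwidthMbps": 1500,
        "children": {},
    },
}

with open("network.json", "w") as f:
    json.dump(network, f, indent=2)
```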

syadnom commented 2 years ago

ah, ok. All starting to make sense. i.e., get the fastest MHz in a core you can; favor individual core performance over core count.

And to get around that top level shaper bottleneck, basically just logically divide the data up. A 10Gbps uplink and 5 backhauls would get 5 separate top level shapers at ~2Gbps each, and 5 real CPU cores to match.

That works, I just need to break the gigabit barrier to offer those plans. i.e., if I just assume 15Mbps average per client (aiming way high), then my shaper needs to be max plan speed + (client count * 15 Mbps) so I can still get full-speed tests across the shaper.

Am I making sense?

rchac commented 2 years ago

    ah, ok. All starting to make sense. i.e., get the fastest MHz in a core you can; favor individual core performance over core count.

    And to get around that top level shaper bottleneck, basically just logically divide the data up. A 10Gbps uplink and 5 backhauls would get 5 separate top level shapers at ~2Gbps each, and 5 real CPU cores to match.

Exactly right.

    That works, I just need to break the gigabit barrier to offer those plans. i.e., if I just assume 15Mbps average per client (aiming way high), then my shaper needs to be max plan speed + (client count * 15 Mbps) so I can still get full-speed tests across the shaper.

    Am I making sense?

That makes sense. I'd say you could be as aggressive as max plan speed + (client count * 6 Mbps). That's what my network uses, and it's worked well with ~400 subs. Half of our subs are on 200 Mbps plans and a handful are on 500 Mbps plans. With that sort of headroom on the shaper, clients will likely see the right bandwidth test results the vast majority of the time, with no complaints.
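(As a quick sanity check of that rule of thumb, a tiny sketch with this thread's rough numbers plugged in; the function name is just illustrative.)

```python
def shaper_capacity_mbps(max_plan_mbps: int, client_count: int,
                         avg_per_client_mbps: int = 6) -> int:
    """Rule of thumb from this thread: fastest plan plus average headroom per sub."""
    return max_plan_mbps + client_count * avg_per_client_mbps

# ~400 subs with a 500 Mbps top plan, at 6 Mbps average per sub:
print(shaper_capacity_mbps(500, 400))       # 2900 -> size the node for ~3 Gbps
# The deliberately high 15 Mbps-per-sub assumption from earlier in the thread:
print(shaper_capacity_mbps(500, 400, 15))   # 6500
```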

syadnom commented 2 years ago

yeah, we're just moving up to dramatically faster plans delivered on Wave, cnWave, and soon ePMP 4600. We've steadily moved up from 2Mbps to about 5Mbps average over the last 2 years, so we could likely get away with 6Mbps today.

dtaht commented 2 years ago

I would in general also favor lots of cache. I'd really like to know how an Intel box with tons of cache compares to an AMD box. Intel used to be at least much better at doing DMA to CPU cache than AMD was. And after all this talk of gbits, seeing how close to 100gbit we could get with, say, 64 cores... or a smarter network card...

the future seems so bright, and I can't find my shades.

dtaht commented 2 years ago
    CAKE: 2.5 to 3 Gbps per top level parent node
    fq_codel: ~4 Gbps per top level parent node

pedant mode

This claim makes me nervous. A single-flow test is one thing; a test with dozens, or thousands, of flows in a mix of packet sizes is another. Especially with FQ, as it stresses out the TCAM (especially if you don't have TCAM!). In general we see a falloff in how both fq_codel and cake perform as you add flows. So I would like to see more exhaustive testing of more realistic-looking traffic before declaring these numbers to be good guidelines. Try a flent 1000-flow test, or rtt_fair_var against as many destinations as you can muster, or...

Was it fq_codel quantum 1514?

I am also curious whether gso-splitting was on in the cake test?

And someday when all the excitement dies down a little, would love to profile what the real bottlenecks are in a test like this.

end pedant mode
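(A sketch of the kind of flent runs being suggested, driven from Python here since the thread has no shell snippets; the host names are placeholders.)

```python
import subprocess

# Many concurrent flows through the shaper: tcp_nup with a large
# upload_streams test parameter approximates a 1000-flow test.
subprocess.run(["flent", "tcp_nup", "-H", "netperf.example.net",
                "--test-parameter", "upload_streams=1000",
                "-t", "1000-flow-shaper-test"], check=True)

# rtt_fair_var measures fairness across several destinations at once.
subprocess.run(["flent", "rtt_fair_var",
                "-H", "host-a.example.net", "-H", "host-b.example.net",
                "-H", "host-c.example.net"], check=True)
```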

interduo commented 2 years ago

@dtaht

    Was it fq_codel quantum 1514?

What is the recommended quantum setting?

    I am also curious whether gso-splitting was on in the cake test?

What is the recommended gso-splitting setting for cake?

dtaht commented 2 years ago

fq_codel quantum 300 works best in practice below 200Mbit, as 300 bytes is roughly the average packet size on the internet. At higher rates the quantum should be set to the MTU of the link, which is usually 1514.

Cake autoscales this when you feed it the bandwidth parameter. There's a goof in that there's no way to set the quantum at the command line, so it's always 1514 in that case.

In general we have found gso-splitting to be very useful at all rates for better FQ'ing of traffic, and it's one of cake's big advantages over fq_codel that it does this. It also makes the AQM work better. It can also be very high overhead when transiting the routing portion of the stack (e.g. on ingress). I would measure, but be reluctant to ever turn it off.
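(A minimal sketch of how those recommendations map onto tc, driven from Python; the interface name is hypothetical and the 150mbit figure is just an example rate.)

```python
import subprocess

IFACE = "eth0"  # hypothetical shaping interface

def tc(args: str) -> None:
    subprocess.run(["tc"] + args.split(), check=True)

# Below ~200 Mbit: quantum 300 roughly matches the average internet packet size.
tc(f"qdisc replace dev {IFACE} root fq_codel quantum 300")

# At higher rates: set the quantum to the link MTU (usually 1514).
tc(f"qdisc replace dev {IFACE} root fq_codel quantum 1514")

# CAKE scales its own quantum when given the bandwidth parameter;
# split-gso (the default) keeps GSO superpackets from defeating FQ and the AQM.
tc(f"qdisc replace dev {IFACE} root cake bandwidth 150mbit split-gso")
```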

interduo commented 2 years ago

Does LibreQoS use those two recommended settings?

rchac commented 2 years ago
    CAKE: 2.5 to 3 Gbps per top level parent node
    fq_codel: ~4 Gbps per top level parent node

    pedant mode

    This claim makes me nervous. A single-flow test is one thing; a test with dozens, or thousands, of flows in a mix of packet sizes is another. Especially with FQ, as it stresses out the TCAM (especially if you don't have TCAM!). In general we see a falloff in how both fq_codel and cake perform as you add flows. So I would like to see more exhaustive testing of more realistic-looking traffic before declaring these numbers to be good guidelines. Try a flent 1000-flow test, or rtt_fair_var against as many destinations as you can muster, or...

Fair point! The CAKE figures are based on my current load on my real network, but I forgot I'm using a measly Ryzen 5 3600 CPU.

On my network right now, one top level node / CPU core during peak hours is showing 26% utilization with 700Mbps of throughput on that node. I had estimated 2600 Mbps max on that core with CAKE. But you're right, it's not necessarily linear, so it may end up being a bit more or less. And this is with a cheap AMD Ryzen 5 3600 CPU. With a modestly priced AMD Ryzen 9 7900X that could be doubled, based on its two-fold single-thread performance score. So 4Gbps with CAKE is more likely than I suggested, sorry. =) The higher fq_codel figure was based on iperf tests done before deploying the box in production, but you are totally right that that doesn't correspond well to real-world traffic.
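(The 2600 Mbps figure is just a linear extrapolation of that observation, which, as the comment notes, may not hold as load rises.)

```python
# Naive linear extrapolation of per-core CAKE capacity from observed load.
observed_mbps = 700
core_utilization = 0.26
print(observed_mbps / core_utilization)  # ~2692 Mbps, if scaling were perfectly linear
```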

rchac commented 1 year ago

With the LibreQoS performance improvements of the last few months (10Gbps of CAKE-shaped traffic for a single subscriber on an E-2378G), this is probably no longer a practical concern for high-performance CPUs released within the last 4 years. We now have some basic documentation regarding top-end throughput here.