LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0

mikrotik tree #120

Closed dtaht closed 3 months ago

dtaht commented 1 year ago

As near as I can tell, most of libreqos, targeting linux htb + cake, is actually also exportable in a mikrotik format, just using their configuration keywords to generate a conf file. No xdp, obviously, but...

interduo commented 1 year ago

Is Mikrotik affected by single-lock issue?

syadnom commented 1 year ago

I'm not seeing many problems other than even the best CPU being fairly slow for shaping. They're going for a scale-out approach with their Annapurna Labs network CPUs. I have CCR2116s in production and they are a beast of a router, but an i5-2400 quad core from the stone age can handle more bits for traffic shaping.

For small operators it would be pretty neat to run libreqos as an OOB service, pushing configs to mikrotik via API. Especially if you were virtualizing your libreqos off-prem. Great for small operators in bandwidth deprived areas that may not have the resources to put in a PC or nice NICs etc etc. A hAP ac2 can handle a couple hundred Mbps in cake, hAP ax2 nearly double that.
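A push-to-Mikrotik exporter could start as simple templating of RouterOS commands from the subscriber plan data. A minimal sketch, where the customer names, addresses, and rates are illustrative, and queue=cake/cake assumes queue types named "cake" have already been created on the router:

```shell
#!/bin/sh
# Hypothetical sketch: emit RouterOS simple-queue commands from plan data.
# max-limit is upload/download from the target's perspective; the
# queue=cake/cake pair assumes cake queue types already exist on the router.
emit_queue() {
  name=$1; target=$2; down=$3; up=$4
  echo "/queue simple add name=$name target=$target max-limit=$up/$down queue=cake/cake"
}

emit_queue cust-001 100.64.0.10/32 100M 20M
emit_queue cust-002 100.64.0.11/32 50M 10M
```

The emitted lines could then be pasted into a RouterOS terminal or pushed over the API.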

rchac commented 1 year ago

Could be neat. Is there any way to pull real qdisc stats out of mikrotik? That's how we collect bandwidth stats at the moment.

syadnom commented 1 year ago

here's a little queue tree on routeros v7.5 with fqcodel. Not a ton of info: bytes, dropped packets, and queued packets are available.


dtaht commented 1 year ago

I am far from convinced they are actually collecting drop stats from fqcodel or cake at all. Can you saturate a link (a udp flood will do, so would a ping -f) and see if you get anything?
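From a Linux host, either flood is easy to generate (the destination address is a placeholder, and the iperf3 form assumes a server listening on the far side):

```shell
# flood ping with near-MTU payloads toward a host behind the shaper (needs root)
ping -f -s 1400 192.0.2.1
# or an unlimited-rate UDP flood against an iperf3 server behind the shaper
iperf3 -u -b 0 -c 192.0.2.1
```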

syadnom commented 1 year ago

fqcodel and cake do not populate the dropped counter; fifo queues do.

Image is both routers (mikrotik rb5009) across a 10G SFP+ port with a 5x5 cake shaper and both sides running a UDP bandwidth test targeting each other across the shaper. Shows queued packets but I can't convince it to drop anything.


syadnom commented 1 year ago

same test with fifo. So this confirms, at least on 7.5, that drops are not being tracked in an accessible way on fqcodel or cake. It DOES work for every other queue type: fifo, sfq, red.


dtaht commented 1 year ago

I don't have a relationship with mikrotik, can you bug report this? Also, request ecn marks? seeing ever more of those...

While I'm making feature requests in the wrong forum... it would be so great if we could do inbound shaping with pure cake on what I think is called the "interface queue". It's only 4 lines of code to do this in sqm....

syadnom commented 1 year ago

bug report submitted.

interface queues work in mikrotik. so do bridge shapers if you turn on 'ip firewall' on a bridge.

I altered that last test to show this. I also set it to 5x11 just so you can see the difference present in the throughput:


syadnom commented 1 year ago

Test on TCP but just one test running, just because it's easier to read than my convoluted dual bandwidth test.

syadnom commented 1 year ago

figured since I'm running tests, might as well show rb5009 potential here.

UDP one way, cake basically maxed, with the mikrotik bandwidth test doing the work. There's a little more to be extracted here if running iperf on a separate box. Note, this is using all CPU cores, so anything that locks things to a single core will be ~1/4 of these. ~2.1Gbps

TCP tests about 3.8Gbps. Same stipulations as above. I think the TCP test itself on this particular hardware runs better than the UDP test. I suspect iperf would correct the UDP/TCP discrepancy.

This is a Marvell Armada CPU (4 core 1.4Ghz). The Annapurna models in the CCR2xxx series are even faster (16 core 2Ghz) and I believe better IPC.

queue tree cannot do interface matching, it hangs off the global matcher (interface agnostic) and relies on packet marks on child queues.

dtaht commented 1 year ago

I am not sure if we are talking past each other or not. The simple queues feature leverages tbf + cake or fq_codel to do the shaping inbound or outbound or both. An interface queue can shape outbound only (via cake's bandwidth parameter), but not inbound. In general, I prefer a world where the cpe shapes outbound -> ISP and ISP shapes inbound -> CPE as that does the bottleneck detection and smartest drop, ack-drop, and rescheduling possible.

However historically, since the ISPs were slow to move we saw the rise of middleboxes like preseem and now libreqos - doing it both ways, and also individuals doing it on their own routers, shaping inbound as well. It's only four commands in linux

ip link add name SQM_IFB_050ec type ifb
tc qdisc add dev $IFACE handle ffff: ingress
tc qdisc replace dev SQM_IFB_050ec root cake bandwidth whatever
tc filter add dev $IFACE parent ffff: protocol all prio 10 u32 \
        match u32 0 0 flowid 1:1 action mirred egress redirect dev SQM_IFB_050ec
ip link set dev SQM_IFB_050ec up

The deficit style shaper in cake we use is more efficient than tbf in most circumstances, and most importantly never bursts. So if somehow mikrotik could support this in the interface queues, it would be a win. As it stands, I'm fairly content to just shape at the cpe outbound using the bandwidth parameter.
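Side by side, the two shaping styles look like this on Linux (interface and rates are placeholders):

```shell
# token bucket: a rate plus an explicit burst allowance that can
# release queued packets at line rate
tc qdisc replace dev eth0 root tbf rate 100mbit burst 32k latency 50ms
# cake: deficit-mode shaper, no burst parameter, never bursts
tc qdisc replace dev eth0 root cake bandwidth 100mbit
```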

cake is also intensely programmable via tc filters also.

dtaht commented 1 year ago

The rb5009 looks really attractive! the ccr2xxx even more so... Btw, I am unsure that a single iperf flow on a single box can crack 4Gbit in the first place, due to running out of local buffer space on either tx or more likely rx. Hit it with 4 or more?

A fifo with no shaping can do what on this hardware? An interface queue of fq_codel? Cake with no shaping? Cake shaped to 2Gbit? Cake shaped via the token bucket?

I of course am a really big fan of flent, especially the rrul test - which uses netperf. Shoulda ported it to iperf, too.

token buckets do have the advantage of an easy offload to hardware which I think mikrotik is doing in many cases.

dtaht commented 1 year ago

(and thx very much for filing the bug report! Does it have a number? I can go be a pest elsewhere...)

syadnom commented 1 year ago

mikrotik report SUP-94551

for the 'interface' shaping on mikrotik, that presents as a bi-directional shaper so must not be 'interface' in the same context.

pfifo with a 500 packet buffer can do 4.3Gbps UDP one way, 4.8Gbps TCP one way. ~97% CPU.

Cake with no limits: about 4.4Gbps. Cake at 2Gbps: ~70% CPU. A top-level fifo 'unlimited' feeding any child cake shaper: same results for the single stream test. The fifo wide open with no shaper: TCP 5.1Gbps, UDP 8.7Gbps.

Keep in mind that according to mikrotik's not-so-great task manager, my bandwidth test is using ~6% of the CPU. Also, UDP wide open with the test running at 8.7Gbps is ~80% CPU. There is more headroom here if the bandwidth generator/receiver is taken off-device. Also, mikrotik's bandwidth test is pretty primitive; I wouldn't count on these numbers with any precision. That said, for a small 1-2Gbps aggregate operator, this hardware and a well designed queue tree is pretty legit.

dtaht commented 1 year ago

I poked into the mvpp2 switch driver. No BQL, full support for XDP, strong candidate for openwrt + XDP + BQL (6 lines of new code), might be able to push line rate.

syadnom commented 1 year ago

I don't know that openwrt has been successfully run on these yet, likely just lack of availability from someone who likes jamming openwrt into various hardware.

dtaht commented 1 year ago

There are multiple active efforts over here: https://forum.openwrt.org/t/add-support-for-mikrotik-rb5009ug/104391/760 - people are wrestling with cpu governors and the port to 5.15 presently, but I expect that to get sorted out the more folk leap on it.

SirBryan commented 1 year ago

Coming in here a couple weeks late, but I've been lurking between here and the MikroTik forums while I keep kicking the QoS can down the road. After testing Cake and fq-codel on a few hAP AC2's and AC3's at customer homes, I'm impressed with the results and looking to deploy shaping network-wide.

My LibreQoS box has been in line for a couple of months, my only hold-out being the UISP data being out of sync with reality (lots of MikroTik radios and CPE that I have yet to manually insert). It's twiddling its thumbs waiting for me to make it do something.

But over the weekend, I just upgraded my CCR1036 to 7.6. It handles CGNAT for roughly 500 devices, passing 1-2Gbps all day long and sits upstream of the Libre box. The improvements from 6.47 to 7.6 are enough to drop the CPU load from an average of 2-5% to 0% with the same amount of traffic (2Gbps). That leads me to believe, with its 36 1.4GHz cores, that it could easily handle shaping all of these queues, if we had LibreQoS pushing queue scripts to it instead of running tc locally. (Plus, my Libre box is just a NUC with a Thunderbolt cage for the Intel card...)

Roughly 66% of my customers have routers I've installed that can run RouterOS 7. Similar scripts could be run to deploy shaping on the CPE directly, especially in the upload direction. Even more cool would be polling UISP's radio stats (LTU upload bandwidth in particular) and updating the router's upload max to match.

(Interestingly, we did all this 20 years ago.. it's even patented.)

thebracket commented 1 year ago

Did Mikrotik ever manage to get queue trees out from the giant lock (that kept them firmly stuck on 1 CPU, no matter how many you have)? If that's resolved in 7.x, then feeding topologies into RouterOS shouldn't be too bad. In the 6.x line, a big queue tree was a sure way to bring a router to its knees.

syadnom commented 1 year ago

Did Mikrotik ever manage to get queue trees out from the giant lock (that kept them firmly stuck on 1 CPU, no matter how many you have)? If that's resolved in 7.x, then feeding topologies into RouterOS shouldn't be too bad. In the 6.x line, a big queue tree was a sure way to bring a router to its knees.

Pretty much still stuck there and really... always will be. Super inefficient to migrate data between CPU cores so a top level queue is pretty much stuck this way. This is essentially everyone's problem.

That said, since you do have lots of cores in some of these boxes what might be nice is to create queue trees for each backhaul to at least spread that load out, then maybe monitor these and dynamically update the parent shaper for each backhaul with the intent of keeping the primary uplink from getting congested yet getting the most possible data out. Might even have multiple 'top level' shapers on a backhaul to handle the tree from the secondary hop as well, so long as there's something monitoring and adjusting the main box.

Ultimately, a ridiculously fast CPU and a single top level shaper would be best, but I think we're kinda hitting a single core CPU wall here.


thebracket commented 1 year ago

inefficient to migrate data between CPU cores so a top level queue is pretty much stuck this way. This is essentially everyone's problem.

It really doesn't have to be - see the xdp-cpumap-tc project that powers LibreQos. It basically lets you decide which CPU gets which packet (by IP), and then steers it to the right part of a per-CPU queue tree. Very fast, and pretty flexible for steering your CPU load. :-) Mikrotik could do something like that; it's basically a Linux core, so I don't think it would be a stretch for them to use some eBPF magic. It's all open source, they are welcome to join the party!
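The per-CPU tree shape that xdp-cpumap-tc makes possible can be sketched with plain tc. The interface, handles, and rates below are illustrative, and the XDP program that pins each subscriber IP to a fixed CPU is not shown:

```shell
# one mq root, then an independent HTB tree per hardware queue/CPU,
# so each core only ever touches its own tree and there is no shared lock
tc qdisc replace dev eth0 root handle 7FFF: mq
tc qdisc add dev eth0 parent 7FFF:1 handle 1: htb default 2
tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit
tc qdisc add dev eth0 parent 7FFF:2 handle 2: htb default 2
tc class add dev eth0 parent 2: classid 2:1 htb rate 1gbit
```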

I could definitely see some utility in a system that feeds queue data into Mikrotik routers; that risks neglecting the best part of LibreQoS (the shaper) and mostly using the integration APIs with a different back-end. I keep musing about having Cake queues on downstream routers, helping reduce buffer bloat along the line. (e.g. "Tower X has 65 Mbps of upstream (number chosen at random), use Cake to de-bloat that 65 Mbps and let LibreQoS at the core handle the bigger picture"). I haven't got beyond "I wonder if that would help?"

syadnom commented 1 year ago

xdp-cpumap-tc doesn't eliminate the latency between CPUs and caches. On an intel 11th gen CPU this is just under 30ns more latency per fetch.


thebracket commented 1 year ago

Obviously, but it elides the giant lock that keeps tc from spreading out on its own. A couple of hundred nanoseconds (even 3000 nanoseconds) is a really small price to pay if it then lets you spread the heavy-lifting (HTB, cake, etc.) over a large number of cores. It's all about amortizing your costs.

It's also not 30ns per fetch; the first fetch from another core's cache is slow, but after that the data is almost guaranteed to be in the local core's cache. If you maintain locality from assigning the core (in the XDP program), through the TC bpf program, and then the shaper itself, you do far better than that. Otherwise, I wouldn't be timing the XDP programs under load (just under 5gbit/s) as low as 60 ns (and very occasionally as high as 3000 ns) - including two slow clock reads and a slow text format/kernel debug pipe output to obtain those numbers. Admittedly, there's a ton of work there to read ahead and avoid pointer chasing where possible.

Comparing that cost (and scanning the packet headers on the destination core has a lovely side-effect of pretty much ensuring that it's in L1/2 cache on the correct core when Cake/HTB runs - by reading the packet header on the correct core) versus having to run everything on a single core (while the rest idle), it's pretty obvious which will give you greater overall performance. (Mikrotik are even half way there, letting you pin NIC interrupt queues to cores - which can significantly improve performance if you get it right!)
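As a back-of-envelope check on the amortization argument, assuming full-size 1500-byte packets and taking the 30 ns cross-core figure from earlier in the thread at face value:

```shell
# 1 Gbps of 1500-byte packets, 30 ns cross-core penalty per packet
PPS=$((1000000000 / 8 / 1500))   # ~83,333 packets per second
NS_PER_SEC=$((PPS * 30))         # ~2.5 ms of stall time per second
echo "pps=$PPS stall-ns-per-sec=$NS_PER_SEC"
```

That works out to roughly 0.25% of one core per shaped gigabit, which is the sense in which the cost amortizes.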

syadnom commented 1 year ago

I don't know that 3ms is actually a small price...

And it really can't just be one fetch, because then you've just moved processing over to that core. it's a fetch every single time because you're having to keep data between both CPUs in sync. Further, during that fetch every core involved is at rest.

The point about putting NIC interrupts to specific cores is exactly the point. You get a dramatic increase in performance by not copying data between cores. or a rather dramatic loss if you do...


rchac commented 1 year ago

3000ns is just 0.003ms though. If we compare that to the increased latency a Mikrotik router will introduce to its forwarded traffic if a CPU core gets choked up by queues, that 0.003ms seems negligible, no?

thebracket commented 1 year ago

30 ns - nanoseconds. That's 3e-5 ms - a really small number.

dtaht commented 1 year ago

My ambition was to get the RB5009 up on openwrt, with the BQL patches, and to try xdp-cpumap on that. The testing stalled out inconclusively, but in terms of testing everything but BQL, xdp, cake, shaping etc, a potential revolution is just a reflash away...

https://forum.openwrt.org/t/add-support-for-mikrotik-rb5009ug/104391/812

dtaht commented 1 year ago

@SirBryan

But over the weekend, I just upgraded my CCR1036 to 7.6. It handles CGNAT for roughly 500 devices, passing 1-2Gbps all day long and sits upstream of the Libre box. The improvements from 6.47 to 7.6 are enough to drop the CPU load from an average of 2-5% to 0% with the same amount of traffic (2Gbps).

Impressive.

That leads me to believe, with its 36 1.4GHz cores, that it could easily handle shaping all of these queues, if we had LibreQoS pushing queue scripts to it instead of running tc locally. (Plus, my Libre box is just a NUC with a Thunderbolt cage for the Intel card...)

However queue trees are very cpu intensive. I'd love merely to know what happens if you slam fq_codel (or cake) on each of those interfaces running native, and further, if it has any observable effect.

Roughly 66% of my customers have routers I've installed that can run RouterOS 7. Similar scripts could be run to deploy shaping on the CPE directly, especially in the upload direction.

+10. cake was designed primarily (originally) to run on the interface directly with its own shaper, diffserv, ack-filter, and nat awareness. The edge CPE is the right place to stick it, wherever possible. I urge you to start deploying it.
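On a Linux CPE that deployment can be a one-liner (interface and rate are placeholders; diffserv4, ack-filter, and nat are all standard cake keywords):

```shell
# shape egress at the WAN port: deficit shaper, 4-tin diffserv,
# ack filtering for the slow uplink, NAT-aware flow hashing
tc qdisc replace dev eth0 root cake bandwidth 20mbit diffserv4 ack-filter nat
```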

Even more cool would be polling UISP's radio stats (LTU upload bandwidth in particular) and updating the router's upload max to match.

While this is a good thing, I've punted your requests to the v1.4 release. The v1.3 release hits beta Nov 15th; would it be possible for you to test it on a small portion of your existing network?

syadnom commented 1 year ago

The CCR10xx series has a lot of really poor general purpose CPUs, basically 20 year old PowerPC cores with a network processor on top (Tilera TILE). Great for routing, terrible for shaping. Well under 1Gbps shaping with cake. The hAP AC2 has a vastly superior chip for shaping, at least twice as fast if not more.

The CCR2xxx series are much better: an Annapurna ARM CPU, and often a Marvell hardware routing chip onboard as well.


dtaht commented 1 year ago

I'm all into doing research on the CCR2xxx... in the v1.4 release cycle. If you can spare the cycles and pound it through a few cake benchmarks (interface queues vs shaped queues), in the meantime, that would be great.

But I really, really need two "virgins" to do a test install of 1.3 when it hits beta, and give feedback on the documentation, gui, and actual performance... @syadnom - are you up for that?? Or have you already deployed 1.2 and are following along on the development branches for 1.3? In particular, I would like cpu usage vs bandwidth on this side of things...

https://github.com/thebracket/cpumap-pping/issues/2

syadnom commented 1 year ago

I could probably do that, I have 3 new sites coming online in the coming weeks. I don't however have routers in hand for them yet lol. Or a shaping box setup yet.


SirBryan commented 1 year ago

The CCR1036 is currently pushing 1Gbps. The two SFP+ ports are a bonded pair to a CRS317. Both in- and outbound traffic come in and leave on those via VLANs. I just enabled Cake on the SFP+ ports as the interface queue and load went from 0% to an occasional 1%.


I started deploying 1.1, then dropped it, then just started looking at 1.2. I'm willing to look at 1.3.

I also have 2116's in production that we could test some shaper loading on.

As for deploying Cake on CPE's, MikroTik bridges four ethernet ports and WiFi together. In those cases I've used queue trees to assign the shaper to the bridge, since MikroTik won't assign cake or fq-codel to "virtual" interfaces like bridges, bonds, VLANs, etc. I'll have to experiment more with a shaped version of cake on those types of interfaces.
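For reference, attaching cake as a RouterOS v7 interface queue looks roughly like this; the parameter names follow the v7 queue-type convention but should be verified against your RouterOS version:

```
/queue type add name=cake-up kind=cake cake-bandwidth=20M
/queue interface set ether1 queue=cake-up
```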

syadnom commented 1 year ago

mikrotik will do shapers on a bridge; you have to turn the bridge firewall on.


dtaht commented 1 year ago

But you don't want to do that. You want to shape the wireless interface to its limit and the other ethernets to theirs (or not at all; let 'em run at line rate with fq_codel or cake enabled). In linux at least, I might bridge wifi (wlan0) to eth1,2,3,4

so that creates br0.

and in that case all I then do is

tc qdisc replace dev wlan0 root cake bandwidth 300Mbit (can do inbound too)

And problems solved, notably with multicast.

Most of my wifi runs fq_codel natively also, so I don't have to do even that.
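Spelled out in iproute2 terms (the interface names and 300Mbit figure come from the example above; the bridge commands are a sketch):

```shell
# bridge wifi and the ethernet ports together
ip link add name br0 type bridge
for i in wlan0 eth1 eth2 eth3 eth4; do ip link set "$i" master br0; done
ip link set br0 up
# then shape only the slow member, not the bridge itself
tc qdisc replace dev wlan0 root cake bandwidth 300Mbit
```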

There's no way to do interface queues underneath a bridge in mikrotik?

syadnom commented 1 year ago

There's no way to do interface queues underneath a bridge in mikrotik?

I don't think that works but I can test later.


thebracket commented 1 year ago

I'll see if I can bite the bullet and upgrade a router (one I can easily get to, if needs-be) and get it upgraded to ROS7 and give this a go. I'm a little confused by interface queues; I thought that only queued egress packets? So if I put a Cake interface queue on my upstream/outbound, wouldn't I want to set the bandwidth to my upload - and have another interface queue facing the internal network configured to the download bandwidth? It looks like I can attach an interface queue to a bridge, so presumably the customer-facing bridge would be setup that way.

It'll be a week or two before I can get to this.


dtaht commented 1 year ago

Interface queues only shape egress packets. Currently. I'd like for mikrotik to make it possible to also shape via cake on inbound also, as it's only 4 lines of code.

dtaht commented 1 year ago

I could probably do that, I have 3 new sites coming online in the coming weeks. I don't however have routers in hand for them yet lol. Or a shaping box setup yet.

Just so you can do something with us, during the beta period, would be great, no matter how big, or how small. It's really hard to thoroughly test this stuff, even harder, when you've been heads down in the code, to understand how a user thinks, or their real requirements.

We have a "hold my beer, we got this" moment scheduled with an ISP that is going to attempt moving directly (overnight) from v1.1 to v1.3, Nov 17th and if we can do a few more of those the better the release will be for everyone.

@syadnom @SirBryan similarly, your reviews of the documentation and installation process from a naive standpoint would be good. Anything that you can do to ensure the new codebase operates smoothly and doesn't crash would be good. There's a presently undertested live customer update facility now that is very fast and saves a reload, there's ipv6 support, and there's a bunch of other things awaiting release notes, etc. Good feedback on the new graphing techniques is also needed.

Thanks in advance; if you can schedule a few hours with us in that week to discuss and play, it would be great!

syadnom commented 1 year ago

I think I need to change speeds a bit. I need to get hardware in place that is well suited to virtualization so I can actually participate.

On Wed, Nov 9, 2022 at 11:19 AM Dave Täht @.***> wrote:

I could probably do that, I have 3 new sites coming online in the coming weeks. I don't however have routers in hand for them yet lol. Or a shaping box setup yet. … <#m8008698063660288303> On Fri, Nov 4, 2022 at 11:21 AM Dave Täht @.> wrote: I'm all into doing research on the CCR2xxx... in the v1.4 release cycle. If you can spare the cycles and pound it through a few cake benchmarks (interface queues vs shaped queues), in the meantime, that would be great. But I really, really need two "virgins" to do a test install of 1.3 when it hits beta, and give feedback on the documentation, gui, and actual performance... @syadnom https://github.com/syadnom https://github.com/syadnom https://github.com/syadnom - are you up for that?? Or have you already deployed 1.2 and are following along on the development branches for 1.3? In particular, I would like cpu usage vs bandwidth on this side of things... thebracket/cpumap-pping#2 https://github.com/thebracket/cpumap-pping/issues/2 <thebracket/cpumap-pping#2 https://github.com/thebracket/cpumap-pping/issues/2> — Reply to this email directly, view it on GitHub <#120 (comment) https://github.com/rchac/LibreQoS/issues/120#issuecomment-1303909221>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACKFOZPPRMMCDVBRGTKRAKTWGVAYPANCNFSM6AAAAAAQVF2QVI https://github.com/notifications/unsubscribe-auth/ACKFOZPPRMMCDVBRGTKRAKTWGVAYPANCNFSM6AAAAAAQVF2QVI . You are receiving this because you were mentioned.Message ID: @.>

Anything you can do with us during the beta period would be great, no matter how big or how small. It's really hard to thoroughly test this stuff, and even harder, when you've been heads-down in the code, to understand how a user thinks or what their real requirements are.

We have a "hold my beer, we got this" moment scheduled with an ISP that is going to attempt moving directly (overnight) from v1.1 to v1.3 on Nov 17th; the more of those we can do, the better the release will be for everyone.


SirBryan commented 1 year ago

I've got a site with 50ish customers on 50- and 100Mbps plans, using Ubiquiti LTU and AirMax. Right now the radios do the shaping, but in light of this thread, and as a quick POC, I thought I'd throw some queues into the RB4011 running the site.

At this site, two VLANs are shared by the six APs. I need to migrate everybody to one VLAN per AP, which would then make it easier to shape to each AP's theoretical limits. But for a simple test, I created two master queues with an estimate of how much the APs in aggregate could handle.

The repeatable part of the script parses the DHCP table, creates mangle filters and queues for each IP address, and gives them all 50Mbps queues for now. This was mainly to see the loading on the router. The site maxes out at about 350-400Mbps, so there's plenty of CPU to experiment with.

Right now I'm doing fq-codel, but could easily switch it to Cake. This shapes the download only at present; the upload links are a bonded pair with fq-codel enabled on the member ports as a hardware queue.

```
# Add an fq-codel queue type
/queue type
add fq-codel-limit=1024 fq-codel-quantum=300 kind=fq-codel name=fq-codel

# Set up the tree (single VLAN for example)
/queue tree
add limit-at=325M max-limit=325M name=4009-main parent=vlan4009 queue=fq-codel
add limit-at=320M max-limit=320M name=4009-no-mark packet-mark=no-mark parent=4009-main queue=fq-codel

# The rest can be run as a script every few minutes, hours, or daily

# Remove existing mangle rules
/ip firewall mangle remove [ find ]

# Loop through DHCP leases and add a 50M fq-codel queue towards each customer IP.
# "server4009" is the name of the DHCP server; "4009-main" is the name of the
# parent queue, which has 325Mbps.
/ip dhcp-server lease
:foreach line in=[ find where server=server4009 ] do={
    :local Address;
    :set Address [ get value=address $line ];
    /ip firewall mangle add action=mark-packet chain=forward dst-address=$Address new-packet-mark=$Address passthrough=no;
    /queue tree remove [ find where name=$Address ];
    /queue tree add limit-at=50M max-limit=50M name=$Address packet-mark=$Address parent="4009-main" queue=fq-codel;
}
```

The customer queue data could easily be created and run from a different host (i.e. fed by UISP data), or DHCP data grabbed from remote routers and then the queues exported/loaded to a central shaping router.
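As a sketch of that idea, a small Python script on an external host could render the same mangle and queue-tree commands from a lease list pulled from UISP or a remote router, ready to paste or push over the RouterOS API. The function name, parent queue name, and 50M default below are illustrative assumptions, not anything LibreQoS ships:

```python
# Sketch: generate the per-customer RouterOS commands above from a lease list
# supplied by some external source (UISP, a remote router's DHCP table, etc.).

def render_queue_script(leases, parent="4009-main", rate="50M"):
    """Return RouterOS commands mirroring the per-customer queues above."""
    lines = ["/ip firewall mangle remove [ find ]"]
    for ip in leases:
        # Mark packets destined to the customer IP...
        lines.append(
            f"/ip firewall mangle add action=mark-packet chain=forward "
            f"dst-address={ip} new-packet-mark={ip} passthrough=no"
        )
        # ...then (re)create a shaped child queue keyed on that mark.
        lines.append(f"/queue tree remove [ find where name={ip} ]")
        lines.append(
            f"/queue tree add limit-at={rate} max-limit={rate} "
            f'name={ip} packet-mark={ip} parent="{parent}" queue=fq-codel'
        )
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_queue_script(["10.40.9.10", "10.40.9.11"]))
```

The output could then be delivered however is convenient: pasted into a terminal, pushed via SSH, or sent through the RouterOS API.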

dtaht commented 1 year ago

Yep, you can just switch that queue type to cake with diffserv4 and ack-filter.

SirBryan commented 1 year ago

@dtaht Did a fresh install of 1.3 on an Ubuntu VM on top of ESXi 6.7 on a Mac Pro 2010 Xeon W3690. The Intel X710 is passed through to the VM. VM has 8GB of RAM and 4 cores (host CPU is a six-core w/HT).

Docs miss/skip the part where I have to run an integration script (integrationUISP.py in my case) to populate network.json and ShapedDevices.csv. Other than that it was pretty straightforward.

I have not put the machine in-line yet. This is just a report on the setup piece.

SirBryan commented 1 year ago

Also, the current version of InfluxDB that I just installed (2.5.1) is not accepting the dashboard template linked in the setup documentation. It did work on the first box I set up which is running 2.4.

rchac commented 1 year ago

@SirBryan Yes they changed the format of the dashboard starting with 2.5.1. Added.

SirBryan commented 1 year ago

Loaded v1.3 on an Intel NUC (CPU is a 6-core Intel i7-10710U) with an X520 in a Thunderbolt chassis (the setup has been inline for months just bridging traffic). The UISP integration script pulled 252 of my 440 subs from the database in flat mode (lots of cleanup yet to do in UISP to get the rest and to get APs loaded up).

We're pushing 1.5Gbps through the network and after running just the LibreQoS script, utilization across all cores is 2-4% (0.15 load average) with Cake enabled and InfluxDB polling off.

After configuring the scheduler to run as a service, I started noticing throughput drop significantly at regular intervals, according to the upstream MikroTik. I'm guessing the refreshLatencyGraphs function is CPU-intensive enough that this little box can't handle publishing the data, whether to a local InfluxDB instance or a remote one, while also shaping everything. I'll have to migrate to the Xeon and try it out.

But for shaping, it appears to be working.

rchac commented 1 year ago

@SirBryan I'm glad shaping is working, but that is interesting that the refreshLatencyGraphs function uses that much CPU. I had not observed that on other deployments. You see it happening every 30-45 seconds or so?

dtaht commented 1 year ago

Locking the bridge to 4 cores and the other stuff to the other cores might help. What's the ethernet chipset?
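On Linux, one way to do that pinning is to steer the NIC's interrupts via `/proc/irq/<n>/smp_affinity`, which takes a hex bitmask of allowed cores. A tiny helper to compute those masks (purely an illustration of the mask format; LibreQoS doesn't do this for you):

```python
# Sketch: compute the hex masks written to /proc/irq/<n>/smp_affinity when
# pinning NIC interrupts to a subset of cores.

def affinity_mask(cores):
    """Bitmask with one bit set per CPU core, as smp_affinity expects."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return f"{mask:x}"

if __name__ == "__main__":
    # e.g. NIC IRQs on cores 0-3, everything else on cores 4-7:
    print(affinity_mask([0, 1, 2, 3]))  # -> f
    print(affinity_mask([4, 5, 6, 7]))  # -> f0
```

The resulting string would be echoed into the relevant `/proc/irq/<n>/smp_affinity` files (with irqbalance disabled so it doesn't undo the pinning).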

thebracket commented 1 year ago

Is Influx on a different box? It can get pretty heavy. I think we can reduce the graphing load in a future version.


SirBryan commented 1 year ago

@rchac Yes. As soon as I turned off InfluxDB updates, it shapes without any issues. We're now into peak hours and it's fluctuating between 1.3-1.8Gbps. Load is up to 0.33 with 4-6% per core.

@thebracket Whether the InfluxDB target was local or on the Xeon machine, it behaved the same. Traffic was limited to around 30-40Mbps for a few seconds, then slowly crept back up to normal levels. I have top running, and the cpumap//map:4 processes are at the top. As soon as refreshLatencyGraphs ran, the python3 process would catapult to the top and the cpumap processes would scatter or disappear.

@dtaht This is an intel X520 card (82599EN chipset, ixgbe driver).

I mostly wanted to see if the NUC could handle it: 1) it's the hardware I had (and I had used it to evaluate netElastic's FlexBNG, where it worked well), 2) deploying these closer to the edge is much easier than rack-mount servers, and 3) I had put it inline months ago to test LibreQoS and finally had some downtime to work on it today.

I'll see how well it works on the Xeon host and report my results.