LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0
447 stars 48 forks source link

Adding/Reloading single subscriber service parameters #49

Closed interduo closed 1 year ago

interduo commented 2 years ago

tc qdisc/class/filter support action change - this give possibility to not reload whole ruleset - but only change selected user DL/UP parameters.

This function would reolace full reload function in most cases (generating 5k+ rules is always taking much time)

rchac commented 2 years ago

That would definitely be useful for DL/UL parameters of a single host. But you wouldn't be able to change the assigned qdisc of a target IP. Usually when I need to make a change, I'm having to add or remove a target IP or change its qdisc, which because of how XDP works we can only achieve with a full teardown/reload of xdp-cpumap-tc IP filter rules. If just changing that single host's DL/UL is enough though for your use case we can look more into that as an API call.

interduo commented 2 years ago

That would definitely be useful for DL/UL parameters of a single host.

That is the most frequent function we use now.

But you wouldn't be able to change the assigned qdisc of a target IP

I imagine this like:

API call like set_subscriber_service with parameters:

API call set_subscriber_service do:

  1. check if API call get all required parameters (IPv4/IPv6 and network service throughoutput),
  2. check if there is a src/dst address for subscriber_service in tc rules true: remove only one single subscriber_service and add it again (or use tc change), false: add single only one single subscriber_service,

Now we use scratch linux (without XDP) and we do it like: we run at CRM: change_customer_service.sh 172.18.104.34 10240 10240 102400 102400

The script on QoS router generates this code:

tc filter del dev ens16np0.600 parent 1: handle 281:22:800 prio 1 protocol ip u32
tc filter add dev ens16np0.600 parent 1:0 prio 1 protocol ip u32 ht 281:0x22 match ip dst 172.18.104.34 flowid 1:3824
tc filter del dev ifb0 parent 2: handle 281:22:800 prio 1 protocol ip u32
tc filter add dev ifb0 parent 2:0 prio 1 protocol ip u32 ht 281:0x22 match ip src 172.18.104.34 flowid 2:3824
ipset add mark 172.18.104.34
tc class add dev ens16np0.600 parent 1:0 classid 1:3824 hfsc ls m2 102400kbit ul m2 102400kbit
tc class add dev ifb0 parent 2:0 classid 2:3824 hfsc ls m2 102400kbit ul m2 102400kbit
tc class change dev ens16np0.600 parent 1:0 classid 1:3824 hfsc ls m2 102400kbit ul m2 102400kbit
tc class change dev ifb0 parent 2:0 classid 2:3824 hfsc ls m2 102400kbit ul m2 102400kbit
tc qdisc add dev ens16np0.600 parent 1:3824 sfq perturb 10
tc qdisc add dev ifb0 parent 2:3824 sfq perturb 10

We don't want remove all the customers definitions - just readd one change classes. (we do full-reload every night for being sure that startup scripts contains all the need classes as a startup tc definitions)

Is this possible using XDP?

marsalans commented 2 years ago

Updating single entry would be great.

rchac commented 2 years ago

I think we can do that, since XDP just redirects IP traffic to its respective qdisc/class, which can be modified after the fact using tc class change.

interduo commented 2 years ago

I think all possible situations for function set_subscriber_service($params) are:

marsalans commented 2 years ago

I think all possible situations for function set_subscriber_service($params) are:

  • subscriber is added,
  • subscriber is removed (packets of customer should go to default queue),
  • subscriber is disabled (packets should be dropped),
  • subscriber got changed values of Rate/Ceil classes,
  • subscriber got added CircuitID,
  • subscriber got removed CircuitID,

ability to change sites

interduo commented 2 years ago

Changing sites would be complicated - this should be done by full reload. There are many corner edges for that and changing sites dont occure much... Am I wrong @marsalans?

At the end if site exists and we dont remove only host or add new site... changing site comes to remove+add customera function.

marsalans commented 2 years ago

I agree but, my point of view for this is lets say a there are many sites and all have queue of mix service plans, so if a site is short for bandwidth (full) and we modified it to a higher bandwidth then the subscriber will face trouble.

interduo commented 2 years ago

So this is not changing the sites in customer row but changing the sites parameters - that should be simple. This is not subject of this issue (this also should be done) - please create another issue its good to be concentrated on one thing.

interduo commented 2 years ago

@rchac my propose of the implementation be like:

--action=change --circuit-id="1234,1235" --dlrate=5M --dlceil=10M --uprate=5M --upceil=10M --comment="TEST123"

steps:

--action=remove --circuit-id="1234,12345"

steps:

--action=newcircuitid --parrent-node="AP_0" --dlrate=5M --dlceil=10M --uprate=5M --upceil=10M --comment="TEST123" --ips="172.20.20.20,172.20.20.30"

steps

--action=addip --circuit-id="1234" --ips="172.20.20.20,172.20.20.30"

steps:

--action=delip --circuit-id="1234" --ips="172.20.20.20,172.20.20.30"

steps:

-- All the things needs to be done on both interfaces and at the end it would be good to save currentConfig.csv.

rchac commented 2 years ago

I appreciate how thorough and well thought out this is. Thank you.

Before I proceed to do it this way - would this idea below be any more ideal? I'm thinking this would make your CRM integration easier, because you can just have it update ShapedDevices.csv continually throughout the day and LibreQoS would handle these edge cases for you.

So during the day this happens:

At 4AM each day, scheduled.py could run LibreQoS.py without the -soft-refresh parameter to fully reload the structure and all qdiscs

interduo commented 2 years ago

It's different approch but also suitable.

We will have <10 minutes lag or we will have to run LibreQoS.py --soft-refresh. It would be easier to implement at CRM side - just put the file and "Voilà".

interduo commented 2 years ago

I checked whole process: generating and uploading whole config takes about 4 seconds.

Well ... maybe we could use both approaches? If You just divide code into functions it would be easy to implement both scenarios.

If You do parameters it would be easy to implement this on API. And on API we could not have such a bad things like race conditions or case mentioned here --> https://github.com/rchac/LibreQoS/issues/52

interduo commented 2 years ago

I agree but, my point of view for this is lets say a there are many sites and all have queue of mix service plans, so if a site is short for bandwidth (full) and we modified it to a higher bandwidth then the subscriber will face trouble.

The validator should check if the customer rates sum is bellow site rates. If that occurs should rise a warning. Also this check should be done for sites/ap/pop. Modifying sites params would be also helpful it could be like:

--action=change --site-name="AP1" --dlrate=5M --dlceil=10M --uprate=5M --upceil=10M --comment="TEST123"

rchac commented 2 years ago

Added ability to change single circuit/subscriber shaping parameters with new lqTools.py

interduo commented 2 years ago

image

I will do deeply testing on monday.

For now I only find a need for:

interduo commented 2 years ago
  1. The output is little messy:
    
    root@rty-libreqos:~/LibreQoS/v1.2# python3 lqTools.py --help
    usage: lqTools.py [-h]
                  {change-circuit-bandwidth,change-circuit-bandwidth-using-ip,show-active-plan-from-ip,tc-statistics-from-ip}
                  ...

positional arguments: {change-circuit-bandwidth,change-circuit-bandwidth-using-ip,show-active-plan-from-ip,tc-statistics-from-ip} change-circuit-bandwidth Change bandwidth rates of a given circuit using circuit ID change-circuit-bandwidth-using-ip Change bandwidth rates of a given circuit using IP show-active-plan-from-ip Provide tc class info by IP tc-statistics-from-ip Provide tc qdisc stats by IP

options: -h, --help show this help message and exit


we just remove those lines:
              {change-circuit-bandwidth,change-circuit-bandwidth-using-ip,show-active-plan-from-ip,tc-statistics-from-ip}
              ...

positional arguments: {change-circuit-bandwidth,change-circuit-bandwidth-using-ip,show-active-plan-from-ip,tc-statistics-from-ip}


2. There is also need for showing an example of command use with change-circuit-bandwidth/change-circuit-bandwidth-using-ip or adding subarguments descriptions - because we would know the rest needed arguments (unless we don't view into source code).

For example:

lqTools.py --change-circuit-bandwidth-using-ip --ip-address="172.20.2.42" --min-download=2 --min-upload=2 --max-download=4 --min-upload=4



`lqTools.py --change-circuit-bandwidth --circuit-id=1000 --min-download=2 --min-upload=2 --max-download=4 --min-upload=4`

BTW I add +x for script in https://github.com/rchac/LibreQoS/pull/109

3. If any of the parameters are not set we could use the minimum possible value 2M.
4. It would be good to use the same `--ip-address` or `--ip`/`--ips` everywhere in libreqos.
interduo commented 2 years ago

I didn't find any critical bugs in productions --change-circuit-bandwidth-using-ip

I thinked about this issue a little much more in more wider perspective.

Generating ShapedDevices.csv file with 8k takes about 4seconds from CRM. It's the time that could be downed maybe to 3 or 2 seconds by making code better but i think that is end). Sending to LibreQoS is about ~0.3-0.5 sec. Checking the differences between files would take about few seconds more. (and if more queues it will take longer).

Making Your code scenario would block this function to be blazing fast. (if we add new circuit in worst scenario we do make changes at 3-4 points: dhcpd, queue, nat, sometime also in firewall). Many of ISP got the same structure. I would like to make all this steps in few seconds - full remaking of ShapedDevices.csv makes it impossible no.

When are You planning to implement adding/removing single circuits from cmdline? This would eliminate full reload need in our case almost to 0.

This function (--add-circuit-bandwidth) could be very minimalized as only:

@rchac What do You think about it? Simple enought to implement it in halt an hour?

rchac commented 2 years ago

Sorry last week I was overwhelmed by ISP work and forgot about this. That seems reasonable. I'll try to get that done soon, hopefully before next week.

interduo commented 2 years ago

I also got a busy week. Are You only a WISP?

rchac commented 2 years ago

Yeah just a WISP. 350 subscribers so I perform the installs with a partner. Hoping to add fiber soon. You?

interduo commented 2 years ago

Yeah just a WISP. 350 subscribers so I perform the installs with a partner. Hoping to add fiber soon. You?

In past only WISP, now the overwhelming majority is FO.

interduo commented 2 years ago

Sorry last week I was overwhelmed by ISP work and forgot about this. That seems reasonable. I'll try to get that done soon, hopefully before next week.

Just give me a sign when it will be done and I will test it. It's only missing part of integration between LQoS na our CRM - LMS.

interduo commented 2 years ago

@rchac did You tried to implement this?

rchac commented 2 years ago

@rchac did You tried to implement this?

It's halfway done, the code to create a data structure of the current config is already implemented. I just need to have it add to that structure with a command. I am going to a conference so it will probably be a week or so.

interduo commented 2 years ago

I just need to have it add to that structure with a command.

Seems to be half an hour work or I did not see some deeper problem?

I am going to a conference so it will probably be a week or so.

Have a good time there. What is this conference name?

rchac commented 2 years ago

I just need to have it add to that structure with a command.

Seems to be half an hour work or I did not see some deeper problem?

I am going to a conference so it will probably be a week or so.

Have a good time there. What is this conference name?

Thanks! I'm at WISPApalooza.

And you're right. It's not that much code to do but I wanted to be able to actively respond to any bug reports right away. I'll work on it as soon as I'm back mid-week.

interduo commented 2 years ago

@rchac are You, ok?

dtaht commented 2 years ago

Heh. I figure he's wiped out by the show. I only got some energy back today...

rchac commented 2 years ago

Hey my apologies. I started working on this right after the conference, and got burned out when ISP work unexpectedly picked up. 😔 We are pursing some fiber grants and its very involved.

As I was working on the patch I realized it really would be a kind of hack with the way I was planning to do it. I may need some input/feedback on how to best tackle this. I want to be sure we introduce a thorough solution to allow for comparisons of the current ShapedDevices.csv to prior loads (so CRM integration updates can be run frequently without duplicate actions).

So I was thinking we save a backup of the last loaded ShapedDevices.csv (ShapedDevices.lastload.csv) and compare them, with newly discovered entries being added, existing entries being compared and updated if bandwidth rates changed, and missing entries being removed. In queuingStructure.json, we can add queueCounter to the end so the program can pick up where it left off in numbering queues. It will take some work but it can be done.

interduo commented 2 years ago

Hey my apologies. I started working on this right after the conference, and got burned out when ISP work unexpectedly picked up. pensive We are pursing some fiber grants and its very involved.

Don't worry this is not burning feature few days doesn't matter.

As I was working on the patch I realized it really would be a kind of hack with the way I was planning to do it. I may need some input/feedback on how to best tackle this. I want to be sure we introduce a thorough solution to allow for comparisons of the current ShapedDevices.csv to prior loads (so CRM integration updates can be run frequently without duplicate actions).

Generating new file always takes few seconds - this is not a big deal though but makes some complications and much work for CPU and DB at CRM side.

Why You want this instead making this feature the way I proppose? What are Your thoughts about the way I proppose?

From my point of view its better to have a control on it (decide of adding ips/making new circuits/removing ips/removing circuits/changing params) in CRM than doing very so complicated&cpu consuming, unneeded in long term view (as we always do full reload at 4am) work. Your approach is also not API-friendly.

we can add queueCounter

We could also check queues (in system, for last node) and just add one bit ;)

In queuingStructure.json, we can add queueCounter to the end so the program can pick up where it left off in numbering queues. It will take some work but it can be done.

For simplicity: I proppose don't save statistics for hosts added by this function. If You want statistics or network topology changed just do full reload - this simple to remmember.

Those two approaches could be done together.

rchac commented 2 years ago

I'm currently working on v1.3-alpha. It now has the ability to take in a ShapedDevices.csv file, and update JUST the clients that were most recently changed since the last full reload (usually at 4AM).

On this release we are really trying to focus on optimizations for larger ISP networks so the software can scale.

I recognize and understand your point. I believe it's important long term to have this functionality to make sure continuous updates from various CRMs can be performed multiple times per day without API calls, which not all CRMs support equally. Plus it saves you API calls. =) You can just have some script generate your ShapedDevices.csv every few minutes. LibreQoS will now process it (without a full reload each time) and handle everything for you.

One important change between v1.2 and v1.3 is that CircuitID is now a required field in ShapedDevices.csv. It can just be the "client id" or "client location id" field from your CRM of course. It must be provided because it now serves as a unique identifier. That allows us to more efficiently process queues on larger networks.

It needs a bit more testing, but it's promising.

interduo commented 2 years ago

Plus it saves you API calls. =)

I've got unlimited of API calls but limited count of CPU.

Ok - for now It would be enough. The main thing is we dont want to have interruptions when doing reloading for single customer.

rchac commented 2 years ago

And with this, you will finally have that!

When you run LibreQoS.py with the --updateonly parameter it will do a partial reload (update changed entries only). You can also run scheduler.py, and it will do the partial reload every 30 mins.

Each client in the newest ShapedDevices.csv gets checked against the last loaded ShapedDevices.csv. If any client has been modified since the last refresh, it will update the appropriate tc class and tc qdisc objects, correct IP filters, and store a record of the changes made.

It was a huge pain in the butt to write but it seems to work consistently and allows for lots of dynamic changes to happen over the course of the day.

The only limitation is that if you need to change any aspect of the network.json (network topology, AP lists, etc) that requires a full reload. For your current flat topology that isn't a concern though.

So you can have your CRM parsing script output a ShapedDevices.csv every 30 mins or so, and LibreQoS will check against the new ShapedDevices.csv file and handle the rest.

dtaht commented 2 years ago

I have kind of lost track, is this dependent on being able to do "tc change" via the python routeing protocol stuff, or... ?

rchac commented 2 years ago

I tried using pyroute2 originally but it turned out pyroute2 lacks some functionality around HTB we need. So yes, it does use tc change here.

dtaht commented 2 years ago

Yea, this and rust need a lot more help at the protocols layers. Been trying to find the rust folk doing a port, but was unaware the python lib was inadaquate. Sigh. I'm also confused about breaking the 32k barrier, did you end up using namespaces for that?

rchac commented 2 years ago

Thankfully all it took was modifying one line in cpumap-pping/xdp-cpumap-tc. Shoutout to @thebracket for his help there and for coding a great solution in xdp-cpumap-tc and cpumap-pping. I tested in a lab VM and 50k+ filters work now with that one line modified.

Right now the 32k limit only applies to how many classes can be on each CPU core. The :minor class counter is now unique per CPU core now, so it doesn't hinder us like it used to. If you have 6 CPU cores, ~192,000 unique subscribers can exist. Of course the CPU bottleneck will be an issue well before that, but the 32k limit seems to be a non-concern now!

dtaht commented 2 years ago

It would be nice to find someone with 64 cores to spare for a while. I kind of strongly suspect there will be a LOT of used gear coming onto the ebay market in the coming months. What is so special about the Xeon Gold? Certainly TONS of cache (and some measurements of it) is always good. Ironically, it seems systemd!! is eating up the most memory on top. On a fresh start, how much memory is used on boot?, then with 10k subs, measured via free, would suffice.

rchac commented 2 years ago

At around 400 subscribers, this is what we see. image I'll try to get more info from larger deployments. Curious if the systemd memory use is related to this or something similar?

thebracket commented 2 years ago

The last commit to cpumap-pping adds a "maximums.h" that lets you set the hash-size limit once, and scales everything including the TCP RTT tracking buffers. I'm still testing it, but it looks good on this end.

On Thu, Oct 27, 2022 at 4:20 PM Robert Chacón @.***> wrote:

At around 400 subscribers, this is what we see. [image: image] https://user-images.githubusercontent.com/22501920/198400020-f818bde3-17a9-4a00-b4d2-69fc49d24cbd.png I'll try to get more info from larger deployments. Curious if the systemd memory use is related to this https://github.com/rchac/LibreQoS/issues/30 or something similar?

— Reply to this email directly, view it on GitHub https://github.com/rchac/LibreQoS/issues/49#issuecomment-1294076409, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADRU4336I7BLAE5K3PKMJWTWFLW3HANCNFSM5766RGFA . You are receiving this because you were mentioned.Message ID: @.***>

dtaht commented 2 years ago

https://twitter.com/mtaht/status/1585743263478079490

dtaht commented 1 year ago

We're looking done here. Good job! See also #153 - how do we know we've truly cracked the 32k user barrier? Same goes for ipv6 testing?