Open interduo opened 2 months ago
Network traffic is passing, but QoS policies are not applied to it.
root@libreqos-beta:~# uname -a
Linux libreqos-beta 6.8.0-38-generic #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
00:10.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Subsystem: Mellanox Technologies MT27700 Family [ConnectX-4]
00:11.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Subsystem: Mellanox Technologies MT27700 Family [ConnectX-4]
The NICs were tested with older LibreQoS versions on Ubuntu 22.04, where they worked like a charm.
First look:
root@libreqos-beta:~# tc class show dev ens16np0 | wc -l
5693
root@libreqos-beta:~# tc class show dev ens17np0 | wc -l
5581
I actually saw something like this for the first time last night. A LibreQoS.py run took 14 minutes to complete, where it used to be maybe a second. I wasn't in a setup where I could do much real diagnosis, so I'll be trying to recreate it and figure out what's getting snarled up. I'm not sure if the lqosd trace is all that useful - because it uses Futexes a lot (they are the base behind Rust's Mutex type, and hard to avoid for any kind of concurrent setup).
It doesn't want to recreate on my local setup, which is going to make this a harder one to debug.
I can give you my Proxmox VM backup if you suspect that I did something wrong. The most interesting thing is that if I redirect the network traffic elsewhere, the reload completes successfully after a while. This bug doesn't occur on LibreQoS when no network traffic is passing.
I wonder if ProxMox is the common factor here? Mine was also in ProxMox, passing about 1gbps at the time. It did eventually complete. I'll dive into this as soon as the coffee has done something.
I was carrying ~6 Gbps of total network throughput on the interface when I ran ./LibreQoS.py from the console.
This one is definitely going to be tricky. It's early morning and our traffic is pretty low (~400 Mbps) and it ran without hiccups on the live box. (It also ran on my local system with about a gigabit of iperf traffic being forced through it).
From timing the parts, it seemed like the longest delays were in:
Executing XDP-CPUMAP-TC IP filter commands
Executed 1281 XDP-CPUMAP-TC IP filter commands
(Not terrible, but enough that I was surprised to see it waiting - it didn't used to slow down there). Will investigate further.
Update: Running it again shows that there's a really big delay there that didn't used to be there. So at least now I have a candidate to examine.
I've identified the issue. The "hot cache" was being invalidated after every single IP mapping change, rather than once at the end (you have to invalidate it for changes to appear). So I'm in the process of changing the workflow slightly to explicitly flush at the end. My local test (hacked together rather than nice, shareable code) saw a MASSIVE improvement in reload times doing this.
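The difference between invalidating per change and flushing once at the end can be illustrated with a toy model. This is only a sketch: `MappingTable`, `add_mapping`, and `invalidate_cache` are hypothetical names for illustration, not the actual lqosd or LibreQoS.py API, and the addresses and handles are made up.

```python
class MappingTable:
    """Toy stand-in for an IP mapping table fronted by a lookup cache."""

    def __init__(self):
        self.mappings = {}      # ip -> (cpu, tc_handle)
        self.hot_cache = {}     # cached lookups that go stale on change
        self.invalidations = 0  # number of cache flushes performed

    def invalidate_cache(self):
        self.hot_cache.clear()
        self.invalidations += 1

    def add_mapping(self, ip, target, flush=True):
        self.mappings[ip] = target
        if flush:
            self.invalidate_cache()


def reload_per_change(table, changes):
    """Old behaviour in this model: flush after every single mapping."""
    for ip, target in changes:
        table.add_mapping(ip, target, flush=True)


def reload_batched(table, changes):
    """New behaviour in this model: apply everything, then flush once."""
    for ip, target in changes:
        table.add_mapping(ip, target, flush=False)
    table.invalidate_cache()


# 1281 made-up mappings, matching the command count quoted in the thread.
changes = [(f"10.0.{i // 256}.{i % 256}", (i % 4, f"1:{i + 1:x}")) for i in range(1281)]

slow = MappingTable()
reload_per_change(slow, changes)
fast = MappingTable()
reload_batched(fast, changes)
print(slow.invalidations, fast.invalidations)
```

Both tables end up with identical mappings; only the number of flushes differs, which is where the reload time would go if each flush is expensive under load.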
Did you try to load 50K circuits? Maybe there are more places to improve?
It's not so important. More important is "no packet loss during reload".
My bet would be you need to run LibreQoS.py and check the output?
I did. Nothing looks strange there - it even creates the tc classes and tc qdiscs. OK - I will do "round two" on a clean OS installed from ISO, using the newly added fixes in the develop branch.
Tested - now it doesn't hang.
XDP filters: 0.0256 seconds
But I have a message:
Jul 12 08:40:25 libreqos-beta python3[13342]: /opt/libreqos/src/scheduler.py:62: UserWarning: Some devices were not shaped. Please check to ensure they have a valid ParentNode list>
Jul 12 08:40:25 libreqos-beta python3[13342]: refreshShapers()
What can I do about it? I have a flat network architecture.
Double check that you didn't put anything in a parent node that shouldn't be there; I'll be glad to take a look otherwise (I have a "flat" test setup, but don't touch it often - none of my networks are even remotely flat!). If you want, fire up the lqos_support_tool and send a support dump. (Sorry for the edits, my spelling is bad this morning)
Double check that you didn't put anything in a parent node that shouldn't be there;
I don't understand.
root@libreqos-beta:/opt/libreqos/src# cat network.json
{}
I submitted a dump with lqos_support_tool. Edit as much as you want, as long as your message is the last one in the issue.
UserWarning: Some devices were not shaped.
This message says too little.
Suggestion: show the ShapedDevices.csv line and/or CircuitID+DeviceID.
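A more informative warning along the lines of this suggestion could look like the sketch below. It is not the LibreQoS.py implementation: `describe_unshaped`, the `shaped_device_ids` set, the reduced column set, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Made-up sample with a reduced, hypothetical set of columns.
SAMPLE_CSV = """Circuit ID,Device ID,Parent Node,IPv4
C100,D100,,100.64.0.10
C101,D101,Tower_A,100.64.0.11
C102,D102,,100.64.0.12
"""


def describe_unshaped(rows, shaped_device_ids):
    """Return one line of detail per device that was not shaped."""
    lines = []
    # The CSV header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(rows, start=2):
        if row["Device ID"] not in shaped_device_ids:
            lines.append(
                f"line {line_number}: CircuitID={row['Circuit ID']} "
                f"DeviceID={row['Device ID']} ParentNode={row['Parent Node']!r}"
            )
    return lines


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
# Pretend the shaper only managed to place D100 and D102.
report = describe_unshaped(rows, shaped_device_ids={"D100", "D102"})
message = "Some devices were not shaped:\n" + "\n".join(report)
print(message)
```

Reporting the CSV line number together with the circuit and device IDs would let the operator jump straight to the offending row.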
Thanks for the support dump (I love that new tool!). I don't see anything jumping out in the Shaped Devices list - so I'm going to assume that there's a bug to chase down in the flat network handler in LibreQoS.py. I'll hammer on that later today. Thanks!
(I'm assuming "KOMENTARZ" means comment?)
Yes - KOMENTARZ means comment, on production this is replaced with some circuit description/identity
I've got bad news. This is not fixed even with https://github.com/LibreQoE/LibreQoS/pull/520. The reload stops at the same point.
UserWarning: Some devices were not shaped.
This message says too little. Suggestion: show the ShapedDevices.csv line and/or CircuitID+DeviceID.
@thebracket could you add more info to this error?
Checked again - this bug also exists in the newly released -beta2
Queue and IP filter reload completed in 71.3 seconds
TC commands: 5.9 seconds
XDP setup: 61.9 seconds
XDP filters: 0.2948 seconds
refreshShapers completed on 02/08/2024 14:33:43
When I stop passing network traffic, the reload finishes very quickly.
Aug 05 10:12:55 libreqos-beta lqosd[985]: [2024-08-05T08:12:55Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:12:56 libreqos-beta lqosd[985]: [2024-08-05T08:12:56Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:12:57 libreqos-beta lqosd[985]: [2024-08-05T08:12:57Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:12:58 libreqos-beta lqosd[985]: [2024-08-05T08:12:58Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:12:59 libreqos-beta lqosd[985]: [2024-08-05T08:12:59Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:00 libreqos-beta lqosd[985]: [2024-08-05T08:13:00Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:01 libreqos-beta lqosd[985]: [2024-08-05T08:13:01Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:02 libreqos-beta lqosd[985]: [2024-08-05T08:13:02Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:03 libreqos-beta lqosd[985]: [2024-08-05T08:13:03Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:04 libreqos-beta lqosd[985]: [2024-08-05T08:13:04Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:05 libreqos-beta lqosd[985]: [2024-08-05T08:13:05Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:06 libreqos-beta lqosd[985]: [2024-08-05T08:13:06Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:07 libreqos-beta lqosd[985]: [2024-08-05T08:13:07Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:08 libreqos-beta lqosd[985]: [2024-08-05T08:13:08Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:09 libreqos-beta lqosd[985]: [2024-08-05T08:13:09Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:10 libreqos-beta lqosd[985]: [2024-08-05T08:13:10Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:11 libreqos-beta lqosd[985]: [2024-08-05T08:13:11Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:12 libreqos-beta lqosd[985]: [2024-08-05T08:13:12Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
Aug 05 10:13:13 libreqos-beta lqosd[985]: [2024-08-05T08:13:13Z WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast
I found that in my logs
That's been there since lqosd existed - it just means the queues haven't been made yet, and there's nothing useful to read from a pfifo queue that's there by default.
Is the "hangs" still an issue?
Yes, but only when heavy traffic is going through LibreQoS.
That's been there since lqosd existed - it just means the queues haven't been made yet, and there's nothing useful to read from a pfifo queue that's there by default.
No - in v1.4 I don't get such a warning message, and lqosd was there. To me, "don't know how to parse" means "the script doesn't understand" - maybe we should make the message a little more precise?
I tried to reload without traffic - it goes OK, but the traffic is not shaped.
I am really confused and don't know what to check more.
@rchac I recorded video: http://kłopotek.pl/lqos/screen-recording-06082024.mp4
One problem is solved (reloading under heavy network traffic) after https://github.com/LibreQoE/LibreQoS/pull/545
The second one is still an issue (no traffic shaping, and the message WARN lqos_queue_tracker::queue_types] I don't know how to parse qdisc type pfifo_fast).
I get this message on every reload in my journalctl -u lqosd.
For the 1,000,000,000th time, that warning message isn't a bug.
OK - but I checked again, and in v1.4 (on the working VM) there was no such warning, contrary to what you said earlier.
What to check then?
The reason you're seeing that message is that when it polls the queues, it's finding pfifo and not Cake - so the message isn't the issue. The question is, why don't you have any queues?
I'd start by going into your config, and changing this line back to 0, the default (or removing it):
override_available_queues = 26 # This can be omitted and be 0 for Python
Does lsmod | grep cake show sch_cake loaded? (If it doesn't, that'd indicate that Cake isn't installed.)
What does sudo tc -s qdisc show dev (ifname) show? (Replace ifname with your interface: ens16np0 and ens17np0.)
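The qdisc check can also be scripted. The sketch below counts qdisc kinds in `tc qdisc show` output, so it is easy to see whether an interface only has the default pfifo_fast or actually carries Cake queues. The helper names and both sample outputs are illustrative assumptions, not taken from the reporter's system.

```python
import subprocess
from collections import Counter


def qdisc_kinds(tc_output):
    """Count qdisc kinds in the text printed by `tc [-s] qdisc show`."""
    counts = Counter()
    for line in tc_output.splitlines():
        fields = line.split()
        # Definition lines start with "qdisc <kind>"; statistics lines do not.
        if len(fields) >= 2 and fields[0] == "qdisc":
            counts[fields[1]] += 1
    return counts


def read_qdiscs(ifname):
    """Run tc for one interface (needs tc installed; not called below)."""
    result = subprocess.run(
        ["tc", "-s", "qdisc", "show", "dev", ifname],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative samples in the general shape of tc output.
DEFAULT_ONLY = """qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
"""
SHAPED = """qdisc mq 7fff: root
qdisc htb 1: parent 7fff:1 r2q 10 default 0x2 direct_packets_stat 0
qdisc cake 8003: parent 1:3 bandwidth unlimited
qdisc cake 8004: parent 1:4 bandwidth unlimited
"""

print(qdisc_kinds(DEFAULT_ONLY)["cake"])
print(qdisc_kinds(SHAPED)["cake"])
```

On a live box, `qdisc_kinds(read_qdiscs("ens16np0"))` would give the same breakdown for a real interface; a count of zero for `cake` matches the situation the warning describes.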
I have queues loaded (qdiscs and classes are on the interface), but it looks like the XDP filters are not working. How can I list them?
I checked that sch_cake is loaded (it is) and set override_available_queues = 0. While testing, I get a lot of packet loss :(
I collected some data for diagnosis: http://kłopotek.pl/lqos_beta_problem/
The hardware is good: on a second VM (with v1.4 and Ubuntu 22.04) I use the same passthrough NICs and it works well.
The maximum network throughput I could reach with LibreQoS v1.5-beta2 was:
On an almost empty VM, lqosd takes 26% of a CPU core - is that normal?
I tried setting monitor_only = true - there was less packet loss, but only 50% of the network throughput was available.
Maybe there is something wrong with the XDP bridge?
I will check tomorrow whether those problems exist on the older Ubuntu LTS (22.04) with the newest LibreQoS.
I tried to install the develop branch on Ubuntu 22.04.
On the older Ubuntu 22.04 there was:
ens19: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
On newer 24.04:
ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
When compiling:
Compiling tower-http v0.5.2
warning: lqos_sys@0.1.0: [bpf skel] libbpf: elf: skipping unrecognized data section(11) .rodata.str1.1
Is it important?
Tried to run lqosd:
root@libreqos-beta-oldubuntu:/opt/libreqos/src/bin# ./lqosd
Error: Unable to load the XDP/TC kernel (-13)
I am trying to run lqosd on VirtIO NICs on the older Ubuntu 22.04 for testing, and will then test whether it works on passthrough NICs.
Strace: http://kłopotek.pl/lqos_beta_problem/strace_loading_lqosd_on_olderubuntu
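If the (-13) in "Unable to load the XDP/TC kernel (-13)" is a negated errno value, which is an assumption here and not something the thread confirms, it can be decoded with the Python standard library. `describe_errno` is a hypothetical helper name.

```python
import errno
import os


def describe_errno(code):
    """Map a (possibly negated) errno value to its symbolic name and text."""
    number = abs(code)
    return errno.errorcode.get(number, "UNKNOWN"), os.strerror(number)


name, text = describe_errno(-13)
print(name, "-", text)
```

Under that assumption, the code corresponds to EACCES, which would fit a load being refused rather than a missing file.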
warning: lqos_sys@0.1.0: [bpf skel] libbpf: elf: skipping unrecognized data section(11) .rodata.str1.1
Is harmless. I just can't stop Linux from emitting it.
Do you have the hot-cache PR applied? I wonder if I exceeded the older instruction limit (the intent was not to require the newer kernel). I'm still hoping for a better solution than the one in that PR (which is why I haven't merged it).
I tested on develop + patch-33 + cherry-pick commit from #545
testbed1: ubuntu 24.04 testbed2: ubuntu 22.04
Is harmless. I just can't stop Linux from emitting it.
OK - on Ubuntu 24.04 there are no warnings during compilation.
Remind me - Patch-33?
https://github.com/LibreQoE/LibreQoS/pull/505 patch-33 is branch name
Tested again with PR https://github.com/LibreQoE/LibreQoS/pull/547/commits. Reloading is OK (not hanging). If you merge PR #547, please close this issue; I will open another one for the next, and maybe last, problem.
Are you back, @thebracket?
I did, and straight to the world of the unwell (Daughter got a stomach bug, now I'm out with it)
I hope she is better now. Just ping me when you have time.
I don't know whether it is related to this bug, so I created a new issue: https://github.com/LibreQoE/LibreQoS/issues/549.
I installed LibreQoS. Running ./LibreQoS.py hung at:
Checking deeper:
Running ./LibreQoS.py --debug shows that: [hangs here]
strace -p 1677
If I take the network traffic off the LibreQoS network interfaces (by shutting down the VLAN facing the internet), the reload continues and I can see: