Open interduo opened 4 months ago
Are you continuing to see this? I've poked through logs on 3 different deployments, and not seen it. I also don't have any unconfigured systems, though - so maybe that's related?
It was a bug a while back, but I thought we'd caught all of those.
I will check that again building from scratch tommorow. It would be nice if You would merge my PRs. It repairs some other small bugs.
If You would like i can give You access on my testbed.
It would be nice if You would merge my PRs.
I'll go through them when I'm back from my travels.
After doing: git clone -b develop git@github.com:LibreQoE/LibreQoS.git git remote add interduo git@github.com:interduo/LibreQoS.git git fetch interduo git cherry-pick 5e22b0866e843a683a6f32e13fe8f18556b7631c f443472b005fe5bc7372fbca6a7e4e910f42de07 425af46d887fa2a0c5a091b87a71a1221de3eabd ./build_dpkg.sh dpkg -i libreqos_1.5-BETA1.202407090816-1_amd64.deb
The problem is off, so I suspects that this is a problem of beta package libreqos_1.5.202406251802-1_amd64.deb only.
Ohhh the problem apperas only when boot when lqosd is starting for first time. Maybe lqosd tries starts too early?
Jul 09 11:47:52 libreqos-beta lqosd[976]: [2024-07-09T09:47:52Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(108), ...)
Jul 09 11:47:53 libreqos-beta lqosd[976]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 09 11:47:53 libreqos-beta lqosd[976]: overflow when subtracting durations
Jul 09 11:47:53 libreqos-beta lqosd[976]: [2024-07-09T09:47:53Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(111), ...)
Jul 09 11:47:54 libreqos-beta lqosd[976]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 09 11:47:54 libreqos-beta lqosd[976]: overflow when subtracting durations
Jul 09 11:47:54 libreqos-beta lqosd[976]: [2024-07-09T09:47:54Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(114), ...)
Jul 09 11:47:55 libreqos-beta lqosd[976]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 09 11:47:55 libreqos-beta lqosd[976]: overflow when subtracting durations
Jul 09 11:47:55 libreqos-beta lqosd[976]: [2024-07-09T09:47:55Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(117), ...)
Jul 09 11:48:44 libreqos-beta lqosd[976]: [2024-07-09T09:48:44Z WARN lqosd] Terminating on SIGTERM
I had a bad iface name in netplan. The interface was not up. Maybe we should check if iface is up?
Also find out when booting:
Ok there is more debug info in dmesg:
[ 9.642699] mlx5_core 0000:00:10.0 ens16np0: Link up
[ 9.838999] mlx5_core 0000:00:11.0 ens17np0: Link up
[ 17.868271] ------------[ cut here ]------------
[ 17.868279] UBSAN: array-index-out-of-bounds in /build/linux-lCLmOi/linux-6.8.0/kernel/bpf/lpm_trie.c:194:14
[ 17.868332] index 8 is out of range for type '__u8 [*]'
[ 17.868350] CPU: 18 PID: 1035 Comm: tokio-runtime-w Not tainted 6.8.0-36-generic #36-Ubuntu
[ 17.868353] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 17.868355] Call Trace:
[ 17.868361] <TASK>
[ 17.868366] dump_stack_lvl+0x48/0x70
[ 17.868377] dump_stack+0x10/0x20
[ 17.868379] __ubsan_handle_out_of_bounds+0xc6/0x110
[ 17.868384] longest_prefix_match.isra.0+0x1c9/0x210
[ 17.868391] trie_update_elem+0x11c/0x2b0
[ 17.868394] bpf_map_update_value+0xfc/0x260
[ 17.868398] map_update_elem+0x1fa/0x360
[ 17.868401] __sys_bpf+0x54c/0xa10
[ 17.868406] __x64_sys_bpf+0x1a/0x30
[ 17.868408] x64_sys_call+0x1936/0x25c0
[ 17.868411] do_syscall_64+0x7f/0x180
[ 17.868417] ? do_syscall_64+0x8c/0x180
[ 17.868420] ? do_syscall_64+0x8c/0x180
[ 17.868423] ? irqentry_exit_to_user_mode+0x7b/0x260
[ 17.868453] ? irqentry_exit+0x43/0x50
[ 17.868455] ? exc_page_fault+0x94/0x1b0
[ 17.868457] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 17.868463] RIP: 0033:0x79055032725d
[ 17.868485] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48
[ 17.868487] RSP: 002b:000079053b5fc4c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
[ 17.868490] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 000079055032725d
[ 17.868492] RDX: 0000000000000020 RSI: 000079053b5fc4d0 RDI: 0000000000000002
[ 17.868493] RBP: 000079053b5fdb10 R08: 0000790523e0e240 R09: 0000000000000040
[ 17.868495] R10: 000079054ff848c0 R11: 0000000000000246 R12: 0000000000000000
[ 17.868496] R13: 000079053b5fc6c1 R14: 0000000000000019 R15: 001a000700000080
[ 17.868499] </TASK>
[ 17.868500] ---[ end trace ]---
[ 94.536187] hrtimer: interrupt took 877 ns
[ 215.693776] show_signal_msg: 95 callbacks suppressed
[ 215.693781] systemd-network[1548]: segfault at 7ffe94f97ff8 ip 00007e7ede9c40a8 sp 00007ffe94f98000 error 6 in libsystemd-shared-255.so[7e7ede86a000+247000] likely on CPU 5 (core 5, socket 0)
[ 215.693801] Code: 35 ce eb 0f 00 48 8d 3d 8e d2 12 00 e8 41 64 00 00 e8 ac a2 ea ff e8 17 fd ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 55 48 89 e5 <41> 55 49 89 d5 41 54 49 89 cc 53 48 89 fb 48 83 ec 08 e8 51 77 ff
[ 532.122814] systemd-network[1745]: segfault at 7ffe69fcfff8 ip 00007305379c40a8 sp 00007ffe69fd0000 error 6 in libsystemd-shared-255.so[73053786a000+247000] likely on CPU 5 (core 5, socket 0)
[ 532.122835] Code: 35 ce eb 0f 00 48 8d 3d 8e d2 12 00 e8 41 64 00 00 e8 ac a2 ea ff e8 17 fd ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 55 48 89 e5 <41> 55 49 89 d5 41 54 49 89 cc 53 48 89 fb 48 83 ec 08 e8 51 77 ff
If I pass network traffic through LibreQoS it doesn't work. 80% packet loss.
Ubuntu 24.04 LTS updated to newest package version available.
The ubsan part isn't us, so I'm not going to try and fix that. The duration subtraction definitely needs fixing (it typically only happens if you booted quickly; for whatever reason the system clock gives some odd results for the first second). I've already implemented a fix in a couple of places, and recently found a better one - so I'll get this for the next beta.
I've added a commit that should fix the timing portion of this (there's not a lot we can do about errors polling devices that aren't up - report and move on is probably the best we can do there).
We do warn on NICs not being up on startup, but when we tested having it as an error it caused too many problems (mostly relating to setups that bring up NICs later).
Found: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2046569
Related also: https://lore.kernel.org/lkml/202402211347.2AF2EC4621@keescook/t/
So mainly we are waiting for commit that resolve this bug on ubuntu 24.04, till this time we are stuck at older LTS.
If You somehow start sucessfully lqos on ubuntu 24.04 just let me know
I'm really confused here. It's running in my 24.04 VM just fine, live on 24.04 in at least 2 places that I know of? And you're linking to bugs in a different piece of software?
We don't use lpm_trie.c ?
I will do next try from scratch in next few days on testbed.
Jul 11 15:29:57 libreqos-beta lqosd[940]: [2024-07-11T13:29:57Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(46), ...)
Jul 11 15:29:58 libreqos-beta lqosd[940]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 11 15:29:58 libreqos-beta lqosd[940]: overflow when subtracting durations
Jul 11 15:29:58 libreqos-beta lqosd[940]: [2024-07-11T13:29:58Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(49), ...)
Jul 11 15:29:59 libreqos-beta lqosd[940]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 11 15:29:59 libreqos-beta lqosd[940]: overflow when subtracting durations
Jul 11 15:29:59 libreqos-beta lqosd[940]: [2024-07-11T13:29:59Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(52), ...)
Jul 11 15:30:00 libreqos-beta lqosd[940]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 11 15:30:00 libreqos-beta lqosd[940]: overflow when subtracting durations
Jul 11 15:30:00 libreqos-beta lqosd[940]: [2024-07-11T13:30:00Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(55), ...)
root@libreqos-beta:~# dpkg -l libreqos
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============-======================-============-===================================
ii libreqos 1.5-BETA1.202407111152 amd64 CAKE-based traffic shaping for ISPs
Package was compiled by me using develop branch + commits from interduo/patch-33.
Tried on last state of develop with interduo/patch-33 fixes.
Jul 12 09:04:44 libreqos-beta systemd[1]: Started lqosd.service.
Jul 12 09:04:45 libreqos-beta lqosd[937]: thread '<unnamed>' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:45 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:45 libreqos-beta lqosd[937]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Jul 12 09:04:45 libreqos-beta lqosd[937]: ["192.168.66.0", "24"]
Jul 12 09:04:45 libreqos-beta lqosd[937]: [2024-07-12T07:04:45Z WARN lqos_bus::bus::unix_socket_server] Listening on: /run/lqos/bus
Jul 12 09:04:45 libreqos-beta lqosd[937]: ::ffff:192.168.66.0 120
Jul 12 09:04:45 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:45 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:45 libreqos-beta lqosd[937]: [2024-07-12T07:04:45Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(34), ...)
Jul 12 09:04:45 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:45 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:45 libreqos-beta lqosd[937]: [2024-07-12T07:04:45Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(36), ...)
Jul 12 09:04:46 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:46 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:46 libreqos-beta lqosd[937]: [2024-07-12T07:04:46Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(39), ...)
Jul 12 09:04:47 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:47 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:47 libreqos-beta lqosd[937]: [2024-07-12T07:04:47Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(42), ...)
Jul 12 09:04:48 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:48 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:48 libreqos-beta lqosd[937]: [2024-07-12T07:04:48Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(46), ...)
Jul 12 09:04:49 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:49 libreqos-beta lqosd[937]: overflow when subtracting durations
How to reproduce warnings: Just disconnect media (from proxmox, checkbox disconnected) / plug out cable. Start VM. Run lqosd.
On the current develop
branch, I deleted my veth
devices while lqosd
was running - no duration crash. Same with setting the veth
devices to down
. It spits out a bunch of warnings (since it's really not designed to be trying to poll data from dead devices) - but nothing crashes?
I will do next round of test on Friday and give You access if fails. Warnings are not needed here (when RUST_BACKTRACE is NOT 1).
Checked my case more deeply: My suggestion: if devices from /etc/lqos.conf (variables: to_internet & to_network) are not existing in system we should rise an error and ommit initializing/reloading. Probably this would stop that kind of warnings.
Before pushing network traffic (clear instance, interfaces not up) I see some errors in journalctl:
journalctl -u lqosd shows me: