LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0
416 stars 47 forks source link

v1.5-beta: lqosd: overflow when subtracting durations #511

Open interduo opened 3 months ago

interduo commented 3 months ago

Before pushing network traffic (clear instance, interfaces not up) I see some errors in journalctl:

journalctl -u lqosd shows me:

Jul 03 09:29:19 libreqos-beta lqosd[988]: overflow when subtracting durations
Jul 03 09:29:19 libreqos-beta lqosd[988]: [2024-07-03T07:29:19Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(96), ...)
Jul 03 09:29:20 libreqos-beta lqosd[988]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 03 09:29:20 libreqos-beta lqosd[988]: overflow when subtracting durations
Jul 03 09:29:20 libreqos-beta lqosd[988]: [2024-07-03T07:29:20Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(99), ...)
Jul 03 09:29:21 libreqos-beta lqosd[988]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 03 09:29:21 libreqos-beta lqosd[988]: overflow when subtracting durations
Jul 03 09:29:21 libreqos-beta lqosd[988]: [2024-07-03T07:29:21Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(102), ...)
Jul 03 09:29:22 libreqos-beta lqosd[988]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 03 09:29:22 libreqos-beta lqosd[988]: overflow when subtracting durations
Jul 03 09:29:22 libreqos-beta lqosd[988]: [2024-07-03T07:29:22Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(105), ...)
Jul 03 09:29:23 libreqos-beta lqosd[988]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 03 09:29:23 libreqos-beta lqosd[988]: overflow when subtracting durations
Jul 03 09:29:23 libreqos-beta lqosd[988]: [2024-07-03T07:29:23Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(108), ...)
Jul 03 09:29:24 libreqos-beta lqosd[988]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 03 09:29:24 libreqos-beta lqosd[988]: overflow when subtracting durations
Jul 03 09:29:24 libreqos-beta lqosd[988]: [2024-07-03T07:29:24Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(111), ...)
Jul 03 09:29:25 libreqos-beta lqosd[988]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 03 09:29:25 libreqos-beta lqosd[988]: overflow when subtracting durations
Jul 03 09:29:25 libreqos-beta lqosd[988]: [2024-07-03T07:29:25Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(114), ...)
Jul 03 09:29:26 libreqos-beta lqosd[988]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 03 09:29:26 libreqos-beta lqosd[988]: overflow when subtracting durations
Jul 03 09:29:26 libreqos-beta lqosd[988]: [2024-07-03T07:29:26Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(117), ...)
thebracket commented 3 months ago

Are you continuing to see this? I've poked through logs on 3 different deployments, and not seen it. I also don't have any unconfigured systems, though - so maybe that's related?

It was a bug a while back, but I thought we'd caught all of those.

interduo commented 3 months ago

I will check that again building from scratch tommorow. It would be nice if You would merge my PRs. It repairs some other small bugs.

If You would like i can give You access on my testbed.

thebracket commented 3 months ago

It would be nice if You would merge my PRs.

I'll go through them when I'm back from my travels.

interduo commented 2 months ago

After doing: git clone -b develop git@github.com:LibreQoE/LibreQoS.git git remote add interduo git@github.com:interduo/LibreQoS.git git fetch interduo git cherry-pick 5e22b0866e843a683a6f32e13fe8f18556b7631c f443472b005fe5bc7372fbca6a7e4e910f42de07 425af46d887fa2a0c5a091b87a71a1221de3eabd ./build_dpkg.sh dpkg -i libreqos_1.5-BETA1.202407090816-1_amd64.deb

The problem is off, so I suspects that this is a problem of beta package libreqos_1.5.202406251802-1_amd64.deb only.

interduo commented 2 months ago

Ohhh the problem apperas only when boot when lqosd is starting for first time. Maybe lqosd tries starts too early?

Jul 09 11:47:52 libreqos-beta lqosd[976]: [2024-07-09T09:47:52Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(108), ...)
Jul 09 11:47:53 libreqos-beta lqosd[976]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 09 11:47:53 libreqos-beta lqosd[976]: overflow when subtracting durations
Jul 09 11:47:53 libreqos-beta lqosd[976]: [2024-07-09T09:47:53Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(111), ...)
Jul 09 11:47:54 libreqos-beta lqosd[976]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 09 11:47:54 libreqos-beta lqosd[976]: overflow when subtracting durations
Jul 09 11:47:54 libreqos-beta lqosd[976]: [2024-07-09T09:47:54Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(114), ...)
Jul 09 11:47:55 libreqos-beta lqosd[976]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 09 11:47:55 libreqos-beta lqosd[976]: overflow when subtracting durations
Jul 09 11:47:55 libreqos-beta lqosd[976]: [2024-07-09T09:47:55Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(117), ...)
Jul 09 11:48:44 libreqos-beta lqosd[976]: [2024-07-09T09:48:44Z WARN  lqosd] Terminating on SIGTERM

I had a bad iface name in netplan. The interface was not up. Maybe we should check if iface is up?

Also find out when booting: image

interduo commented 2 months ago

Ok there is more debug info in dmesg:

[    9.642699] mlx5_core 0000:00:10.0 ens16np0: Link up
[    9.838999] mlx5_core 0000:00:11.0 ens17np0: Link up
[   17.868271] ------------[ cut here ]------------
[   17.868279] UBSAN: array-index-out-of-bounds in /build/linux-lCLmOi/linux-6.8.0/kernel/bpf/lpm_trie.c:194:14
[   17.868332] index 8 is out of range for type '__u8 [*]'
[   17.868350] CPU: 18 PID: 1035 Comm: tokio-runtime-w Not tainted 6.8.0-36-generic #36-Ubuntu
[   17.868353] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   17.868355] Call Trace:
[   17.868361]  <TASK>
[   17.868366]  dump_stack_lvl+0x48/0x70
[   17.868377]  dump_stack+0x10/0x20
[   17.868379]  __ubsan_handle_out_of_bounds+0xc6/0x110
[   17.868384]  longest_prefix_match.isra.0+0x1c9/0x210
[   17.868391]  trie_update_elem+0x11c/0x2b0
[   17.868394]  bpf_map_update_value+0xfc/0x260
[   17.868398]  map_update_elem+0x1fa/0x360
[   17.868401]  __sys_bpf+0x54c/0xa10
[   17.868406]  __x64_sys_bpf+0x1a/0x30
[   17.868408]  x64_sys_call+0x1936/0x25c0
[   17.868411]  do_syscall_64+0x7f/0x180
[   17.868417]  ? do_syscall_64+0x8c/0x180
[   17.868420]  ? do_syscall_64+0x8c/0x180
[   17.868423]  ? irqentry_exit_to_user_mode+0x7b/0x260
[   17.868453]  ? irqentry_exit+0x43/0x50
[   17.868455]  ? exc_page_fault+0x94/0x1b0
[   17.868457]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   17.868463] RIP: 0033:0x79055032725d
[   17.868485] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48
[   17.868487] RSP: 002b:000079053b5fc4c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
[   17.868490] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 000079055032725d
[   17.868492] RDX: 0000000000000020 RSI: 000079053b5fc4d0 RDI: 0000000000000002
[   17.868493] RBP: 000079053b5fdb10 R08: 0000790523e0e240 R09: 0000000000000040
[   17.868495] R10: 000079054ff848c0 R11: 0000000000000246 R12: 0000000000000000
[   17.868496] R13: 000079053b5fc6c1 R14: 0000000000000019 R15: 001a000700000080
[   17.868499]  </TASK>
[   17.868500] ---[ end trace ]---
[   94.536187] hrtimer: interrupt took 877 ns
[  215.693776] show_signal_msg: 95 callbacks suppressed
[  215.693781] systemd-network[1548]: segfault at 7ffe94f97ff8 ip 00007e7ede9c40a8 sp 00007ffe94f98000 error 6 in libsystemd-shared-255.so[7e7ede86a000+247000] likely on CPU 5 (core 5, socket 0)
[  215.693801] Code: 35 ce eb 0f 00 48 8d 3d 8e d2 12 00 e8 41 64 00 00 e8 ac a2 ea ff e8 17 fd ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 55 48 89 e5 <41> 55 49 89 d5 41 54 49 89 cc 53 48 89 fb 48 83 ec 08 e8 51 77 ff
[  532.122814] systemd-network[1745]: segfault at 7ffe69fcfff8 ip 00007305379c40a8 sp 00007ffe69fd0000 error 6 in libsystemd-shared-255.so[73053786a000+247000] likely on CPU 5 (core 5, socket 0)
[  532.122835] Code: 35 ce eb 0f 00 48 8d 3d 8e d2 12 00 e8 41 64 00 00 e8 ac a2 ea ff e8 17 fd ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 55 48 89 e5 <41> 55 49 89 d5 41 54 49 89 cc 53 48 89 fb 48 83 ec 08 e8 51 77 ff

If I pass network traffic through LibreQoS it doesn't work. 80% packet loss.

Ubuntu 24.04 LTS updated to newest package version available.

interduo commented 2 months ago

Found: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2046569

thebracket commented 2 months ago

The ubsan part isn't us, so I'm not going to try and fix that. The duration subtraction definitely needs fixing (it typically only happens if you booted quickly; for whatever reason the system clock gives some odd results for the first second). I've already implemented a fix in a couple of places, and recently found a better one - so I'll get this for the next beta.

thebracket commented 2 months ago

I've added a commit that should fix the timing portion of this (there's not a lot we can do about errors polling devices that aren't up - report and move on is probably the best we can do there).

We do warn on NICs not being up on startup, but when we tested having it as an error it caused too many problems (mostly relating to setups that bring up NICs later).

interduo commented 2 months ago

Found: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2046569

Related also: https://lore.kernel.org/lkml/202402211347.2AF2EC4621@keescook/t/

So mainly we are waiting for commit that resolve this bug on ubuntu 24.04, till this time we are stuck at older LTS.

interduo commented 2 months ago

If You somehow start sucessfully lqos on ubuntu 24.04 just let me know

thebracket commented 2 months ago

I'm really confused here. It's running in my 24.04 VM just fine, live on 24.04 in at least 2 places that I know of? And you're linking to bugs in a different piece of software?

interduo commented 2 months ago

We don't use lpm_trie.c ?

interduo commented 2 months ago

I will do next try from scratch in next few days on testbed.

interduo commented 2 months ago
Jul 11 15:29:57 libreqos-beta lqosd[940]: [2024-07-11T13:29:57Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(46), ...)
Jul 11 15:29:58 libreqos-beta lqosd[940]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 11 15:29:58 libreqos-beta lqosd[940]: overflow when subtracting durations
Jul 11 15:29:58 libreqos-beta lqosd[940]: [2024-07-11T13:29:58Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(49), ...)
Jul 11 15:29:59 libreqos-beta lqosd[940]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 11 15:29:59 libreqos-beta lqosd[940]: overflow when subtracting durations
Jul 11 15:29:59 libreqos-beta lqosd[940]: [2024-07-11T13:29:59Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(52), ...)
Jul 11 15:30:00 libreqos-beta lqosd[940]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 11 15:30:00 libreqos-beta lqosd[940]: overflow when subtracting durations
Jul 11 15:30:00 libreqos-beta lqosd[940]: [2024-07-11T13:30:00Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(55), ...)
root@libreqos-beta:~# dpkg -l libreqos
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version                Architecture Description
+++-==============-======================-============-===================================
ii  libreqos       1.5-BETA1.202407111152 amd64        CAKE-based traffic shaping for ISPs

Package was compiled by me using develop branch + commits from interduo/patch-33.

interduo commented 2 months ago

Tried on last state of develop with interduo/patch-33 fixes.

Jul 12 09:04:44 libreqos-beta systemd[1]: Started lqosd.service.
Jul 12 09:04:45 libreqos-beta lqosd[937]: thread '<unnamed>' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:45 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:45 libreqos-beta lqosd[937]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Jul 12 09:04:45 libreqos-beta lqosd[937]: ["192.168.66.0", "24"]
Jul 12 09:04:45 libreqos-beta lqosd[937]: [2024-07-12T07:04:45Z WARN  lqos_bus::bus::unix_socket_server] Listening on: /run/lqos/bus
Jul 12 09:04:45 libreqos-beta lqosd[937]: ::ffff:192.168.66.0 120
Jul 12 09:04:45 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:45 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:45 libreqos-beta lqosd[937]: [2024-07-12T07:04:45Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(34), ...)
Jul 12 09:04:45 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:45 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:45 libreqos-beta lqosd[937]: [2024-07-12T07:04:45Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(36), ...)
Jul 12 09:04:46 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:46 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:46 libreqos-beta lqosd[937]: [2024-07-12T07:04:46Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(39), ...)
Jul 12 09:04:47 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:47 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:47 libreqos-beta lqosd[937]: [2024-07-12T07:04:47Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(42), ...)
Jul 12 09:04:48 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:48 libreqos-beta lqosd[937]: overflow when subtracting durations
Jul 12 09:04:48 libreqos-beta lqosd[937]: [2024-07-12T07:04:48Z ERROR lqosd::throughput_tracker] Error polling network. JoinError::Panic(Id(46), ...)
Jul 12 09:04:49 libreqos-beta lqosd[937]: thread 'tokio-runtime-worker' panicked at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/time.rs:1126:31:
Jul 12 09:04:49 libreqos-beta lqosd[937]: overflow when subtracting durations
interduo commented 2 months ago

How to reproduce warnings: Just disconnect media (from proxmox, checkbox disconnected) / plug out cable. Start VM. Run lqosd.

thebracket commented 2 months ago

On the current develop branch, I deleted my veth devices while lqosd was running - no duration crash. Same with setting the veth devices to down. It spits out a bunch of warnings (since it's really not designed to be trying to poll data from dead devices) - but nothing crashes?

interduo commented 2 months ago

I will do next round of test on Friday and give You access if fails. Warnings are not needed here (when RUST_BACKTRACE is NOT 1).

interduo commented 2 months ago

Checked my case more deeply: My suggestion: if devices from /etc/lqos.conf (variables: to_internet & to_network) are not existing in system we should rise an error and ommit initializing/reloading. Probably this would stop that kind of warnings.