LibreQoE / LibreQoS

A Quality of Experience and Smart Queue Management system for ISPs. Leverage CAKE to improve network responsiveness, enforce bandwidth plans, and reduce bufferbloat.
https://libreqos.io/
GNU General Public License v2.0

v1.5-beta: problems with starting services in proper order/time/delay #521

Open interduo opened 2 months ago

interduo commented 2 months ago

After a reboot there is always this problem:

Jul 12 09:04:44 libreqos-beta systemd[1]: Started lqos_node_manager.service.
Jul 12 09:04:44 libreqos-beta lqos_node_manager[938]: Rocket has launched from http://[::]:9123
Jul 12 09:04:44 libreqos-beta lqos_node_manager[938]: Error: Unable to access /run/lqos/bus. Check that lqosd is running and you have appropriate permissions.

After systemctl restart lqos_node_manager everything starts fine.

Solution 1 (temporary): add ExecStartPre=/bin/sleep 10 to the systemd service unit file

cat /etc/systemd/system/lqos_node_manager.service

[Unit]
After=network.service lqosd.service
Requires=lqosd.service

[Service]
WorkingDirectory=/opt/libreqos/src/bin
ExecStartPre=/bin/sleep 10
ExecStart=/opt/libreqos/src/bin/lqos_node_manager
Restart=always
#Turn on debugging for service
#Environment=RUST_LOG=info

[Install]
WantedBy=default.target

Solution 2 (the proper fix): use the notify service type

  1. Set Type=notify in the lqosd service unit.
  2. Have lqosd send a readiness message (via sd_notify) once it has fully started.

What do you think about it?
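
For illustration, the readiness message in Solution 2 amounts to sending the string READY=1 as a datagram to the Unix socket named in the NOTIFY_SOCKET environment variable, which systemd sets for units with Type=notify. lqosd is written in Rust, so the Python below is only a sketch of the protocol, not lqosd code; `sd_notify` here is a local helper defined in the snippet, not an import from any library.

```python
import os
import socket


def sd_notify(state: str = "READY=1") -> bool:
    """Send a state string to the socket systemd passes in NOTIFY_SOCKET.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was not
    started from a Type=notify unit.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract-namespace socket,
    # which is addressed with a leading NUL byte instead.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

The daemon would make the equivalent call only after /run/lqos/bus is listening. With Type=notify, units ordered After=lqosd.service are held back until that message arrives, so no fixed sleep is needed.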

interduo commented 2 months ago

This is also a problem for lqos_scheduler:

What I did:

journalctl --vacuum-time=15min --rotate
reboot
journalctl -u lqos_scheduler
-- Boot b0a56b4b200144f3802715e47c588a83 --
Jul 12 09:24:31 libreqos-beta systemd[1]: Starting lqos_scheduler.service...
Jul 12 09:24:31 libreqos-beta python3[943]: thread '<unnamed>' panicked at lqos_python/src/lib.rs:269:70:
Jul 12 09:24:31 libreqos-beta python3[943]: called `Result::unwrap()` on an `Err` value: Socket (typically /run/lqos/bus) not found. Check that lqosd is running, and you have permi>
Jul 12 09:24:31 libreqos-beta python3[943]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Jul 12 09:24:31 libreqos-beta python3[943]: Running Python Version 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0]
Jul 12 09:24:31 libreqos-beta python3[943]: refreshShapers starting at 12/07/2024 09:24:31
Jul 12 09:24:31 libreqos-beta python3[943]: First time run since system boot.
Jul 12 09:24:31 libreqos-beta python3[943]: Validating input files 'ShapedDevices.csv' and 'network.json'
Jul 12 09:24:33 libreqos-beta python3[943]: Traceback (most recent call last):
Jul 12 09:24:33 libreqos-beta python3[943]:   File "/opt/libreqos/src/scheduler.py", line 69, in <module>
Jul 12 09:24:33 libreqos-beta python3[943]:     importAndShapeFullReload()
Jul 12 09:24:33 libreqos-beta python3[943]:   File "/opt/libreqos/src/scheduler.py", line 62, in importAndShapeFullReload
Jul 12 09:24:33 libreqos-beta python3[943]:     refreshShapers()
Jul 12 09:24:33 libreqos-beta python3[943]:   File "/opt/libreqos/src/LibreQoS.py", line 448, in refreshShapers
Jul 12 09:24:33 libreqos-beta python3[943]:     if (validateNetworkAndDevices() == True):
Jul 12 09:24:33 libreqos-beta python3[943]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 12 09:24:33 libreqos-beta python3[943]:   File "/opt/libreqos/src/LibreQoS.py", line 130, in validateNetworkAndDevices
Jul 12 09:24:33 libreqos-beta python3[943]:     rustValid = validate_shaped_devices()
Jul 12 09:24:33 libreqos-beta python3[943]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 12 09:24:33 libreqos-beta python3[943]: pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Socket (typically /run/lqos/bus) not found. Check that lqosd i>
Jul 12 09:24:33 libreqos-beta systemd[1]: lqos_scheduler.service: Main process exited, code=exited, status=1/FAILURE
Jul 12 09:24:41 libreqos-beta systemd[1]: lqos_scheduler.service: Failed with result 'exit-code'.
Jul 12 09:24:41 libreqos-beta systemd[1]: Failed to start lqos_scheduler.service.
Jul 12 09:24:41 libreqos-beta systemd[1]: lqos_scheduler.service: Consumed 2.281s CPU time.
Jul 12 09:24:41 libreqos-beta systemd[1]: lqos_scheduler.service: Scheduled restart job, restart counter is at 1.
Jul 12 09:24:41 libreqos-beta systemd[1]: Starting lqos_scheduler.service...

Setting ExecStartPre=/bin/sleep 60 in lqos_scheduler.service works around that.

interduo commented 2 months ago

Temporary solution: https://github.com/LibreQoE/LibreQoS/pull/522/

It doesn't require implementing anything in lqosd.

thebracket commented 2 months ago

The good news is that with UI2, there's no more rocket or separate node_manager daemon - so the Rocket side of things is going away. The scheduler needs to do an "is lqosd running? If not, delay" check - that should be easy enough.
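
A minimal sketch of such an "is lqosd running? If not, delay" check, assuming that the presence of the bus socket at /run/lqos/bus (the path from the log messages above) is a good enough signal. The function name `wait_for_bus` and the timeout and interval defaults are hypothetical, not part of scheduler.py.

```python
import os
import stat
import time


def wait_for_bus(path: str = "/run/lqos/bus",
                 timeout: float = 60.0,
                 interval: float = 1.0) -> bool:
    """Poll until `path` exists and is a Unix socket, or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True
        except OSError:
            # Missing or not yet accessible: treat as "not ready" and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The scheduler could call this before its first refreshShapers() and exit with a clear message if it returns False. Note that a socket file left over from a crashed lqosd would also pass this check, so it shows the delay logic rather than a full liveness probe.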

interduo commented 2 months ago

Well, I think this should be done at the systemd level. It was created for this, too.

interduo commented 3 days ago

OK, the situation now is this: the scheduler does not start because lqosd is not running; lqosd does not start because the QSFP+ link is not up (sometimes it takes a few seconds to negotiate the connection); the scheduler then gives up and logs an error in dmesg that it could not be started. I manually started lqosd and then the scheduler.

lqos_scheduler should check the link state before checking lqosd (?). If the interfaces are not up, it should just sleep for a while and check again.
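
A rough sketch of that link-state check, assuming the interface state is read from the Linux sysfs file /sys/class/net/<iface>/operstate. The helper names, the defaults, and the interface names in the usage note are hypothetical.

```python
import time
from pathlib import Path


def link_is_up(iface: str, sysfs_root="/sys/class/net") -> bool:
    """Return True if the kernel reports operstate 'up' for `iface`."""
    try:
        state = (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        # Interface not present (yet) or not readable.
        return False
    return state == "up"


def wait_for_links(ifaces, timeout: float = 30.0, interval: float = 1.0,
                   sysfs_root="/sys/class/net") -> bool:
    """Sleep and re-check until every interface is up or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        if all(link_is_up(i, sysfs_root) for i in ifaces):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_links(["ens1f0", "ens1f1"])` would run before the lqosd check, with the two names replaced by the shaping interfaces from the local configuration. Some drivers and virtual interfaces report "unknown" rather than "up", so the comparison may need loosening on such systems.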