freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io
Other
553 stars 325 forks source link

[v2023.1.1] Tunneldigger causes high CPU Load when enabled but no WAN connection available #3024

Open dariks opened 1 year ago

dariks commented 1 year ago

Bug report

What is the problem? With enabled Tunneldigger but no WAN connection (e.g. Cable is disconnected on WAN) the node is experiencing high CPU Load Average (1.5-2.0) because of Tunneldigger's effort to restart and reconnect. This also causes super high flowrate of logs. At the moment we have recreated this issue on TP-Link Archer C7 v2, Archer C7 v5 and Xiaomi Mi 4a Gigabit, so this seems to be independent of the device.

Log output (this log absolutely SPAMS logread...):

Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: [gw05.ffnordhessen.net:20008] Reinitializing tunnel context.
Sun Oct 15 14:35:46 2023 daemon.err td-client: [gw05.ffnh.net:20008] Failed to connect to remote endpoint - check WAN connectivity!
Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: [gw05.ffnh.net:20008] Failed to connect to remote endpoint - check WAN connectivity!
Sun Oct 15 14:35:46 2023 daemon.err td-client: [gw05.ffnordhessen.net:20008] Failed to connect to remote endpoint - check WAN connectivity!
Sun Oct 15 14:35:46 2023 daemon.info td-client: [gw05.ffnh.net:20008] Reinitializing tunnel context.
Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: [gw05.ffnordhessen.net:20008] Failed to connect to remote endpoint - check WAN connectivity!
Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: [gw05.ffnh.net:20008] Reinitializing tunnel context.
Sun Oct 15 14:35:46 2023 daemon.err td-client: No suitable brokers found. Retrying in 5 seconds
Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: No suitable brokers found. Retrying in 5 seconds

Devices with said issues: https://map.freifunk-nordhessen.de/#!/de/map/60e32727c0fe https://map.freifunk-nordhessen.de/#!/de/map/8cdef9a91277

What is the expected behaviour? Tunneldigger should retry, but it could make sense to reduce the number of retries after X amount of tries (e.g. every 5 minutes after 10 fails).

Gluon Version: Gluon v2023.1.1 seems like Commit ID 7feebe96a6b93d567bd342ffd01abcb3022383c5

Site Configuration: https://gitlab.com/freifunk-nordhessen/ffnh-site

Custom patches: no custom patches applied to Tunneldigger and/or dnsmasq-wan.

dariks commented 1 year ago

Link to Screen capture of the issue: https://rippen-cloud.de/index.php/s/XK3RztbWmMWHQDN

Screenshot of this issue:

Bildschirmfoto 2023-10-15 um 15 42 18

Screenshot of Grafana Monitoring when this issue occurs: ffnh-firmware-dri-03-high-load-tunneldigger

neocturne commented 1 year ago

Thanks for the report.

Is tunneldigger retrying more frequently than every 5 seconds as printed in the logs, or does it cause such high load even with when only retrying every 5 seconds? What is the load average when you stop tunneldigger (/etc/init.d/tunneldigger stop?

dariks commented 1 year ago

Thank you for taking your time to create such an useful tool.

It doesn't seem to run more often than in the 5 second interval. It rather seems as if the "trying to connect - erroring out ("Failed to connect to remote endpoint - check WAN connectivity!") - Reinitializing tunnel context - trying to connect next gateway" process is running way too quick and causing the load (take a look at the attached link to see a video of the issue.). It seems as if there should be some kind of rate limit at which the Tunneldigger should connect. I can confirm running /etc/init.d/tunneldigger stop mitigates the high load average for a short period of time (Tunneldigger gets restartet automatically, probably by the Tunneldigger-Watchdog). The 1min load average dropped to 0.02 after 3 minutes of running with stopped Tunneldigger.