freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io
Other
544 stars 325 forks source link

[v2023.1.1] Tunneldigger causes high CPU Load when enabled but no WAN connection available #3024

Open dariks opened 11 months ago

dariks commented 11 months ago

Bug report

What is the problem? With enabled Tunneldigger but no WAN connection (e.g. Cable is disconnected on WAN) the node is experiencing high CPU Load Average (1.5-2.0) because of Tunneldigger's effort to restart and reconnect. This also causes super high flowrate of logs. At the moment we have recreated this issue on TP-Link Archer C7 v2, Archer C7 v5 and Xiaomi Mi 4a Gigabit, so this seems to be independent of the device.

Log output (this log absolutely SPAMS logread...):

Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: [gw05.ffnordhessen.net:20008] Reinitializing tunnel context.
Sun Oct 15 14:35:46 2023 daemon.err td-client: [gw05.ffnh.net:20008] Failed to connect to remote endpoint - check WAN connectivity!
Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: [gw05.ffnh.net:20008] Failed to connect to remote endpoint - check WAN connectivity!
Sun Oct 15 14:35:46 2023 daemon.err td-client: [gw05.ffnordhessen.net:20008] Failed to connect to remote endpoint - check WAN connectivity!
Sun Oct 15 14:35:46 2023 daemon.info td-client: [gw05.ffnh.net:20008] Reinitializing tunnel context.
Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: [gw05.ffnordhessen.net:20008] Failed to connect to remote endpoint - check WAN connectivity!
Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: [gw05.ffnh.net:20008] Reinitializing tunnel context.
Sun Oct 15 14:35:46 2023 daemon.err td-client: No suitable brokers found. Retrying in 5 seconds
Sun Oct 15 14:35:46 2023 daemon.err tunneldigger[10258]: td-client: No suitable brokers found. Retrying in 5 seconds

Devices with said issues: https://map.freifunk-nordhessen.de/#!/de/map/60e32727c0fe https://map.freifunk-nordhessen.de/#!/de/map/8cdef9a91277

What is the expected behaviour? Tunneldigger should retry, but it could make sense to reduce the number of retries after X amount of tries (e.g. every 5 minutes after 10 fails).

Gluon Version: Gluon v2023.1.1 seems like Commit ID 7feebe96a6b93d567bd342ffd01abcb3022383c5

Site Configuration: https://gitlab.com/freifunk-nordhessen/ffnh-site

Custom patches: no custom patches applied to Tunneldigger and/or dnsmasq-wan.

dariks commented 11 months ago

Link to Screen capture of the issue: https://rippen-cloud.de/index.php/s/XK3RztbWmMWHQDN

Screenshot of this issue:

Bildschirmfoto 2023-10-15 um 15 42 18

Screenshot of Grafana Monitoring when this issue occurs: ffnh-firmware-dri-03-high-load-tunneldigger

neocturne commented 11 months ago

Thanks for the report.

Is tunneldigger retrying more frequently than every 5 seconds as printed in the logs, or does it cause such high load even with when only retrying every 5 seconds? What is the load average when you stop tunneldigger (/etc/init.d/tunneldigger stop?

dariks commented 11 months ago

Thank you for taking your time to create such an useful tool.

It doesn't seem to run more often than in the 5 second interval. It rather seems as if the "trying to connect - erroring out ("Failed to connect to remote endpoint - check WAN connectivity!") - Reinitializing tunnel context - trying to connect next gateway" process is running way too quick and causing the load (take a look at the attached link to see a video of the issue.). It seems as if there should be some kind of rate limit at which the Tunneldigger should connect. I can confirm running /etc/init.d/tunneldigger stop mitigates the high load average for a short period of time (Tunneldigger gets restartet automatically, probably by the Tunneldigger-Watchdog). The 1min load average dropped to 0.02 after 3 minutes of running with stopped Tunneldigger.