freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io

respondd suddenly not running anymore #863

Closed: rotanid closed this issue 7 years ago

rotanid commented 7 years ago

We observed some cases where respondd suddenly wasn't running anymore. On the nodes where we have remote access, there doesn't seem to be any log entry in dmesg or logread regarding its stop or crash.
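
For context, this is roughly how we looked for a crash trace on the affected nodes (standard OpenWrt commands, nothing respondd-specific):

```sh
# Look for any trace of respondd stopping or crashing
# in the kernel log and the system log:
dmesg | grep respondd
logread | grep respondd
```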

If you have remote access, the issue can be fixed by simply starting gluon-respondd again.
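
On a node with SSH access, that amounts to (assuming the usual init script name for the gluon-respondd package):

```sh
# Start the daemon again via its init script:
/etc/init.d/gluon-respondd start
```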

How do you know that respondd is down without remote access: the node disappears from the map/monitoring even though it is still online, since respondd is what answers the status queries.

The changes in https://github.com/freifunk-gluon/packages/pull/143 regarding the init file may work around this issue, as the new init file contains the "respawn" option.
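
For anyone unfamiliar with that option: in a procd-style OpenWrt init script, respawn looks roughly like this (a minimal sketch; command path and START value are illustrative, not taken from the actual PR):

```sh
#!/bin/sh /etc/rc.common
# Minimal procd init script sketch showing the "respawn" option.

USE_PROCD=1
START=50

start_service() {
	procd_open_instance
	procd_set_param command /usr/bin/respondd
	# respawn: restart the daemon if it dies
	# (3600s threshold, 5s delay, give up after 5 rapid failures)
	procd_set_param respawn 3600 5 5
	procd_close_instance
}
```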

rotanid commented 7 years ago

We created a workaround package to make sure respondd doesn't stay down: https://github.com/tecff/gluon-packages/tree/master/tecff-respondd-watchdog
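
The idea is simple; as a sketch (not the actual package code), a cron job along these lines does the trick:

```sh
#!/bin/sh
# Watchdog sketch: restart respondd when no process is found.
# Meant to be run from cron every few minutes, e.g.:
#   */5 * * * * /usr/sbin/respondd-watchdog   (path is illustrative)

if ! pidof respondd >/dev/null; then
	logger -t respondd-watchdog "respondd not running, starting it again"
	/etc/init.d/gluon-respondd start
fi
```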

AKA-47 commented 7 years ago

I ran into the same problem today with 2016.2 but was not able to debug it. I will try your package.

jplitza commented 7 years ago

#972 improved the initscript so that stderr is logged and respondd is automatically restarted. It would be great if anybody experiencing unexpected stops could try that new initscript and look out for messages in logread.
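
For anyone curious what that change amounts to in a procd initscript, it is essentially these two parameters (a sketch; the actual diff in #972 may differ):

```sh
# inside start_service() of the procd init script:
procd_set_param stderr 1  # forward the daemon's stderr to the system log
procd_set_param respawn   # let procd restart respondd when it exits
```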

rotanid commented 7 years ago

Not easy to test: I will have to backport it to v2016.2.x, remove my workaround, deploy it to the experimental-branch nodes and then wait at least a week (the longer the better) to see whether one of them hits the issue again. As a late reply to @AKA-47: in our network it appears with v2016.1.x too (assuming you're talking about the Gluon version number, not batman-adv).

rotanid commented 7 years ago

@jplitza I backported your commit d8bb97831b197c543d9727f1c81539bb6fba127a together with ca57cdfe77b73351708f3b18235f508a42bccac9 (which made porting both very easy) to v2016.2.x and removed my cronjob/watchdog. It's running on ~60 nodes at the moment. We'll have to wait a few weeks to make a reasonably strong statement about the effect.
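
For reference, the backport boils down to something like this (commit IDs from this thread; branch name and pick order assumed):

```sh
# In a checkout of the gluon repository:
git checkout v2016.2.x
# Cherry-pick the two initscript commits (order assumed here):
git cherry-pick ca57cdfe77b73351708f3b18235f508a42bccac9
git cherry-pick d8bb97831b197c543d9727f1c81539bb6fba127a
```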

kpanic23 commented 7 years ago

By the way, the same seems to happen with alfred. (Yes, I know, we're still using alfred...)

rotanid commented 7 years ago

@jplitza so far (3 weeks) it hasn't appeared again on these ~60 nodes running the v2016.2.x branch patched with the two mentioned commits. Maybe backport this, @NeoRaider?

neocturne commented 7 years ago

I backported the relevant patches to v2016.2.x.

ghost commented 7 years ago

@kpanic23 does the backport of @NeoRaider fix the issue with alfred, too? Or could you open a separate issue for that?

kpanic23 commented 7 years ago

I don't know, I have not built new images yet. In our current stable image there is a custom package restarting alfred if it crashes. And, to be frank and honest: we have taken that problem as a reason to start switching to hopglass, and will drop alfred support completely in the next images: https://map.ff3l.net/hopglass

jplitza commented 7 years ago

As this issue seems to be fixed (in both master and v2016.2.x) and alfred is another issue, I'm closing this.