rotanid closed this issue 7 years ago
we created a workaround-package to make sure respondd doesn't stay not-running: https://github.com/tecff/gluon-packages/tree/master/tecff-respondd-watchdog
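For readers who cannot use the package, here is a minimal sketch of what such a watchdog could look like. It is not the code from tecff-respondd-watchdog. The process name `respondd`, the init script path `/etc/init.d/gluon-respondd`, and the `run` argument are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical watchdog sketch, NOT the tecff-respondd-watchdog code.
# Assumptions: process name "respondd", init script /etc/init.d/gluon-respondd.

PROC_NAME="${PROC_NAME:-respondd}"
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/gluon-respondd}"

# Succeeds if a process with the given name exists.
is_running() {
    pidof "$1" >/dev/null 2>&1
}

# Writes a note to the system log if logger is available.
log_msg() {
    if command -v logger >/dev/null 2>&1; then
        logger -t respondd-watchdog "$1" 2>/dev/null || true
    fi
}

# Starts the daemon again only when it is not running.
check_and_restart() {
    if is_running "$PROC_NAME"; then
        return 0
    fi
    log_msg "$PROC_NAME not running, starting it"
    "$INIT_SCRIPT" start
}

# Only act when called with "run", so the functions can also be sourced.
case "${1:-}" in
    run) check_and_restart ;;
esac
```

Saved under a path of your choice and made executable, it could be called from cron every few minutes with the `run` argument. Writing to the log also leaves a trace of how often the daemon actually disappears, which the nodes otherwise do not record.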
I found the same problem today with 2016.2 but was not able to debug it. Will try your package
not easy to test, i will have to backport it to v2016.2.x, remove my workaround, deploy it to the experimental-branch nodes and wait at least a week, the longer the better, to see whether one of them has the issue again. as a late comment to @AKA-47 , in our network it appears with v2016.1.x, too (assuming you're talking about the gluon version number, not batman-adv)
@jplitza i backported your commit d8bb97831b197c543d9727f1c81539bb6fba127a together with ca57cdfe77b73351708f3b18235f508a42bccac9 (made it very easy, porting both) to v2016.2.x and removed my cronjob/watchdog. running it on ~60 nodes at the moment. we'll have to wait a few weeks to make a somewhat strong statement about the effect.
By the way, the same seems to happen with alfred. (Yes, I know, we're still using alfred...)
@jplitza so far (3 weeks) it didn't appear again on these ~60 nodes running with the v2016.2.x branch patched with the two mentioned commits. maybe backporting this @NeoRaider ?
I backported the relevant patches to v2016.2.x.
@kpanic23 does the backport of @NeoRaider fix the issue with alfred, too? Or could you open a separate issue for that?
I don't know. I have not built new images yet. In our current stable image, there is a custom package restarting alfred if it crashes. And, to be frank: we have taken that problem as a reason to begin switching to hopglass and will drop alfred support completely in the next images: https://map.ff3l.net/hopglass
As this issue seems to be fixed (in both master and v2016.2.x) and alfred is another issue, I'm closing this.
we observed some cases where respondd suddenly wasn't running anymore. on the nodes where we have remote access, there doesn't seem to be any log entry regarding its stop or crash in dmesg or logread.
if you have remote access, the issue can be fixed by simply starting gluon-respondd again.
how do you know that respondd is down without remote access:
the changes in https://github.com/freifunk-gluon/packages/pull/143 regarding the init file may work around this issue, as the new init file contains the "respawn" option.
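For context, this is roughly how a procd-style init file with respawn looks on OpenWrt. It is a sketch only and not the file from the pull request. The binary path `/usr/bin/respondd`, the `START` value, and the respawn numbers are placeholders, and the real daemon needs arguments that are left out here.

```shell
#!/bin/sh /etc/rc.common
# Sketch of a procd init script with respawn, NOT the file from PR 143.
# Assumptions: binary path, START priority, and respawn values are placeholders.

USE_PROCD=1
START=50

start_service() {
    procd_open_instance
    # Placeholder command; the real daemon arguments are omitted.
    procd_set_param command /usr/bin/respondd
    # respawn <threshold> <timeout> <retry>: wait 5 s before restarting,
    # give up after 5 crashes that each occur within 3600 s of startup.
    procd_set_param respawn 3600 5 5
    procd_close_instance
}
```

With respawn set, procd itself starts the daemon again after it exits, so no separate cronjob or watchdog is needed for that case.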