Closed: azrdev closed this issue 6 years ago.
The node (https://meshviewer.darmstadt.freifunk.net/#/en/map/10feed08eda6) is running with the latest changes from the master branch.
https://github.com/freifunk-gluon/gluon/commit/582d09615bdcd9d9b6cb5ee74173ab78af3d846d
I found several nodes with high CPU load (sys load > 95%) when mesh-on-LAN is active. The next-node page won't load, and sometimes the node crashes. After disabling the mesh interface (e.g. "ifconfig eth0 down") the problem disappears.
There is no suspicious flag in the "batctl tg" (translation table) output.
Gluon version: gluon-v2017.1.1+
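A side note on the workaround: "ifconfig eth0 down" is volatile and lost on reboot. Assuming the usual Gluon uci layout (the section name "network.mesh_lan" is an assumption and may differ between Gluon versions; check with "uci show network" first), mesh-on-LAN could be disabled persistently roughly like this:

```shell
# Assumption: Gluon exposes mesh-on-LAN as the uci section
# "network.mesh_lan" — verify with "uci show network" before using.
uci set network.mesh_lan.disabled=1
uci commit network
/etc/init.d/network restart   # or reboot the node
```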
@Sunz3r does this always occur when mesh on LAN is active, or only when there are also connections on the LAN interfaces (cable plugged in and/or other batman nodes to communicate with)?
@azrdev: I see a process called "autoupdater" and "10stop-network" in the provided log, so it seems the node crashed while trying to update?
Can you maybe reliably reproduce the crash when running /usr/sbin/autoupdater manually?
Also, the 842N v1/v2 seems to be one of those devices with 8MB of flash but still only 32MB of RAM, which could explain why this device type has issues when trying to update, compared to an 841N, for instance, which also has 32MB of RAM but only needs to store a 4MB image when updating.
@Sunz3r: Seems like a different issue. Maybe create a new ticket in the issue tracker here on Github?
Can you maybe reliably reproduce the crash when running /usr/sbin/autoupdater manually?
@T-X might be. If so, how would that help us / what should I provide?
@azrdev: A first interesting thing to find out would be whether the crash happens during or after downloading the image. Can you add some print/write statements to /usr/sbin/autoupdater, writing to /dev/kmsg, to output debug messages, so we know better at what point in the update process the out-of-memory occurs?
If it were possible for you to reproduce the issue reliably, then I think it might also make sense to add some patches to increase the verbosity of the out-of-memory trace. For instance, more detailed information on what is using how much memory, not just in userspace but also in kernel space, would be very interesting. Not sure, but maybe it'd be possible to compile an ar71xx image with CONFIG_KERNEL_SLABINFO=y and dump /proc/slabinfo from within the OOM panic handler, too.
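Even without patching the OOM handler, the kind of kernel-memory snapshot described above could be sampled periodically from userspace, so the last sample before a crash hints at what grew. A sketch (the helper name "mem_snapshot" is made up; /proc/slabinfo is only present with CONFIG_KERNEL_SLABINFO=y and usually only readable by root):

```shell
# Sketch: dump the most relevant memory counters, plus the five
# largest slab caches if /proc/slabinfo is available.
mem_snapshot() {
    grep -E '^(MemTotal|MemFree|Buffers|Cached):' /proc/meminfo
    # /proc/slabinfo: 2 header lines, then "name active_objs num_objs ..."
    if [ -r /proc/slabinfo ]; then
        tail -n +3 /proc/slabinfo | sort -rn -k3 | head -n 5
    fi
}

mem_snapshot
```

On the node this could be run in a loop (e.g. every 10 seconds, appending to a file on tmpfs or to the serial console).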
PS: @azrdev, if you can reliably trigger it by executing /usr/sbin/autoupdater from the login shell via serial, then you might not need to write to /dev/kmsg; it should be sufficient to write to stdout or stderr. You could then sprinkle some lines like this through /usr/sbin/autoupdater:
io.stderr:write('We are here - line XXX\n')
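The same idea works from a shell prompt on the node: mirror each debug line to both stderr and the kernel ring buffer, so it still shows up in "dmesg" and on the serial console even if userspace logging dies with the OOM. A sketch (the helper name "kmsg_debug" is made up):

```shell
# Sketch: write a tagged debug line to stderr and, if permitted,
# to the kernel ring buffer (/dev/kmsg, usually root-only).
kmsg_debug() {
    printf 'autoupdater-debug: %s\n' "$1" >&2
    if [ -w /dev/kmsg ]; then
        printf 'autoupdater-debug: %s\n' "$1" > /dev/kmsg
    fi
}

kmsg_debug "starting image download"
```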
@T-X first results:
Without uplink and with private wifi disabled (wireless.wan_radio0.disabled='1') it doesn't crash. I enabled private wifi again, and uptime still goes up. I still have to plug a cable into the WAN port again so the setup is the same as before moving to the debugging location. So far it has only run autoupdater once with an OOM and once without, so in the current state we don't get useful results.
Seems like I can (currently) reproduce a crash while receiving the last ~third of the firmware image, i.e. in wget.
It would be interesting to see how another device with the same specs (e.g. a WR841N) performs in the exact same situation (same spot, same configuration). If the issue is the same, #753 would be the correct issue.
azrdev, this reproducible crash, is it with the private wifi enabled or disabled now? And does this node itself have a fastd VPN uplink via its WAN port?
@rotanid I'll test that as suggested.
azrdev, this reproducible crash, is it with the private wifi enabled or disabled now? And does this node itself have a fastd VPN uplink via its WAN port?
@T-X private wifi enabled, and the fastd VPN uplink at the WAN port is in use, too.
@azrdev what about your check, may I close this issue in favor of #753?
For me this issue is more specifically about a certain router model (and not about all 32MB-RAM units in a certain domain).
Potentially I see a similar issue on a TL-WA901 v5 when running the v4 image... (this is not good, but since it looks similar, I assume something is "wrong" in the target.)
Thanks for not reading this ticket before leaving a comment. We already agreed that he should test the issue with a WR841 in the same location and with the same configuration; if there is no issue then, you might be right. Please don't simply assume this without a thorough test.
Thanks for not reading this ticket before leaving a comment.
I guess I can remember it from reading it the last n times; I assume this was not meant as an ad hominem.
My point was to disagree that this is something like #753, given that your previous question was not marked as "needs answer" and had not been answered.
In this case it's either a) broken hardware (an individual defect), b) a broken target/profile, c) an issue specific to this network spot, or d) an overall issue in the L2 domain.
Your suggestion was to close this and move it into d), and that assumption I cannot follow, since, from what I see discussed above, I do not see an overall instability in the network (for certain hardware classes).
Replacing the unit with an 841 would probably help the same way as a drop-in replacement with an identical 842 v2. Connecting a serial console might help as well, since OOMs are in many cases not transmitted via the network (ssh logread -f, syslog, ...), because the network stack dies at that very moment.
Your suggestion was to close this and move it into d).
Simply wrong! My suggestion was to check whether another device in the same spot/config has the same problems. If yes, it's d); if not, it's not d).
sorry for delaying this, I'll do the test with the 841
@rotanid "@azrdev what about your check, may i close this issue in favor of #753 ?" reads to me like "either you perform the suggested check or we will assume this issue to be a totally different one."
But of course this might be a successful strategy to reduce the number of issues in cases where there is no feedback for individual ones which sounded "different" when they were opened.
Anyhow, depending on the outcome here, I would consider opening a similar request for a 901 v5 (frequent OOM reboots like https://paste.debian.net/989352/ , where an 841 v11 in the same spot performs without problems). But since 1) I do not have a second 901 v5 to test for an individual hardware defect, nor a 901 v4 to see if it's an issue with the profile, and this build is not LEDE-based but CC-based, I cannot open a topic; I just want to hint that there might be similar situations on other routers too ("profile specific").
This is not an attempt to hijack the issue or to derail it, just a note that I am rather curious about the outcome of this one.
So, I have had these running for a month now, logging uptime and load (manually, since our dashboard went down).
Both nodes were in the same location as previously, and both had a fastd VPN uplink via ethernet (WAN port). The 842 had private wifi disabled and still seems to have crashed occasionally, as the graph shows. The 841 had private wifi enabled and shows some suspiciously low uptimes, too: seldom above 24h, and a lot of reboots in a row.
I did not capture serial logs this time, but IMHO the data suggests that the 841 also frequently hangs with private wifi enabled.
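The kind of manual sample used above could look roughly like this (a sketch; the helper name "log_sample" is made up, and it would typically be appended to a file from a cron entry on the node):

```shell
# Sketch: emit one timestamped sample of load averages and uptime,
# suitable for appending to a log file every few minutes.
log_sample() {
    # /proc/loadavg fields 1-3: 1/5/15-minute load averages
    # /proc/uptime field 1: seconds since boot
    printf '%s load=%s uptime=%ss\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$(cut -d' ' -f1-3 /proc/loadavg)" \
        "$(cut -d' ' -f1 /proc/uptime)"
}

log_sample
```

A sudden drop of the uptime value back to near zero in such a log marks a reboot, even when no serial console is attached.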
So what's your conclusion? To change it from "OOM on 842" to "OOM on 842 with private wifi off and 841 with private wifi on"?
(Sorry, this is not a serious suggestion, but what I take from your reply is: "happens with 841 on the same spot as well", correct?)
happens with 841 on same spot as well
Yes, though I'm not sure if it's in the autoupdater (as with the 842), because I didn't capture a serial log.
So if this isn't related to a specific device, it might as well be the same as #753 and/or #1243, right?
@rotanid probably, yes. But IIRC the 842 was already unstable (long) before 2017.1.x, so it would be #753, not #1243.
@azrdev ok, let's continue this discussion over there then.
My TP-Link TL-WR842N v2 with firmware from darmstadt.freifunk.net frequently reboots; usually it doesn't get more than 1 hour of uptime. There is nothing useful in the dmesg logs (except maybe lots of
daemon.notice netifd: client (1352): cat: write error: Broken pipe
), but I got a serial log, to be found at https://git.darmstadt.ccc.de/snippets/9