freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io

frequent OOM on 842N v2 #1197

Closed azrdev closed 6 years ago

azrdev commented 7 years ago

My TP-Link TL-WR842N v2 with firmware from darmstadt.freifunk.net frequently reboots; usually it doesn't get more than 1 hour of uptime. Nothing useful in the dmesg logs (except maybe lots of daemon.notice netifd: client (1352): cat: write error: Broken pipe), but I got a serial log, to be found at https://git.darmstadt.ccc.de/snippets/9

mweinelt commented 7 years ago

The node (https://meshviewer.darmstadt.freifunk.net/#/en/map/10feed08eda6) is running with the latest changes from the master branch.

https://github.com/freifunk-gluon/gluon/commit/582d09615bdcd9d9b6cb5ee74173ab78af3d846d

Sunz3r commented 7 years ago

I found several nodes with high CPU load (SYS load > 95%) if mesh-on-LAN is active. The nextnode page won't load and sometimes the node crashes. After disabling the mesh interface (like "ifconfig eth0 down") the problem disappears.

There is no ugly Flag in "batctl tg"

Gluon-Version: gluon-v2017.1.1+

azrdev commented 7 years ago

@Sunz3r does this always occur when mesh on LAN is active, or only when there are also connections on the LAN interfaces (cable plugged in and/or other batman nodes to communicate with)?

T-X commented 6 years ago

@azrdev: I see processes called "autoupdater" and "10stop-network" in the provided log. So it seems that it crashed while trying to update?

Can you maybe reliably reproduce the crash when running /usr/sbin/autoupdater manually?

T-X commented 6 years ago

Also, the 842nd v1/v2 seems to be one of those devices with 8MB of flash, but still only 32MB of RAM. That could explain why this type of device is the first to have issues while trying to update, compared to an 841nd, for instance, which has 32MB of RAM too, but only needs to store a 4MB image when updating.
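As a rough back-of-the-envelope check of that: the sketch below assumes the downloaded image is held in RAM while updating and is at most as large as the flash it has to fit into. `image_share` is just a made-up helper name, and the real headroom is smaller still, since the kernel and running services take their share of the 32MB.

```shell
# Share of total RAM (in percent) taken up by an image of the given size.
# Both arguments are in MB. This is an upper bound, since the image is
# assumed to be no larger than the flash (assumption, not measured).
image_share() {
    awk -v img="$1" -v ram="$2" 'BEGIN { printf "%.1f\n", img * 100 / ram }'
}

image_share 8 32   # 842nd: 8MB flash, 32MB RAM
image_share 4 32   # 841nd: 4MB flash, 32MB RAM
```

So under these assumptions the 842nd may need up to a quarter of its RAM just for the image, about twice as much as the 841nd.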

T-X commented 6 years ago

@Sunz3r: Seems like a different issue. Maybe create a new ticket in the issue tracker here on Github?

azrdev commented 6 years ago

> Can you maybe reliably reproduce the crash when running /usr/sbin/autoupdater manually?

@T-X might be. If so, how would that help us / what should I provide?

T-X commented 6 years ago

@azrdev: One first, interesting thing to find out would be whether the crash happens during or after downloading the image. Can you add some "print/write" statements writing to /dev/kmsg in /usr/sbin/autoupdater to output some debug messages, so we know better at which point of the update process the Out-of-Memory happens?

If it were possible for you to reproduce the issue reliably then I think it might make sense to add some patches to increase the verbosity of the Out-of-Memory trace, too. For instance more detailed information regarding what is using how much memory not just in userspace but also in kernel space would be very interesting. Not sure, maybe it'd be possible to compile an ar71xx image with CONFIG_KERNEL_SLABINFO=y and dump /proc/slabinfo from within the OOM panic handler, too.
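Independent of patching anything, something like the following could be started in the background from the serial shell before running the autoupdater, so that the free memory ends up interleaved with the OOM trace. This is only a rough sketch: the names `memfree_kb`, `kmark` and `watch_mem`, the `memwatch:` prefix and the `MEMINFO`/`KMSG` override variables are made up for illustration. The only things it relies on are /proc/meminfo and /dev/kmsg.

```shell
#!/bin/sh
# Sketch: periodically write the free memory to the kernel log.
# MEMINFO and KMSG can be overridden to try this off-device (assumption).
MEMINFO="${MEMINFO:-/proc/meminfo}"
KMSG="${KMSG:-/dev/kmsg}"

# Print the MemFree value (in kB) from a meminfo-style file.
memfree_kb() {
    awk '/^MemFree:/ { print $2 }' "$MEMINFO"
}

# Write one tagged line to the kernel log.
kmark() {
    echo "memwatch: $*" >> "$KMSG"
}

# Log MemFree COUNT times, INTERVAL seconds apart.
watch_mem() {
    count="$1"
    interval="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        kmark "MemFree=$(memfree_kb) kB"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

With these functions defined in the shell, `watch_mem 300 1 &` followed by `/usr/sbin/autoupdater` would log one value per second for five minutes. The interval and count are arbitrary and would need adjusting to how long the download takes.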

T-X commented 6 years ago

PS: @azrdev or if you can reliably trigger it by executing /usr/sbin/autoupdater from the login shell via the serial console then you might not need to write to /dev/kmsg. Then it should be sufficient to write to stdout or stderr. You could sprinkle some lines like this in /usr/sbin/autoupdater then:

io.stderr:write('We are here - line XXX\n')

azrdev commented 6 years ago

@T-X first results:

Without an uplink and with private wifi disabled (wireless.wan_radio0.disabled='1') it doesn't crash. I enabled private wifi again, and uptime still goes up. I still have to plug a cable into the WAN port again, so that the setup is the same as before I moved it to the debugging location. It just ran the autoupdater once with an OOM and once without, so in the current state we don't get useful results.

azrdev commented 6 years ago

seems like I can (currently) reproduce a crash while receiving the last ~third of the firmware image, i.e. in wget

rotanid commented 6 years ago

would be interesting to see how another device with the same specs (e.g. WR841N) performs in the exact same situation (same spot, same configuration). if the issue is the same, #753 would be the correct issue.

T-X commented 6 years ago

azrdev, this reproducible crash, is it with the private wifi enabled or disabled now? And this node itself has a fastd VPN uplink via its WAN port?

azrdev commented 6 years ago

@rotanid I'll test that as suggested.

> azrdev, this reproducible crash, is it with the private wifi enabled or disabled now? And this node itself has a fastd VPN uplink via its WAN port?

@T-X private wifi enabled and fastd vpn uplink at wan port in use, too

rotanid commented 6 years ago

@azrdev what about your check, may i close this issue in favor of #753 ?

Adorfer commented 6 years ago

For me this issue is more specific to a certain router model (and not about all 32MB-RAM units in a certain domain).

Potentially i would see a similar issue on a TL-WA901V5 when running a V4 image... (this is not good, but since it looks similar, i assume something is "wrong" in the target.)

rotanid commented 6 years ago

thanks for not reading this ticket before leaving a comment. we already agreed that he should test the issue with a WR841 in the same location and the same configuration - if there is no issue then, you might be right. please don't simply assume this without a thorough test.

Adorfer commented 6 years ago

> thanks for not reading this ticket before leaving a comment.

i guess i can remember it from reading the last n times. i guess this was not meant as an ad hominem.

my point was to disagree that this is something like #753 just because your previous question was not marked as "needs answer" and has not been answered.

in this case it's either a) broken HW (individual defect) b) broken target/profile c) specific issue on network spot. d) overall issue in the l2 domain.

your suggestion was to close this and to move it to d). And this assumption i can not follow, since -from what i see discussed above- i do not see an overall instability on the network (for certain types of hw class).

replacing the unit with an 841 would probably help in the same way as a drop-in replacement with an identical 842v2. connecting a serial console might help as well, since OOMs are in many cases not transmitted via the network (ssh logread -f, syslog...), because the network stack dies at this very moment.

rotanid commented 6 years ago

> your suggestion was to close this and to move in d).

simply wrong! my suggestion was to check if another device in the same spot/config has the same problems. if yes, it's d) if not, it's not d)

azrdev commented 6 years ago

sorry for delaying this, I'll do the test with the 841

Adorfer commented 6 years ago

@rotanid "@azrdev what about your check, may i close this issue in favor of #753 ?" reads to me as "either you perform the suggested check or we will assume this issue to be a totally different one".

But of course this might be a successful strategy to reduce the number of issues in case there is no feedback for individual ones which sounded "different" when they were opened.

Anyhow, depending on the outcome here, i would consider opening a similar request for a 901v5 (frequent OOM reboots like https://paste.debian.net/989352/ , where an 841v11 in the same spot performs without problems). But since i have neither a second 901v5 to test for an individual HW defect nor a 901v4 to see if it's an issue with the profile, and since this build is not LEDE but CC, i can not open a topic. i would just like to hint that there might be similar 'profile specific' situations on other routers too. This is not an attempt to hijack the issue or to derail it, just a note that i am rather curious about the outcome of this one.

azrdev commented 6 years ago

So, I had these running now for a month, logging uptime and load (manually, since our dashboard went down).

Both nodes were in the same location as previously, and both had a fastd vpn uplink via ethernet (wan port). The 842 had private wifi disabled but still seems to have crashed occasionally, as the graph shows. The 841 had private wifi enabled and shows some suspiciously low uptime, too: seldom above 24h, and a lot of reboots in a row.

I did not capture serial logs this time, but IMHO the data suggests that the 841 also frequently hangs with private wifi enabled

841: [uptime graph, image "2017-10-15 caek uptime 2"]

842: [uptime graph, image "842nd"]
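For anyone who wants a reboot count out of such an uptime log rather than reading it off a graph, here is a small sketch. It assumes one sample per line in the form `<timestamp> <uptime in seconds>` with no spaces inside the timestamp. That format and the name `count_reboots` are assumptions for illustration, not necessarily how the data above was recorded. A reboot is counted whenever the uptime is lower than in the previous sample.

```shell
# Count reboots in an uptime log: every drop in uptime between two
# consecutive samples is taken to mean the node restarted in between.
# Expected input (assumed format): "<timestamp> <uptime_seconds>" per line.
count_reboots() {
    awk 'NR > 1 && ($2 + 0) < prev { n++ }
         { prev = $2 + 0 }
         END { print n + 0 }' "$1"
}
```

Called as `count_reboots uptime.log`, it prints a single number. Several reboots between two samples would only be counted once, so a short logging interval helps.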

Adorfer commented 6 years ago

So what's your conclusion? To change it from "oom on 842" to "oom on 842 with private wifi off and 841 with private wifi on"?

(Sorry, this is not a serious suggestion, but what i may catch from your reply: "happens with 841 on same spot as well" correct?)

azrdev commented 6 years ago

> happens with 841 on same spot as well

yes, though I'm not sure if it's in the autoupdater (as with the 842) because I didn't capture a serial log

rotanid commented 6 years ago

so if this isn't related to a specific device, it might as well be the same as #753 and/or #1243 - right?

azrdev commented 6 years ago

@rotanid probably, yes. but iirc the 842 was already unstable (long) before 2017.1.x so it would be #753 not #1243

rotanid commented 6 years ago

@azrdev ok, let's continue this discussion over there then.