Closed FFPeter closed 6 years ago
The next time you see this issue, try getting the kernel log using the dmesg
command; it will contain a lot less noise than logread, so if there are any error messages, odds are you will find them there.
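A minimal sketch of how the kernel log could be narrowed down further, assuming a POSIX shell with `grep` on the node. The function name `filter_kernel_log` and the keyword list are my own assumptions, not something the kernel or Gluon defines; adjust the pattern to whatever your driver actually prints.

```shell
#!/bin/sh
# Keep only kernel log lines that match a (hypothetical) keyword list.
# Reads log text on stdin, so it works with live `dmesg` output
# as well as with a previously saved log file.
filter_kernel_log() {
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -iE 'eth[0-9]|switch|timeout|error|fail' || true
}

# Usage on the router (output path is just an example):
#   dmesg | filter_kernel_log > /tmp/dmesg-filtered.txt
```

An empty result does not prove much on its own, it only means none of the assumed keywords showed up.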
Additional info: /etc/init.d/network restart brings the interfaces back up and running.
I'm having the same issue with LEDE, and someone else has it too on OpenWRT and the official TP-Link firmware. In my case there are zero entries about this failure in logread and dmesg. As @FFPeter noticed, /etc/init.d/network restart temporarily fixes the issue.
Here's the relevant topic on LEDE forum: https://forum.lede-project.org/t/lan-stops-working-every-now-and-then
someone opened a bug report at a better-suited location: https://bugs.lede-project.org/index.php?do=details&task_id=794
Please make sure that this is a software-related issue and not just "some nodes with partially defective switches" (i have some 841s showing the same behaviour since the beginning, but i learned to live with them via check scripts that reboot the node if the copper switch stops transporting packets).
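For anyone wondering what such a check script could look like, here is a rough sketch, not the script actually used above. It assumes a POSIX shell; `watchdog_once` is a hypothetical helper name, and the probe and recovery commands are passed in as strings so you can choose between a reboot and the `/etc/init.d/network restart` workaround mentioned in this thread.

```shell
#!/bin/sh
# Run a probe command; if it fails, run a recovery command.
# Both commands are passed as strings so the policy stays configurable.
watchdog_once() {
    probe_cmd=$1
    recover_cmd=$2
    if sh -c "$probe_cmd" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "probe failed, recovering"
        sh -c "$recover_cmd"
    fi
}

# Example call (192.0.2.1 is a placeholder for a host reachable
# only through one of the wired ports):
#   watchdog_once 'ping -c 3 -W 2 192.0.2.1' '/etc/init.d/network restart'
```

Run from cron every few minutes, this only hides the symptom; it does not tell you whether the hardware or the driver is at fault.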
https://bugs.lede-project.org/index.php?do=details&task_id=762 (the other one was a duplicate)
According to those two reports (https://bugs.lede-project.org/index.php?do=details&task_id=762#comment3828), this problem is caused by a patch that Gluon v2016.2.7 shares with LEDE, as plain openwrt and stock firmware do not have this issue.
Maybe @NeoRaider or another developer with more knowledge of gluon's structure can nail it down with a diff.
have you personally tested stock and openwrt with the same usage patterns as your gluon build? i doubt it, but feel free to correct me and provide details if i'm wrong. also, the person claiming to be using OpenWrt without problems "for about a year" is likely not telling the truth, as there hasn't been support for this device in OpenWrt before February.
@MPW1412 the guy claiming to have no problems with OpenWrt doesn't have a 1043v4 but a 1043v2-based 1045v2 - therefore, his claims don't matter as the 1043v4 is different. also, "Mihnea" reported the same issue with OpenWrt as with LEDE. unfortunately i don't have time to dig into this deeper than reading and collecting information. we didn't see problems in our "real world freifunk setups" so far.
Hello rotanid, Mihnea here. The guy claiming to have no problems (Dmitry) is running a local version of 1043ND v4, namely 1045ND v2. The FW he's using is a custom compile of OpenWRT CC. In a nutshell, CC is playing nicely with 1043ND v4.
Yesterday I reproduced the issue with both the newest OpenWRT snapshot and with LibreCMC (which is LEDE-based).
The issue manifests itself instantly during high throughput upload scenarios, mostly.
Hope this helps.
@ThaVyRuZ thanks for joining here, but: no, he doesn't. looking at his (Dmitry) patch, this 1045ND v2 is clearly a variant of 1043ND v2 and NOT v4 - no matter what he says, the truth lies in the code ;-)
Hello rotanid, understood. Then it is weird that it's happening with v2 as well, that is if Dmitry is experiencing the same bug.
In my case, as i have mentioned before, the bug is trivially easy to reproduce. Daniel, the guy compiling the SuperWRT distro, is also facing the same issue under the same circumstances on his v4 so it's not the fault of a defective unit on my side.
Weird thing for me is that it's only been happening since i switched from a 100 mbit WAN to a 300 mbit one. With the 100 mbit line the router never locked up (ran it for like a full month). It's only under the higher load of the 300 mbit line that it crashes. So my assumption is that the bug is present on all 1043ND v4s, it's just that not many people are experiencing it since they are not running the equipment at such high line speeds/loads. That doesn't make the bug less significant: it's still a very annoying issue.
@ThaVyRuZ ok, but it would have been easier to find a fix, if it had actually worked with OpenWrt. now that it doesn't, we have to hope an experienced LEDE/OpenWrt developer wants to help debugging and fixing this. until then, we can't really do anything about it i guess.
upstream it seems there's a fix coming by "Lucian CRISTIAN", so maybe we can backport this to gluon @NeoRaider ?
@NeoRaider, thanks for backporting this. Could we cherry-pick this commit into v2017.1.x, so that v2017.1.5 will get this fix?
This issue blocks a lot of devices from being installed at their designated positions, so it'd be awesome to have this in the next release.
@MPW1412 out of curiosity, do you use VLANs on those devices or for your mesh? apparently most problems with these devices only appeared when using VLANs
For me this issue first occurred on a 1043v4, where we actually used VLANs to connect both Rocket M2s with Gluon via mesh and proprietary APs via client network. We still use v2016.1.7, so the version which failed us is based on OpenWRT.
But, I don't think that @FFPeter, who reported this originally here in our forum, used VLANs in that setup.
Then we reflashed the device, and even with all ports set to client network or all ports set to mesh network the device crashed. It is reproducible that V2s don't crash and V4s do. We don't have V5s yet, as we only started building LEDE based firmware recently.
It occurred mostly at well frequented locations with about 50-80 clients connecting through one 1043, which connects the local mesh cloud to the gateway. The more throughput on the device, the faster they crash.
Just having one idling at home didn't reproduce the crash. I guess this matches the report from a LEDE user in the upstream bug report, whose device crashed reproducibly during speedtests.
On different nodes (all model TL-WR1043nd-v4) the copper-based ports sometimes freeze. If there's no MeshOnWifi link to other nodes, the node goes offline. Logged in via MeshOnWifi, I can see that the interfaces are up, but the counter values do not increase. I think the logread I got was too late to catch the failure.
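The "counters do not increase" symptom can be checked from a shell. The following is only a sketch under assumptions: `counter_stalled` is a hypothetical helper, and the interface name `eth0` in the example may differ on your build. The statistics files under `/sys/class/net/<iface>/statistics/` are the standard Linux sysfs counters.

```shell
#!/bin/sh
# Succeed (exit 0) if the counter file holds the same value before and
# after the wait, i.e. the counter looks stalled.
# Exits with 2 if the file cannot be read.
counter_stalled() {
    counter_file=$1
    wait_seconds=${2:-10}   # default wait of 10 seconds is an arbitrary choice
    before=$(cat "$counter_file") || return 2
    sleep "$wait_seconds"
    after=$(cat "$counter_file") || return 2
    [ "$before" = "$after" ]
}

# Example call (interface name is an assumption):
#   counter_stalled /sys/class/net/eth0/statistics/rx_packets 10 \
#       && echo "eth0 RX counter did not move"
```

Keep in mind that an unchanged counter on a genuinely idle port is normal, so this is only meaningful while traffic is expected on the wire.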
logread.txt
https://forum.freifunk-muensterland.de/t/treiberproblem-bei-tp-link-1043-v4-in-firmware-v2016-2-3-v2016-2-4/2634