freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io

switch ports freezing on TL-WR1043ND v4 #1101

Closed FFPeter closed 6 years ago

FFPeter commented 7 years ago

On different nodes (all model TL-WR1043ND v4) the copper-based ports sometimes freeze. If there's no MeshOnWifi with other nodes, the node goes offline. Logged in via MeshOnWifi, I can see that the interfaces are up, but the counter values do not increase. I think the logread I got was too late to catch the failure.
logread.txt

https://forum.freifunk-muensterland.de/t/treiberproblem-bei-tp-link-1043-v4-in-firmware-v2016-2-3-v2016-2-4/2634

neocturne commented 7 years ago

The next time you see this issue, try getting the kernel log using the dmesg command; it will contain a lot less noise than logread, so if there are any error messages, odds are that you find them there.
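A minimal sketch of how both logs could be saved at the moment of a freeze, so they can be attached to a report later. The function name `capture_logs`, the `freeze-` file prefix and the output directory are assumptions made for illustration; only `dmesg` and `logread` come from the thread.

```shell
#!/bin/sh
# Sketch only: capture_logs and its file naming are hypothetical.

# Save the kernel log and, if available, the system log into OUT_DIR.
# Prints the common file prefix so the caller can find both files.
capture_logs() {
    out_dir="$1"
    stamp=$(date +%Y%m%d-%H%M%S)
    prefix="$out_dir/freeze-$stamp"
    mkdir -p "$out_dir"

    # Kernel ring buffer; keep going even if dmesg is restricted or missing.
    dmesg > "$prefix-dmesg.txt" 2>&1 || true

    # logread exists on OpenWrt/LEDE; skip it quietly elsewhere.
    if command -v logread >/dev/null 2>&1; then
        logread > "$prefix-logread.txt" 2>&1 || true
    fi

    echo "$prefix"
}
```

On the node this might be called as `capture_logs /tmp/freeze-logs` before restarting the network. Note that `/tmp` on OpenWrt-based firmware lives in RAM, so the files need to be copied off the device before a reboot.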

FFPeter commented 7 years ago

No problem: dmesg.txt

FFPeter commented 7 years ago

Additional info: /etc/init.d/network restart brings the interfaces back up and running.

Misiek304 commented 7 years ago

I'm having the same issue with LEDE, and someone else has it too on OpenWRT and the official TP-Link firmware. In my case there are zero entries about this failure in logread and dmesg. As @FFPeter noticed, /etc/init.d/network restart temporarily fixes the issue.

Here's the relevant topic on LEDE forum: https://forum.lede-project.org/t/lan-stops-working-every-now-and-then

rotanid commented 7 years ago

someone opened a bug report at a better-suited location: https://bugs.lede-project.org/index.php?do=details&task_id=794

Adorfer commented 7 years ago

Please make sure that this is a software-related issue and not just "some nodes with partially defective switches". (I have some 841s showing the same behaviour since the beginning, but I learned to live with them via check scripts that reboot the node if the copper switch stops transporting packets.)
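A check script of the kind described here could look roughly like the sketch below. It is not the actual script used on those nodes: the function names (`read_packets`, `is_stalled`, `check_once`), the sampling interval and the recovery command are all assumptions. It samples the kernel's per-interface packet counters twice and runs a recovery command if they did not move.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; names and defaults are assumptions.

# Print the sum of the RX and TX packet counters found in STATS_DIR
# (on Linux normally /sys/class/net/<iface>/statistics).
read_packets() {
    stats_dir="$1"
    rx=$(cat "$stats_dir/rx_packets")
    tx=$(cat "$stats_dir/tx_packets")
    echo $((rx + tx))
}

# Succeed (exit status 0) if the counter did not change between two samples.
is_stalled() {
    [ "$1" -eq "$2" ]
}

# Sample the counters twice, INTERVAL seconds apart.
# Prints "stalled" and runs RECOVER_CMD if nothing moved, else prints "ok".
check_once() {
    stats_dir="$1"
    interval="$2"
    recover_cmd="$3"
    before=$(read_packets "$stats_dir")
    sleep "$interval"
    after=$(read_packets "$stats_dir")
    if is_stalled "$before" "$after"; then
        echo "stalled"
        $recover_cmd
    else
        echo "ok"
    fi
}
```

From cron this might be invoked as `check_once /sys/class/net/eth0/statistics 30 "/etc/init.d/network restart"`, where `eth0` is an assumed interface name. A port that is unplugged or genuinely idle also shows static counters, so a real script would need an extra condition (for example, only acting while the link is reported as up) to avoid needless restarts.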

neocturne commented 7 years ago

https://bugs.lede-project.org/index.php?do=details&task_id=762 (the other one was a duplicate)

MPW1412 commented 6 years ago

According to those two reports (https://bugs.lede-project.org/index.php?do=details&task_id=762#comment3828), this problem is caused by a patch that Gluon v2016.2.7 shares with LEDE, as plain OpenWrt and stock firmware do not have this issue.

Maybe @NeoRaider or another developer with more knowledge of Gluon's structure can nail it down with a diff.

rotanid commented 6 years ago

have you personally tested stock and openwrt with the same usage patterns as your gluon build? i doubt it, but feel free to correct me and provide details if i'm wrong. also, the person claiming to be using OpenWrt without problems "for about a year" is likely not telling the truth, as there hasn't been support for this device in OpenWrt before February.

rotanid commented 6 years ago

@MPW1412 the guy claiming to have no problems with OpenWrt doesn't have a 1043v4 but a 1043v2-based 1045v2 - therefore, his claims don't matter as the 1043v4 is different. also, "Mihnea" reported the same issue with OpenWrt as with LEDE. unfortunately i don't have time to dig into this deeper than reading and collecting information. we didn't see problems in our "real world freifunk setups" so far.

ThaVyRuZ commented 6 years ago

Hello rotanid, Mihnea here. The guy claiming to have no problems (Dmitry) is running a local version of 1043ND v4, namely 1045ND v2. The FW he's using is a custom compile of OpenWRT CC. In a nutshell, CC is playing nicely with 1043ND v4.

Yesterday I reproduced the issue with both the newest OpenWRT snapshot and with LibreCMC (which is LEDE-based).

The issue manifests itself instantly during high throughput upload scenarios, mostly.

Hope this helps.

rotanid commented 6 years ago

@ThaVyRuZ thanks for joining here, but: no, he doesn't. looking at his (Dmitry's) patch, this 1045ND v2 is clearly a variant of 1043ND v2 and NOT v4 - no matter what he says, the truth lies in the code ;-)

ThaVyRuZ commented 6 years ago

Hello rotanid, understood. Then it is weird that it's happening with v2 as well, that is if Dmitry is experiencing the same bug.

In my case, as i have mentioned before, the bug is trivially easy to reproduce. Daniel, the guy compiling the SuperWRT distro, is also facing the same issue under the same circumstances on his v4 so it's not the fault of a defective unit on my side.

The weird thing for me is that it has only been happening since I switched from a 100 Mbit WAN to a 300 Mbit one. With the 100 Mbit line the router never locked up (I ran it for about a full month). It's only under the higher load of the 300 Mbit line that it crashes. So my assumption is that the bug is present on all 1043ND v4s; it's just that not many people experience it because they are not running the equipment at such high line speeds/loads. That doesn't make the bug less significant: it's still a very annoying issue.
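To correlate freezes with load, the throughput on an interface can be estimated from the kernel's byte counters. The following is a rough sketch under stated assumptions: `rate_mbit` and `sample_rate` are hypothetical helper names, the arithmetic is integer-only, and counter wrap-around is not handled.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Average throughput in Mbit/s between two byte-counter samples
# taken SECONDS apart. Integer arithmetic; wrap-around is not handled.
rate_mbit() {
    bytes_before="$1"
    bytes_after="$2"
    seconds="$3"
    echo $(( (bytes_after - bytes_before) * 8 / seconds / 1000000 ))
}

# Sample rx_bytes + tx_bytes from STATS_DIR twice, SECONDS apart
# (SECONDS must be at least 1), and print the average rate in Mbit/s.
sample_rate() {
    stats_dir="$1"
    seconds="$2"
    b0=$(( $(cat "$stats_dir/rx_bytes") + $(cat "$stats_dir/tx_bytes") ))
    sleep "$seconds"
    b1=$(( $(cat "$stats_dir/rx_bytes") + $(cat "$stats_dir/tx_bytes") ))
    rate_mbit "$b0" "$b1" "$seconds"
}
```

For example, `rate_mbit 0 37500000 1` prints `300`, i.e. 37.5 MB transferred in one second corresponds to a 300 Mbit/s line. On a node, `sample_rate /sys/class/net/eth0/statistics 10` would give a ten-second average, with `eth0` again being an assumed interface name.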

rotanid commented 6 years ago

@ThaVyRuZ ok, but it would have been easier to find a fix, if it had actually worked with OpenWrt. now that it doesn't, we have to hope an experienced LEDE/OpenWrt developer wants to help debugging and fixing this. until then, we can't really do anything about it i guess.

rotanid commented 6 years ago

upstream it seems there's a fix coming by "Lucian CRISTIAN", so maybe we can backport this to gluon @NeoRaider ?

MPW1412 commented 6 years ago

@NeoRaider, thanks for backporting this. Could we cherry-pick this commit into v2017.1.x, so that v2017.1.5 will get this fix?

This issue blocks a lot of devices from being installed at their designated positions, so it'd be awesome to have this in the next release.

rotanid commented 6 years ago

@MPW1412 out of curiosity, do you use VLANs on those devices or for your mesh? apparently most problems with these devices only appeared when using VLANs

MPW1412 commented 6 years ago

For me this issue first occurred on a 1043v4, where we actually used VLANs to connect both Rocket M2s with Gluon via mesh and proprietary APs via the client network. We still use v2016.1.7, so the version which failed us is based on OpenWRT.

But, I don't think that @FFPeter, who reported this originally here in our forum, used VLANs in that setup.

Then we reflashed the device, and even with all ports set to client or all ports set to mesh network the device crashed. It is reproducible that V2s don't crash and V4s do. We don't have V5s yet, as we only started building LEDE-based firmware recently.

It occurred mostly at well frequented locations with about 50-80 clients connecting through one 1043, which connects the local mesh cloud to the gateway. The more throughput on the device, the faster they crash.

Just having one idling at home didn't reproduce the crash. I guess this matches the report from a LEDE user in the upstream bug report, whose device crashed reproducibly during speedtests.