Open stweil opened 1 year ago
Please try a Firmware based on the next branch to asses if the issue was fixed in the driver.
The first node which fails most often writes log data to a server which is reached via IPv4 (so the log data can be sent directly via the mobile router, not using the FFRN gateway). That log data shows messages which are mentioned in issue https://github.com/openwrt/mt76/issues/690. According to a dump of their EEPROMs all three ZyXEL routers seem to have a value of 0 for the 5 GHz antenna selection. Therefore I assume that they need the fix which is mentioned in the issue, but which is not part of the v2022.1.x. That branch uses an "older" release of mt76. Not only the issue fix is missing in that release but also lots of other fixes for MT7915E (the chip used for 5 GHz wifi). Some of the missing fixes were for memory leaks and might be required for any router with MT7915E, even if the antenna selection works fine.
Since 2023-03-16 22:06:00 the first node is running with my own patched firmware which includes some fixes for mt76 and up to now did not fail.
Please try a Firmware based on the next branch to asses if the issue was fixed in the driver.
I already tried building of the master branch, but was not successful. I'll try next
next. :-)
According to a dump of their EEPROMs all three ZyXEL routers seem to have a value of 0 for the 5 GHz antenna selection.
Can you provide a image of the EEPROM? This would help significantly.
Here is a dump of the first part of the factory data used for the EEPROM (I hope this is sufficient as it includes the antenna selection value).
# hexdump -C /dev/mtdblock2
00000000 15 79 00 00 00 0c 43 26 46 48 00 0c 43 26 59 97 |.y....C&FH..C&Y.|
00000010 15 79 c3 14 00 80 02 00 15 79 c3 14 0d 99 01 00 |.y.......y......|
00000020 16 79 c3 14 00 80 02 00 16 79 c3 14 ed 94 01 00 |.y.......y......|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000040 00 00 00 00 00 00 00 00 00 00 80 00 00 00 00 00 |................|
00000050 01 00 80 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 30 1c 01 00 00 00 07 a9 ca 00 00 00 00 00 00 00 |0...............|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000000a0 00 00 00 00 00 00 00 00 c1 db f1 25 d3 d1 e1 57 |...........%...W|
000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000190 92 52 06 aa 38 00 32 2b 00 00 00 00 00 00 00 00 |.R..8.2+........|
000001a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000001b0 00 00 00 00 00 00 00 00 00 00 12 00 00 00 00 0a |................|
000001c0 00 09 16 0a 00 09 16 09 00 09 16 00 00 00 00 00 |................|
000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
Please upload the factory partition from mtd. You can create a copy using dd
.
See factory.bin.
Please try a Firmware based on the next branch to asses if the issue was fixed in the driver.
As far as I see the master ~and next~ branch still uses an old revision d70546462b7b51ebc2bcdd5c534fdf3465be62a4 for mt76 and does not include the required minimum fix nor any others from the bug fixes which were committed later. So testing master would not be helpful. The next branch uses recent code for mt76, so hopefully will fix wifi related problems. I'll report as soon as the new firmware is running.
The status page of the updated nodes now looks different. "Kontakt" is shown (it wasn't with the previous firmware), and "Nachbarknoten" no longer displays a realtime diagram for the wireless mesh connections. See the node 1 for a status page with the previous firmware.
The status page also shows a JavaScript error for status-page.js:748 (tdDistance is undefined). The HTML codeis missing the attribute data-interface-wireless
for the meshs (correct code would be <div data-interface="mesh0" data-interface-address="72:a2:96:cb:fe:81" data-interface-wireless="true">
).
The missing /sys/devices/platform/1e140000.pcie/pci0000:00/0000:00:01.0/0000:02:00.0/net/mesh0/wireless
causes the missing HTML attribute data-interface-wireless
.
mt76 and does not include the required minimum fix
The fix you are referring to does only affect the situation in case the SS configuration is missing from EEPROM but it is present in the file you've linked as well as the hexdump posted earlier (193h --> AAh)
Th breakage on the satus-page can happen. next is not validated for any functionality
You are right, nss is (0xaa & 0x03) = 0x02. Nevertheless I see more than 1000 log entries like the one mentioned in https://github.com/openwrt/mt76/issues/690: Mar 16 12:17:44 Ladenburg-INTAKT-001 kernel: [62188.424399] mt7915e 0000:02:00.0: Message 00005aed (seq 9) timeout
(that was the last one which I have seen on node 1 before installing the patched firmware).
Another log message which occurred often on all three nodes is this one:
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269001] WARNING: CPU: 1 PID: 3501 at backports-5.15.92-1/net/mac80211/mesh_hwmp.c:353 airtime_link_metric_get+0x120/0x684 [mac80211]
[...]
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269825] CPU: 1 PID: 3501 Comm: respondd Tainted: G W 5.10.168 #0
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269833] Stack : 809a0000 00000580 807c62fc 8007c880 80840000 8072fa80 00000000 00000000
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269895] 8478987c 80980000 806fff68 81de8f00 807c8dc7 00000001 84789820 4564a25f
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269960] 00000000 00000000 806fff68 847896c0 c0000dc7 00000000 ffffffea 00000000
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270006] 847896cc 00000dc7 807cda90 20202020 80000000 00000001 00000000 80700000
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270049] 00000009 82d80da0 8236f1f0 00000580 00000018 ffffffff 00000004 80980004
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270093] ...
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270118] Call Trace:
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270158] [<80007fd0>] show_stack+0x30/0x100
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270198] [<80320218>] dump_stack+0x9c/0xcc
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270229] [<8002ff20>] __warn+0xb0/0x11c
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270242] [<8002ffe8>] warn_slowpath_fmt+0x5c/0xac
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270311] [<82f6c498>] airtime_link_metric_get+0x120/0x684 [mac80211]
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270373] [<82f09004>] sta_set_sinfo+0x7e0/0xfec [mac80211]
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270436] [<82f23ff0>] ieee80211_nan_func_match+0x14bc/0x2214 [mac80211]
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270545] [<82e2dc90>] nl80211_put_sta_rate+0x140c/0x15e0 [cfg80211]
It also occurs on the nodes with Gluon next based firmware (about 10 x per hour).
The breakage on the status page can happen [...]
Should I report that as a separate issue?
Please edit your comment instead of posting multiple related ones subsequently.
First of all, the warning you see is a bug in mac80211 in case no airtime metric has yet been acquired from the target. This can be ignored.
The message itself is very ambiguous, it can be insufficient power or misbehaving firmware. mt76 does allow for collecting firmware debug information, have you tried acquiring those?
The message itself is very ambiguous, it can be insufficient power or misbehaving firmware. mt76 does allow for collecting firmware debug information, have you tried acquiring those?
If you are referring to the mt7915e timeout message, that one no longer occurred after I installed firmware with newer mt76 code. But I now wrote 1 to /sys/kernel/debug/ieee80211/phy{0,1}/mt76/fw_debug_wm
. That floods my logfile, so I had to disable it again. I can try to enable it after a new failure.
Meanwhile node 1 runs with firmware based on the stable release, but patched with a more recent mt76, since more than 8 days. This is the longest uninterrupted uptime for this node so far since its installation in December 2022. Up to now dmesg
only shows these warnings, but none of the critical timeout messages:
WARNING: CPU: 3 PID: 28464 at backports-5.15.92-1/net/mac80211/mesh_hwmp.c:353 airtime_link_metric_get+0x120/0x684 [mac80211]
(> 50 times, can be ignored according to the comment above)
br-client: received packet on bat0 with own address as source address (addr:b8:ec:a3:e0:f0:50, vlan:0)
(> 100 times)
The nodes 2 and 3 which are less heavily used had reached up to 19 days uptime with an unpatched stable firmware. Both are running fine now with firmware based on gluon next since more than 3 days.
Node 1 failed at 19:53, but restarted by itself an hour later:
INFO:127.0.0.1:<3>Mar 25 19:53:47 Ladenburg-INTAKT-001 kernel: [769695.356542] mt7915e 0000:02:00.0: Message 00005aed (seq 9) timeout
[....]
INFO:127.0.0.1:<29>Mar 25 20:05:43 Ladenburg-INTAKT-001 netifd: wan6 (2577): Command failed: ubus call network.interface notify_proto { "action": 0, "link-up": true, "data": { }, "keep": false, "ip6addr": [ { "ipaddr": "2a01:598:c829:db5c:70a2:96ff:fecb:fe80", "mask": "64", "preferred": 14399, "valid": 86399, "offlink": true } ], "routes6": [ { "target": "::", "netmask": "0", "gateway": "fe80::a40e:6fff:fe23:c8c9", "metric": 640, "valid": 29 }, { "target": "2a01:598:c829:db5c::", "netmask": "64", "metric": 256, "valid": 86399 } ], "dns": [ "fe80::a40e:6fff:fe23:c8c9" ], "interface": "wan6" } (Request timed out)
[...]
INFO:127.0.0.1:<29>Mar 25 20:26:14 Ladenburg-INTAKT-001 netifd: client (2536): Command failed: ubus call network.interface notify_proto { "action": 0, "link-up": true, "data": { }, "keep": false, "ip6addr": [ { "ipaddr": "fdc3:67ce:cc7e:9001:baec:a3ff:fee0:f050", "mask": "64", "preferred": 14394, "valid": 86394, "offlink": true }, { "ipaddr": "2a01:4f8:140:7700:baec:a3ff:fee0:f050", "mask": "64", "preferred": 14394, "valid": 86394, "offlink": true }, { "ipaddr": "2a01:4f8:160:9700:baec:a3ff:fee0:f050", "mask": "64", "preferred": 14388, "valid": 86388, "offlink": true }, { "ipaddr": "2a01:4f8:171:fcff:baec:a3ff:fee0:f050", "mask": "64", "preferred": 14376, "valid": 86376, "offlink": true } ], "routes6": [ { "target": "fdc3:67ce:cc7e:9001::", "netmask": "64", "metric": 256, "valid": 86394 }, { "target": "2a01:4f8:140:7700::", "netmask": "64", "metric": 256, "valid": 86394 }, { "target": "2a01:4f8:160:9700::", "netmask": "64", "metric": 256, "valid": 86388 }, { "target": "2a01:4f8:171:fcff::", "netmask": "64", "metric": 256, "valid": 86376 }, { "
INFO:127.0.0.1:<30>Mar 25 20:57:30 Ladenburg-INTAKT-001 logread[1368]: Logread connected to 127.0.0.1:8888 via tcp
Node 1 now runs since 10 days with a firmware based on the next
branch and latest openwrt and mt76.
I noticed that it does not get any owe client connections. iwinfo
shows owe0
and owe1
with many unknown
entries:
owe0 ESSID: "owe.freifunk-rhein-neckar.de"
Access Point: 72:A2:96:CB:FE:82
Mode: Master Channel: 6 (2.437 GHz) HT Mode: HE20
Center Channel 1: 6 2: unknown
Tx-Power: 20 dBm Link Quality: unknown/70
Signal: unknown Noise: -91 dBm
Bit Rate: unknown
Encryption: WPA3 OWE (CCMP)
Type: nl80211 HW Mode(s): 802.11ax/b/g/n
Hardware: 14C3:7915 14C3:7915 [MediaTek MT7915E]
TX power offset: none
Frequency offset: none
Supports VAPs: yes PHY name: phy0
ifconfig
shows that the owe interfaces send packages, but don't receive ones:
owe0 Link encap:Ethernet HWaddr 72:A2:96:CB:FE:82
inet6 addr: fe80::70a2:96ff:fecb:fe82/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:21752335 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:1567629488 (1.4 GiB)
Update 2023-04-16: All routers (also those with stable firmware) show similar output for iwinfo
and ifconfig
as long as no client has connected. So that is no indicator for problems with owe connections. Nevertheless it is surprising that not a single successful owe connection happened since the firmware update ten days ago.
Sounds like nbd found a way to fix the issue (see linked issue) I'm betting it was this one https://github.com/openwrt/openwrt/commit/c8963580aaf6a1207367c809d1350557d6e075e9
Should we consider backporting it to openwrt-22?
Should we consider backporting it to openwrt-22?
Backporting latest mt76 seems to be a good idea. Since 9 days our node Ladenburg-INTAKT-001 runs stable with firmware made from gluon next. The owe problem is gone, owe works again.
Current mt76 is not compatible with mac80211 used in OpenWrt 22.03.
What would be more productive is backporting (and validating!) said patch from nbd to 23.05.
The latest unstable firmware still fails with timeout
very often. I don't know which patches it includes. @herbetom?
One of our users also reported a non working WLAN mesh on one of his D-Link DAP-X1860 (A1) with Gluon 2022.1.4 and 2023.1. He has several of those devices. All of them do WLAN mesh properly, except a single one.
Unfortunately I don't have SSH access, so I cannot provide any logs.
Typically the nodes which are heavily used fail more often, sometimes several times per day, while other nodes in the same WLAN mesh with less clients can run several days or even weeks without failures.
Meanwhile I run a cron job which checks ~logread~ the output of dmesg
for "timeout" and reboots the node if a timeout was detected. In addition it writes the current time to a logfile before rebooting. That is a workaround which works pretty well.
actually there is an issue for meshing between ath10k/Qualcomm and mt76/Mediatek (wrong meshing code in Linux in general vs qualcomms own hw acceleration for the wifi frames in question) https://github.com/freifunk-gluon/gluon/pull/2905 v2023.1 works around this. gluon master has the full fix from upstream.
so yeah: recommend ueers who encounter issues with meshing to update to v2023.1 :)
NWA50AX still loosing mesh and crashing on 2023.1.x
Bug report
What is the problem?
The ramips-mt7621 device ZyXEL NWA50AX does not run reliably in a wireless mesh with 3 such devices. In December 2022 the routers were installed using FFRN firmware 2.0.3-20221104. Meanwhile all of them have failed multiple times without any obvious reason. "Failed" means in most cases that they can no longer be reached per SSH via gateway and that wireless clients no longer get Internet although they still can connect, but sometimes the routers also restarted without an obvious cause. Some information was collected at the FFRN forum. There is also a protocol with failures and data from the failed devices for the first month of operation at my GitHub wiki.
The mesh is connected to the Internet by an LTE router (initially 4G, meanwhile 5G). The first node which is directly connected to that router is heavily used by wireless clients and fails most often (sometimes after only a few hours uptime). The two others fail less often, perhaps depending on their use by wireless clients.
What is the expected behaviour?
As the nodes run unattended and cannot be easily accessed physically, it is essential that they work reliably.
Gluon Version:
v2022.1.1 and newer versions from branch v2022.1.x.
Site Configuration:
https://github.com/Freifunk-Rhein-Neckar/site-ffrn/releases/tag/v2.0.3-20221104 and newer versions.
Custom patches:
None.