freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io
Other
547 stars 325 forks source link

ramips-mt7621: ZyXEL NWA50AX losing mesh uplink connection #2822

Open stweil opened 1 year ago

stweil commented 1 year ago

Bug report

What is the problem?

The ramips-mt7621 device ZyXEL NWA50AX does not run reliably in a wireless mesh with 3 such devices. In December 2022 the routers were installed using FFRN firmware 2.0.3-20221104. Meanwhile all of them have failed multiple times without any obvious reason. "Failed" means in most cases that they can no longer be reached per SSH via gateway and that wireless clients no longer get Internet although they still can connect, but sometimes the routers also restarted without an obvious cause. Some information was collected at the FFRN forum. There is also a protocol with failures and data from the failed devices for the first month of operation at my GitHub wiki.

The mesh is connected to the Internet by an LTE router (initially 4G, meanwhile 5G). The first node which is directly connected to that router is heavily used by wireless clients and fails most often (sometimes after only a few hours uptime). The two others fail less often, perhaps depending on their use by wireless clients.

What is the expected behaviour?

As the nodes run unattended and cannot be easily accessed physically, it is essential that they work reliably.

Gluon Version:

v2022.1.1 and newer versions from branch v2022.1.x.

Site Configuration:

https://github.com/Freifunk-Rhein-Neckar/site-ffrn/releases/tag/v2.0.3-20221104 and newer versions.

Custom patches:

None.

blocktrron commented 1 year ago

Please try a Firmware based on the next branch to asses if the issue was fixed in the driver.

stweil commented 1 year ago

The first node which fails most often writes log data to a server which is reached via IPv4 (so the log data can be sent directly via the mobile router, not using the FFRN gateway). That log data shows messages which are mentioned in issue https://github.com/openwrt/mt76/issues/690. According to a dump of their EEPROMs all three ZyXEL routers seem to have a value of 0 for the 5 GHz antenna selection. Therefore I assume that they need the fix which is mentioned in the issue, but which is not part of the v2022.1.x. That branch uses an "older" release of mt76. Not only the issue fix is missing in that release but also lots of other fixes for MT7915E (the chip used for 5 GHz wifi). Some of the missing fixes were for memory leaks and might be required for any router with MT7915E, even if the antenna selection works fine.

Since 2023-03-16 22:06:00 the first node is running with my own patched firmware which includes some fixes for mt76 and up to now did not fail.

stweil commented 1 year ago

Please try a Firmware based on the next branch to asses if the issue was fixed in the driver.

I already tried building of the master branch, but was not successful. I'll try next next. :-)

blocktrron commented 1 year ago

According to a dump of their EEPROMs all three ZyXEL routers seem to have a value of 0 for the 5 GHz antenna selection.

Can you provide a image of the EEPROM? This would help significantly.

stweil commented 1 year ago

Here is a dump of the first part of the factory data used for the EEPROM (I hope this is sufficient as it includes the antenna selection value).

# hexdump -C /dev/mtdblock2
00000000  15 79 00 00 00 0c 43 26  46 48 00 0c 43 26 59 97  |.y....C&FH..C&Y.|
00000010  15 79 c3 14 00 80 02 00  15 79 c3 14 0d 99 01 00  |.y.......y......|
00000020  16 79 c3 14 00 80 02 00  16 79 c3 14 ed 94 01 00  |.y.......y......|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 80 00 00 00 00 00  |................|
00000050  01 00 80 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000070  30 1c 01 00 00 00 07 a9  ca 00 00 00 00 00 00 00  |0...............|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000a0  00 00 00 00 00 00 00 00  c1 db f1 25 d3 d1 e1 57  |...........%...W|
000000b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000190  92 52 06 aa 38 00 32 2b  00 00 00 00 00 00 00 00  |.R..8.2+........|
000001a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001b0  00 00 00 00 00 00 00 00  00 00 12 00 00 00 00 0a  |................|
000001c0  00 09 16 0a 00 09 16 09  00 09 16 00 00 00 00 00  |................|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
blocktrron commented 1 year ago

Please upload the factory partition from mtd. You can create a copy using dd.

stweil commented 1 year ago

See factory.bin.

stweil commented 1 year ago

Please try a Firmware based on the next branch to asses if the issue was fixed in the driver.

As far as I see the master ~and next~ branch still uses an old revision d70546462b7b51ebc2bcdd5c534fdf3465be62a4 for mt76 and does not include the required minimum fix nor any others from the bug fixes which were committed later. So testing master would not be helpful. The next branch uses recent code for mt76, so hopefully will fix wifi related problems. I'll report as soon as the new firmware is running.

stweil commented 1 year ago

Node 2 and node 3 are now running with Gluon next firmware. I had to remove libustream-wolfssl from the list of GLUON_SITE_PACKAGES to avoid a build conflict with libustream-mbedtls (targets/generic also still refers to wolfssl but is not used in my build).

stweil commented 1 year ago

The status page of the updated nodes now looks different. "Kontakt" is shown (it wasn't with the previous firmware), and "Nachbarknoten" no longer displays a realtime diagram for the wireless mesh connections. See the node 1 for a status page with the previous firmware.

The status page also shows a JavaScript error for status-page.js:748 (tdDistance is undefined). The HTML codeis missing the attribute data-interface-wireless for the meshs (correct code would be <div data-interface="mesh0" data-interface-address="72:a2:96:cb:fe:81" data-interface-wireless="true">).

The missing /sys/devices/platform/1e140000.pcie/pci0000:00/0000:00:01.0/0000:02:00.0/net/mesh0/wireless causes the missing HTML attribute data-interface-wireless.

blocktrron commented 1 year ago

mt76 and does not include the required minimum fix

The fix you are referring to does only affect the situation in case the SS configuration is missing from EEPROM but it is present in the file you've linked as well as the hexdump posted earlier (193h --> AAh)

Th breakage on the satus-page can happen. next is not validated for any functionality

stweil commented 1 year ago

You are right, nss is (0xaa & 0x03) = 0x02. Nevertheless I see more than 1000 log entries like the one mentioned in https://github.com/openwrt/mt76/issues/690: Mar 16 12:17:44 Ladenburg-INTAKT-001 kernel: [62188.424399] mt7915e 0000:02:00.0: Message 00005aed (seq 9) timeout (that was the last one which I have seen on node 1 before installing the patched firmware).

stweil commented 1 year ago

Another log message which occurred often on all three nodes is this one:

INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269001] WARNING: CPU: 1 PID: 3501 at backports-5.15.92-1/net/mac80211/mesh_hwmp.c:353 airtime_link_metric_get+0x120/0x684 [mac80211]
[...]
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269825] CPU: 1 PID: 3501 Comm: respondd Tainted: G        W         5.10.168 #0
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269833] Stack : 809a0000 00000580 807c62fc 8007c880 80840000 8072fa80 00000000 00000000
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269895]         8478987c 80980000 806fff68 81de8f00 807c8dc7 00000001 84789820 4564a25f
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.269960]         00000000 00000000 806fff68 847896c0 c0000dc7 00000000 ffffffea 00000000
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270006]         847896cc 00000dc7 807cda90 20202020 80000000 00000001 00000000 80700000
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270049]         00000009 82d80da0 8236f1f0 00000580 00000018 ffffffff 00000004 80980004
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270093]         ...
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270118] Call Trace:
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270158] [<80007fd0>] show_stack+0x30/0x100
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270198] [<80320218>] dump_stack+0x9c/0xcc
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270229] [<8002ff20>] __warn+0xb0/0x11c
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270242] [<8002ffe8>] warn_slowpath_fmt+0x5c/0xac
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270311] [<82f6c498>] airtime_link_metric_get+0x120/0x684 [mac80211]
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270373] [<82f09004>] sta_set_sinfo+0x7e0/0xfec [mac80211]
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270436] [<82f23ff0>] ieee80211_nan_func_match+0x14bc/0x2214 [mac80211]
INFO:127.0.0.1:<4>Mar 21 13:00:39 Ladenburg-INTAKT-001 kernel: [399313.270545] [<82e2dc90>] nl80211_put_sta_rate+0x140c/0x15e0 [cfg80211]

It also occurs on the nodes with Gluon next based firmware (about 10 x per hour).

stweil commented 1 year ago

The breakage on the status page can happen [...]

Should I report that as a separate issue?

blocktrron commented 1 year ago

Please edit your comment instead of posting multiple related ones subsequently.

First of all, the warning you see is a bug in mac80211 in case no airtime metric has yet been acquired from the target. This can be ignored.

The message itself is very ambiguous, it can be insufficient power or misbehaving firmware. mt76 does allow for collecting firmware debug information, have you tried acquiring those?

stweil commented 1 year ago

The message itself is very ambiguous, it can be insufficient power or misbehaving firmware. mt76 does allow for collecting firmware debug information, have you tried acquiring those?

If you are referring to the mt7915e timeout message, that one no longer occurred after I installed firmware with newer mt76 code. But I now wrote 1 to /sys/kernel/debug/ieee80211/phy{0,1}/mt76/fw_debug_wm. That floods my logfile, so I had to disable it again. I can try to enable it after a new failure.

stweil commented 1 year ago

Meanwhile node 1 runs with firmware based on the stable release, but patched with a more recent mt76, since more than 8 days. This is the longest uninterrupted uptime for this node so far since its installation in December 2022. Up to now dmesg only shows these warnings, but none of the critical timeout messages:

WARNING: CPU: 3 PID: 28464 at backports-5.15.92-1/net/mac80211/mesh_hwmp.c:353 airtime_link_metric_get+0x120/0x684 [mac80211] (> 50 times, can be ignored according to the comment above)

br-client: received packet on bat0 with own address as source address (addr:b8:ec:a3:e0:f0:50, vlan:0) (> 100 times)

The nodes 2 and 3 which are less heavily used had reached up to 19 days uptime with an unpatched stable firmware. Both are running fine now with firmware based on gluon next since more than 3 days.

stweil commented 1 year ago

Node 1 failed at 19:53, but restarted by itself an hour later:

INFO:127.0.0.1:<3>Mar 25 19:53:47 Ladenburg-INTAKT-001 kernel: [769695.356542] mt7915e 0000:02:00.0: Message 00005aed (seq 9) timeout
[....]
INFO:127.0.0.1:<29>Mar 25 20:05:43 Ladenburg-INTAKT-001 netifd: wan6 (2577): Command failed: ubus call network.interface notify_proto { "action": 0, "link-up": true, "data": { }, "keep": false, "ip6addr": [ { "ipaddr": "2a01:598:c829:db5c:70a2:96ff:fecb:fe80", "mask": "64", "preferred": 14399, "valid": 86399, "offlink": true } ], "routes6": [ { "target": "::", "netmask": "0", "gateway": "fe80::a40e:6fff:fe23:c8c9", "metric": 640, "valid": 29 }, { "target": "2a01:598:c829:db5c::", "netmask": "64", "metric": 256, "valid": 86399 } ], "dns": [ "fe80::a40e:6fff:fe23:c8c9" ], "interface": "wan6" } (Request timed out)
[...]
INFO:127.0.0.1:<29>Mar 25 20:26:14 Ladenburg-INTAKT-001 netifd: client (2536): Command failed: ubus call network.interface notify_proto { "action": 0, "link-up": true, "data": { }, "keep": false, "ip6addr": [ { "ipaddr": "fdc3:67ce:cc7e:9001:baec:a3ff:fee0:f050", "mask": "64", "preferred": 14394, "valid": 86394, "offlink": true }, { "ipaddr": "2a01:4f8:140:7700:baec:a3ff:fee0:f050", "mask": "64", "preferred": 14394, "valid": 86394, "offlink": true }, { "ipaddr": "2a01:4f8:160:9700:baec:a3ff:fee0:f050", "mask": "64", "preferred": 14388, "valid": 86388, "offlink": true }, { "ipaddr": "2a01:4f8:171:fcff:baec:a3ff:fee0:f050", "mask": "64", "preferred": 14376, "valid": 86376, "offlink": true } ], "routes6": [ { "target": "fdc3:67ce:cc7e:9001::", "netmask": "64", "metric": 256, "valid": 86394 }, { "target": "2a01:4f8:140:7700::", "netmask": "64", "metric": 256, "valid": 86394 }, { "target": "2a01:4f8:160:9700::", "netmask": "64", "metric": 256, "valid": 86388 }, { "target": "2a01:4f8:171:fcff::", "netmask": "64", "metric": 256, "valid": 86376 }, { "
INFO:127.0.0.1:<30>Mar 25 20:57:30 Ladenburg-INTAKT-001 logread[1368]: Logread connected to 127.0.0.1:8888 via tcp
stweil commented 1 year ago

Node 1 now runs since 10 days with a firmware based on the next branch and latest openwrt and mt76.

I noticed that it does not get any owe client connections. iwinfo shows owe0 and owe1 with many unknown entries:

owe0      ESSID: "owe.freifunk-rhein-neckar.de"
          Access Point: 72:A2:96:CB:FE:82
          Mode: Master  Channel: 6 (2.437 GHz)  HT Mode: HE20
          Center Channel 1: 6 2: unknown
          Tx-Power: 20 dBm  Link Quality: unknown/70
          Signal: unknown  Noise: -91 dBm
          Bit Rate: unknown
          Encryption: WPA3 OWE (CCMP)
          Type: nl80211  HW Mode(s): 802.11ax/b/g/n
          Hardware: 14C3:7915 14C3:7915 [MediaTek MT7915E]
          TX power offset: none
          Frequency offset: none
          Supports VAPs: yes  PHY name: phy0

ifconfig shows that the owe interfaces send packages, but don't receive ones:

owe0      Link encap:Ethernet  HWaddr 72:A2:96:CB:FE:82  
          inet6 addr: fe80::70a2:96ff:fecb:fe82/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21752335 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:1567629488 (1.4 GiB)

Update 2023-04-16: All routers (also those with stable firmware) show similar output for iwinfo and ifconfig as long as no client has connected. So that is no indicator for problems with owe connections. Nevertheless it is surprising that not a single successful owe connection happened since the firmware update ten days ago.

Djfe commented 1 year ago

Sounds like nbd found a way to fix the issue (see linked issue) I'm betting it was this one https://github.com/openwrt/openwrt/commit/c8963580aaf6a1207367c809d1350557d6e075e9

Should we consider backporting it to openwrt-22?

stweil commented 1 year ago

Should we consider backporting it to openwrt-22?

Backporting latest mt76 seems to be a good idea. Since 9 days our node Ladenburg-INTAKT-001 runs stable with firmware made from gluon next. The owe problem is gone, owe works again.

blocktrron commented 1 year ago

Current mt76 is not compatible with mac80211 used in OpenWrt 22.03.

What would be more productive is backporting (and validating!) said patch from nbd to 23.05.

stweil commented 1 year ago

The latest unstable firmware still fails with timeout very often. I don't know which patches it includes. @herbetom?

citronalco commented 1 year ago

One of our users also reported a non working WLAN mesh on one of his D-Link DAP-X1860 (A1) with Gluon 2022.1.4 and 2023.1. He has several of those devices. All of them do WLAN mesh properly, except a single one.

Unfortunately I don't have SSH access, so I cannot provide any logs.

stweil commented 1 year ago

Typically the nodes which are heavily used fail more often, sometimes several times per day, while other nodes in the same WLAN mesh with less clients can run several days or even weeks without failures.

Meanwhile I run a cron job which checks ~logread~ the output of dmesg for "timeout" and reboots the node if a timeout was detected. In addition it writes the current time to a logfile before rebooting. That is a workaround which works pretty well.

Djfe commented 1 year ago

actually there is an issue for meshing between ath10k/Qualcomm and mt76/Mediatek (wrong meshing code in Linux in general vs qualcomms own hw acceleration for the wifi frames in question) https://github.com/freifunk-gluon/gluon/pull/2905 v2023.1 works around this. gluon master has the full fix from upstream.

so yeah: recommend ueers who encounter issues with meshing to update to v2023.1 :)

citronalco commented 11 months ago

NWA50AX still loosing mesh and crashing on 2023.1.x