freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io

mediatek-filogic: weird tq on wr3000 - wifi instability after few minutes #3305

Open maurerle opened 1 week ago

maurerle commented 1 week ago

General instability has been seen on mediatek filogic devices with mt7915e, especially on the WR3000, WAX220 and others. Note that some devices work better than others. Heavy wifi mesh traffic seems to make the situation worse.

What is the problem?

An example of this behavior is this device, whose TQ varies widely: https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca06d558&from=1718344052951&to=1718403869219&viewPanel=13

image

The latest finding is this: https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca06d558&from=1720175532350&to=1720193698710&var-select_hostname=ffac-seilpforte-wr3000&var-hostname=ffac-seilpforte-wr3000&var-saveinterval=1m&var-nodetolink=0c0e76cf5d5e&viewPanel=13 image

At 1. I restarted the wifi driver using rmmod mt7915e && modprobe mt7915e.
At 2. I added another mesh device that this device could mesh with on mesh1, which triggered the timeout issue and left the device unable to reload the firmware.
At 3. I restarted the device, as nothing else helped.

Afterward, the fluctuating TQ can be seen again, rising and falling in wave-like patterns.

The current workaround includes reloading the mt7915e driver and rebooting the device once the mt7915e bug from #3154 occurs. A package for this can be found here: https://github.com/ffac/gluon-packages/tree/main/ffac-mt7915-hotfix/files/lib/gluon/mt7915

As @nrbffs also noted on IRC, some other people reported instability with these devices as well. Currently, reloading the wifi driver twice a day seems to help in this situation.
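For illustration, the "reload twice a day" workaround could be sketched roughly as follows. This is not the code from the ffac hotfix package linked above; it only wraps the two commands quoted in this issue. The `DRY_RUN` switch, the function name, the script path and the cron schedule are assumptions made up for this sketch, and any extra steps needed to bring the wifi interfaces back up afterwards are not covered.

```sh
#!/bin/sh
# Sketch only: reload the mt7915e driver with the commands quoted in this issue.
# DRY_RUN is a hypothetical switch added for illustration; when it is set to a
# non-empty value, the commands are printed instead of executed.
reload_mt7915e() {
    run=""
    if [ -n "${DRY_RUN:-}" ]; then
        run="echo"
    fi
    # Unload the driver, then load it again only if unloading succeeded.
    $run rmmod mt7915e && $run modprobe mt7915e
}

# Hypothetical crontab-style entry for "twice a day" (04:00 and 16:00),
# assuming the function is wrapped in a script at the made-up path
# /usr/sbin/reload-mt7915e:
#   0 4,16 * * * /usr/sbin/reload-mt7915e
```

Running `DRY_RUN=1 reload_mt7915e` first shows what would be executed without touching the driver.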

This issue is not about #3154 but about the fluctuating TQ, which leads to bad mesh and wifi quality.

What is the expected behaviour?

Mesh and wifi quality should be stable on mediatek filogic devices such as the WR3000.

Further steps

TX_Stats

I found that on other devices, cat /sys/kernel/debug/ieee80211/phy1/mt76/tx_stats shows values only for 1 to 4, while the affected WR3000 has values for 1 to 8:

Phy 0, Phy band 0
Length:        1 |   2 - 10 |  11 - 19 |  20 - 28 |  29 - 37 |  38 - 46 |  47 - 55 |  56 - 79 |  80 -103 | 104 -127 | 128 -151 | 152 -175 | 176 -199 | 200 -223 | 224 -247 | 
Count:      6743 |     5177 |      604 |      141 |      145 |        0 |        1 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 | 
BA miss count: 7072

Tx Beamformer applied PPDU counts: iBF: 0, eBF: 2461
Tx Beamformer Rx feedback statistics: All: 541, HE: 539, VHT: 2, HT: 0, BW20, NC: 8286, NR: 8589
Tx Beamformee successful feedback frames: 0
Tx Beamformee feedback triggered counts: 0
Tx multi-user Beamforming counts: 0
Tx multi-user MPDU counts: 0
Tx multi-user successful MPDU counts: 0
Tx single-user successful MPDU counts: 482790

Tx MSDU statistics:
AMSDU pack count of 1 MSDU in TXD:   227714 ( 99%)
AMSDU pack count of 2 MSDU in TXD:      180 (  0%)
AMSDU pack count of 3 MSDU in TXD:       86 (  0%)
AMSDU pack count of 4 MSDU in TXD:       71 (  0%)
AMSDU pack count of 5 MSDU in TXD:       39 (  0%)
AMSDU pack count of 6 MSDU in TXD:       31 (  0%)
AMSDU pack count of 7 MSDU in TXD:       20 (  0%)
AMSDU pack count of 8 MSDU in TXD:      129 (  0%)

I do not really know if this is related or not, just a finding.
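To compare this across devices without reading the whole dump, a small helper could report the largest AMSDU pack size with a non-zero count. This is a minimal sketch that assumes the line format shown above ("AMSDU pack count of N MSDU in TXD: COUNT"); the function name `max_amsdu_pack` is made up here. The issue does not say whether other devices omit lines 5 to 8 or report them as zero, so the helper treats both cases the same way.

```sh
#!/bin/sh
# Sketch: read tx_stats text on stdin and print the largest N for which
# "AMSDU pack count of N MSDU in TXD" has a non-zero count.
# Prints 0 if no such line is found.
max_amsdu_pack() {
    awk '
        /AMSDU pack count of [0-9]+ MSDU in TXD:/ {
            n = $5 + 0        # pack size N (5th whitespace-separated field)
            count = $9 + 0    # counter value following "TXD:"
            if (count > 0 && n > max) max = n
        }
        END { print max + 0 }
    '
}

# Usage on a device (path as quoted above):
#   max_amsdu_pack < /sys/kernel/debug/ieee80211/phy1/mt76/tx_stats
```

For the dump above this would report 8, since the count for 8 MSDUs is 129.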

Gluon Version: v2023.2.3

Site Configuration: ffac @ v2023.2.3-2

Custom patches: see site

blocktrron commented 6 days ago

Can you check if the tx retries / tx failed counters from iw dev mesh{0,1} station dump are continuously incrementing?

maurerle commented 6 days ago

They are slightly increasing, but most of the time, they are constant.

tx failed

root@ffac-seilpforte-wr3000:~# iw dev mesh0 station dump | grep "tx failed"
    tx failed:  2228
    tx failed:  47
    tx failed:  2249
    tx failed:  127
    tx failed:  535

# after 10 minutes
root@ffac-seilpforte-wr3000:~# iw dev mesh0 station dump | grep "tx failed"
    tx failed:  2236
    tx failed:  85
    tx failed:  2259
    tx failed:  171
    tx failed:  535

tx retries

root@ffac-seilpforte-wr3000:~# iw dev mesh0 station dump | grep "tx retries"
    tx retries: 2223
    tx retries: 47
    tx retries: 2234
    tx retries: 124
    tx retries: 506

# after 10 minutes
root@ffac-seilpforte-wr3000:~# iw dev mesh0 station dump | grep "tx retries"
    tx retries: 2231
    tx retries: 85
    tx retries: 2243
    tx retries: 168
    tx retries: 506

batctl p towards some mesh partners often does not work either, with packet loss above 90%. Does this help?

maurerle commented 1 day ago

I just tested the MTK patch: https://github.com/freifunk-gluon/gluon/commit/dd114b5fc2dec3f2e7feef52a7238399b39f0a9e from @blocktrron's branch: https://github.com/freifunk-gluon/gluon/compare/main...blocktrron:gluon:mtk-git-txs.patch

It looked good until I reloaded the driver at about 07:20. Then we had the usual airtime and link stability problems until I reloaded the driver again at 08:45. Problems then started again at 09:30.

image

The logread output still does not hint at anything useful. So this issue is waiting for other ideas for now :)