aparcar / openwrt

Staging tree of Paul Spooren

FS#176 - Disappearing radios for ath9k introduced between August 31 and September 4th #324

Closed aparcar closed 7 years ago

aparcar commented 8 years ago

moeller0:

I have encountered an issue similar to that of other ath9k users: after some uptime, both the 2.4GHz and the 5GHz radios seem to disappear from the air. Previously connected clients are disconnected first, and the radios stay AWOL until restarted (using "wifi" on the router's CLI or by re-enabling them in the GUI). I have tested a number of hnyman's builds and pinpointed the issue to have been introduced between August 31st (good) and September 4th (bad). I would not be surprised if the powersave changes introduced between those two dates were involved, but I do not have clear evidence (yet).

Here is the output of a few potentially interesting diagnostic values: first from the working August 31st build, followed by two sections from the September 4th build, first after boot-up with functional radios and then ~two hours later, just after the radios disappeared. I would be happy to help get to the bottom of this issue...

echo "cat /etc/banner"; cat /etc/banner ; echo ""; echo "uptime" ; uptime ; echo ""; echo "iwinfo" ; iwinfo ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset" ; cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset" ; cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani" ; cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani

echo ""; echo "cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani" ; cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani ; echo ""; echo "iw dev wlan0 station dump" ; iw dev wlan0 station dump ; echo ""; echo "iw dev wlan1 station dump" ; iw dev wlan1 station dump
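The two one-liners above could be folded into a small script. This is only a sketch, not what was actually run for this report: the names `run_if_present` and `collect_wifi_diag` and the `DEBUGFS_ROOT` override are made up for illustration, and the debugfs paths and the `iwinfo` / `iw` calls are simply the ones shown above.

```shell
#!/bin/sh
# Sketch: gather the same diagnostics as the one-liners above.
# DEBUGFS_ROOT is a hypothetical override so the function can be pointed at a
# copy of the debugfs tree; it defaults to the path used in this report.
DEBUGFS_ROOT="${DEBUGFS_ROOT:-/sys/kernel/debug/ieee80211}"

# Echo a command and run it, but only if the tool is installed.
run_if_present() {
    command -v "$1" >/dev/null 2>&1 || return 0
    echo "$*"
    "$@" || true
    echo ""
}

collect_wifi_diag() {
    if [ -r /etc/banner ]; then
        echo "cat /etc/banner"; cat /etc/banner; echo ""
    fi
    run_if_present uptime
    run_if_present iwinfo
    # Dump the reset and ani counters of every phy that exposes ath9k debugfs.
    for phy in "$DEBUGFS_ROOT"/phy*; do
        [ -d "$phy" ] || continue
        for f in reset ani; do
            [ -r "$phy/ath9k/$f" ] || continue
            echo "cat $phy/ath9k/$f"; cat "$phy/ath9k/$f"; echo ""
        done
    done
    # Interface names are the ones used in this report.
    for dev in wlan0 wlan1; do
        run_if_present iw dev "$dev" station dump
    done
}
```

Calling `collect_wifi_diag > /tmp/diag-$(date +%s).txt` once after boot and once after the radios vanish would give two snapshots that can be compared afterwards.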

Last tested release without the disappearing radios: WNDR3700v2-lede-r1476-20160831-1058-sqfs-sysupgrade.bin

BusyBox v1.24.2 () built-in shell (ash)

[LEDE ASCII-art banner: lede-project.org, Reboot (HEAD, r1476)]

root@nacktmulle:~# uptime 21:55:34 up 5:23, load average: 0.17, 0.09, 0.01

root@nacktmulle:~# iwinfo wlan0 ESSID: "nacktmulle_2.4GHz" Access Point: A0:21:B7:B9:5C:22 Mode: Master Channel: 11 (2.462 GHz) Tx-Power: 26 dBm Link Quality: 51/70 Signal: -59 dBm Noise: -95 dBm Bit Rate: 65.0 MBit/s Encryption: WPA2 PSK (CCMP) Type: nl80211 HW Mode(s): 802.11bgn Hardware: 168C:0029 168C:A095 [Atheros AR9223] TX power offset: none Frequency offset: none Supports VAPs: yes PHY name: phy0

wlan1 ESSID: "nacktmulle_5GHz" Access Point: A0:21:B7:B9:5C:24 Mode: Master Channel: 44 (5.220 GHz) Tx-Power: 17 dBm Link Quality: 55/70 Signal: -55 dBm Noise: -95 dBm Bit Rate: 113.1 MBit/s Encryption: WPA2 PSK (CCMP) Type: nl80211 HW Mode(s): 802.11an Hardware: 168C:0029 168C:A094 [Atheros AR9220] TX power offset: none Frequency offset: none Supports VAPs: yes PHY name: phy1

root@nacktmulle:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 142
Stuck Beacon: 258
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 91
Rx DMA stop error: 0

root@nacktmulle:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 1112
Stuck Beacon: 0
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 1112
Rx DMA stop error: 0

root@nacktmulle:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani ANI: ENABLED ANI RESET: 469 OFDM LEVEL: 0 CCK LEVEL: 0 SPUR UP: 7135 SPUR DOWN: 7135 OFDM WS-DET ON: 1 OFDM WS-DET OFF: 1 MRC-CCK ON: 0 MRC-CCK OFF: 0 FIR-STEP UP: 7179 FIR-STEP DOWN: 7564 INV LISTENTIME: 0 OFDM ERRORS: 17874314 CCK ERRORS: 366582

root@nacktmulle:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani ANI: ENABLED ANI RESET: 1120 OFDM LEVEL: 0 CCK LEVEL: 0 SPUR UP: 69 SPUR DOWN: 69 OFDM WS-DET ON: 0 OFDM WS-DET OFF: 0 MRC-CCK ON: 0 MRC-CCK OFF: 0 FIR-STEP UP: 69 FIR-STEP DOWN: 734 INV LISTENTIME: 0 OFDM ERRORS: 132975 CCK ERRORS: 0

root@nacktmulle:~# iw dev wlan0 station dump Station a0:02:dc:07:d9:06 (on wlan0) inactive time: 3950 ms rx bytes: 4195114 rx packets: 31910 tx bytes: 86381863 tx packets: 58702 tx retries: 1007 tx failed: 25 signal: -61 [-70, -62] dBm signal avg: -61 [-70, -62] dBm tx bitrate: 65.0 MBit/s MCS 7 rx bitrate: 1.0 MBit/s expected throughput: 32.42Mbps authorized: yes authenticated: yes preamble: short WMM/WME: yes MFP: no TDLS peer: no connected time: 19282 seconds

root@nacktmulle:~# iw dev wlan1 station dump Station 10:68:3f:4b:0b:48 (on wlan1) inactive time: 23020 ms rx bytes: 1351656 rx packets: 10344 tx bytes: 6353661 tx packets: 8128 tx retries: 86 tx failed: 1400 signal: -32 [-32, -40] dBm signal avg: -23 [-23, -35] dBm tx bitrate: 65.0 MBit/s MCS 7 rx bitrate: 6.0 MBit/s expected throughput: 32.42Mbps authorized: yes authenticated: yes preamble: short WMM/WME: yes MFP: no TDLS peer: no connected time: 18474 seconds Station 78:f8:82:9f:3c:47 (on wlan1) inactive time: 28460 ms rx bytes: 23682809 rx packets: 199402 tx bytes: 142204276 tx packets: 138582 tx retries: 10586 tx failed: 553 signal: -60 [-62, -65] dBm signal avg: -63 [-64, -66] dBm tx bitrate: 180.0 MBit/s MCS 12 40MHz short GI rx bitrate: 6.0 MBit/s expected throughput: 50.811Mbps authorized: yes authenticated: yes preamble: short WMM/WME: yes MFP: no TDLS peer: no connected time: 12727 seconds Station a4:d1:8c:5e:12:20 (on wlan1) inactive time: 24660 ms rx bytes: 114469 rx packets: 1036 tx bytes: 20303 tx packets: 81 tx retries: 0 tx failed: 6 signal: -56 [-59, -59] dBm signal avg: -55 [-58, -57] dBm tx bitrate: 65.0 MBit/s MCS 7 rx bitrate: 24.0 MBit/s expected throughput: 58.593Mbps authorized: yes authenticated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no connected time: 2123 seconds Station 64:bc:0c:83:f5:d5 (on wlan1) inactive time: 1470 ms rx bytes: 1478823 rx packets: 14914 tx bytes: 2861270 tx packets: 5818 tx retries: 243 tx failed: 74 signal: -58 [-62, -60] dBm signal avg: -59 [-64, -61] dBm tx bitrate: 162.0 MBit/s MCS 12 40MHz rx bitrate: 6.0 MBit/s expected throughput: 49.163Mbps authorized: yes authenticated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no connected time: 561 seconds

First tested release with disappearing radios: WNDR3700v2-lede-r1497-20160904-1350-sqfs-sysupgrade.bin

While radios are still up:

BusyBox v1.24.2 () built-in shell (ash)

[LEDE ASCII-art banner: lede-project.org, Reboot (HEAD, r1497)]

root@nacktmulle:~# uptime 22:04:18 up 3 min, load average: 0.61, 0.52, 0.21

root@nacktmulle:~# iwinfo wlan0 ESSID: "nacktmulle_2.4GHz" Access Point: A0:21:B7:B9:5C:22 Mode: Master Channel: 11 (2.462 GHz) Tx-Power: 26 dBm Link Quality: 70/70 Signal: -22 dBm Noise: -94 dBm Bit Rate: 65.0 MBit/s Encryption: WPA2 PSK (CCMP) Type: nl80211 HW Mode(s): 802.11bgn Hardware: 168C:0029 168C:A095 [Atheros AR9223] TX power offset: none Frequency offset: none Supports VAPs: yes PHY name: phy0

wlan1 ESSID: "nacktmulle_5GHz" Access Point: A0:21:B7:B9:5C:24 Mode: Master Channel: 44 (5.220 GHz) Tx-Power: 17 dBm Link Quality: 57/70 Signal: -53 dBm Noise: -95 dBm Bit Rate: 285.0 MBit/s Encryption: WPA2 PSK (CCMP) Type: nl80211 HW Mode(s): 802.11an Hardware: 168C:0029 168C:A094 [Atheros AR9220] TX power offset: none Frequency offset: none Supports VAPs: yes PHY name: phy1

root@nacktmulle:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 0
Stuck Beacon: 1
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 0
Rx DMA stop error: 0

root@nacktmulle:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 0
Stuck Beacon: 0
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 0
Rx DMA stop error: 0

root@nacktmulle:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani ANI: ENABLED ANI RESET: 70 OFDM LEVEL: 0 CCK LEVEL: 0 SPUR UP: 20 SPUR DOWN: 20 OFDM WS-DET ON: 0 OFDM WS-DET OFF: 0 MRC-CCK ON: 0 MRC-CCK OFF: 0 FIR-STEP UP: 20 FIR-STEP DOWN: 24 INV LISTENTIME: 0 OFDM ERRORS: 52409 CCK ERRORS: 150

root@nacktmulle:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani ANI: ENABLED ANI RESET: 8 OFDM LEVEL: 0 CCK LEVEL: 0 SPUR UP: 0 SPUR DOWN: 0 OFDM WS-DET ON: 0 OFDM WS-DET OFF: 0 MRC-CCK ON: 0 MRC-CCK OFF: 0 FIR-STEP UP: 0 FIR-STEP DOWN: 2 INV LISTENTIME: 0 OFDM ERRORS: 869 CCK ERRORS: 0

root@nacktmulle:~# iw dev wlan0 station dump Station 10:68:3f:4b:0b:48 (on wlan0) inactive time: 9340 ms rx bytes: 164542 rx packets: 1082 tx bytes: 1157129 tx packets: 1470 tx retries: 22 tx failed: 31 signal: -21 [-27, -22] dBm signal avg: -24 [-30, -25] dBm tx bitrate: 65.0 MBit/s MCS 7 rx bitrate: 6.0 MBit/s expected throughput: 32.42Mbps authorized: yes authenticated: yes preamble: short WMM/WME: yes MFP: no TDLS peer: no connected time: 108 seconds

root@nacktmulle:~# iw dev wlan1 station dump Station a4:d1:8c:5e:12:20 (on wlan1) inactive time: 24510 ms rx bytes: 39626 rx packets: 222 tx bytes: 16338 tx packets: 49 tx retries: 0 tx failed: 2 signal: -55 [-60, -57] dBm signal avg: -54 [-58, -56] dBm tx bitrate: 270.0 MBit/s MCS 14 40MHz short GI rx bitrate: 24.0 MBit/s expected throughput: 57.219Mbps authorized: yes authenticated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no connected time: 100 seconds Station 78:f8:82:9f:3c:47 (on wlan1) inactive time: 8320 ms rx bytes: 66313 rx packets: 552 tx bytes: 182017 tx packets: 393 tx retries: 37 tx failed: 7 signal: -57 [-61, -59] dBm signal avg: -55 [-59, -58] dBm tx bitrate: 300.0 MBit/s MCS 15 40MHz short GI rx bitrate: 6.0 MBit/s expected throughput: 58.593Mbps authorized: yes authenticated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no connected time: 86 seconds root@nacktmulle:~#

After disappearing networks:

root@nacktmulle:~# echo “cat /etc/banner”; cat /etc/banner ; echo ""; echo "uptime" ; uptime ; echo ""; echo "iwinfo" ; iwinfo ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset" ; cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset" ; cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani" ; cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani

“cat /etc/banner”

[LEDE ASCII-art banner: lede-project.org, Reboot (HEAD, r1497)]

uptime 23:52:32 up 1:51, load average: 0.71, 0.76, 0.54

iwinfo wlan0 ESSID: "nacktmulle_2.4GHz" Access Point: A0:21:B7:B9:5C:22 Mode: Master Channel: 11 (2.462 GHz) Tx-Power: 26 dBm Link Quality: unknown/70 Signal: unknown Noise: -95 dBm Bit Rate: unknown Encryption: WPA2 PSK (CCMP) Type: nl80211 HW Mode(s): 802.11bgn Hardware: 168C:0029 168C:A095 [Atheros AR9223] TX power offset: none Frequency offset: none Supports VAPs: yes PHY name: phy0

wlan1 ESSID: "nacktmulle_5GHz" Access Point: A0:21:B7:B9:5C:24 Mode: Master Channel: 44 (5.220 GHz) Tx-Power: 17 dBm Link Quality: unknown/70 Signal: unknown Noise: -95 dBm Bit Rate: unknown Encryption: WPA2 PSK (CCMP) Type: nl80211 HW Mode(s): 802.11an Hardware: 168C:0029 168C:A094 [Atheros AR9220] TX power offset: none Frequency offset: none Supports VAPs: yes PHY name: phy1

cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 1
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 18
Stuck Beacon: 1701
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 224
Rx DMA stop error: 0

cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 39
Stuck Beacon: 3026
MCI Reset: 0
Calibration error: 1
Tx DMA stop error: 51
Rx DMA stop error: 1

cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani ANI: ENABLED ANI RESET: 1789 OFDM LEVEL: 3 CCK LEVEL: 2 SPUR UP: 2150 SPUR DOWN: 2150 OFDM WS-DET ON: 0 OFDM WS-DET OFF: 0 MRC-CCK ON: 0 MRC-CCK OFF: 0 FIR-STEP UP: 2144 FIR-STEP DOWN: 2244 INV LISTENTIME: 0 OFDM ERRORS: 5399567 CCK ERRORS: 28670 root@nacktmulle:~#

cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani ANI: ENABLED ANI RESET: 3360 OFDM LEVEL: 3 CCK LEVEL: 2 SPUR UP: 5 SPUR DOWN: 5 OFDM WS-DET ON: 0 OFDM WS-DET OFF: 0 MRC-CCK ON: 0 MRC-CCK OFF: 0 FIR-STEP UP: 5 FIR-STEP DOWN: 23 INV LISTENTIME: 0 OFDM ERRORS: 21751 CCK ERRORS: 0

iw dev wlan0 station dump

iw dev wlan1 station dump
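Comparing these dumps by eye is tedious, so a small helper for pulling one counter out of a saved `reset` dump may be useful. The following is a sketch under assumptions: it expects one `Name: value` pair per line, as the debugfs file prints it (the rendering on this page has lost those line breaks), and the function names `reset_counter` and `reset_counter_delta` are hypothetical.

```shell
#!/bin/sh
# Print the value of one counter from a saved ath9k "reset" dump.
# Assumes one "Name: value" pair per line.
# $1 = counter name (e.g. "Stuck Beacon"), $2 = path to the saved dump
reset_counter() {
    awk -F: -v name="$1" '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the name
        }
        key == name {
            val = $2
            gsub(/[ \t]/, "", val)             # strip padding around the number
            print val
            exit
        }
    ' "$2"
}

# Print how much a counter grew between two saved dumps.
# $1 = counter name, $2 = earlier dump, $3 = later dump
reset_counter_delta() {
    before=$(reset_counter "$1" "$2")
    after=$(reset_counter "$1" "$3")
    echo $(( ${after:-0} - ${before:-0} ))
}
```

With the two r1497 phy0 dumps above saved to files, `reset_counter_delta "Stuck Beacon" up.txt gone.txt` would report the growth from 1 to 1701, that is 1700 stuck-beacon resets in under two hours.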

aparcar commented 8 years ago

moeller0:

Forgot to mention, device is a netgear wndr 3700v2 with Atheros AR9280 Rev:2 radios.

Definitions:
working: the radios stay visible and useable for around 24 hours
affected: the radios disappear and do not reappear until the wifi command is issued at the router's CLI.

Testing: Move the nexus5X into an area of relatively bad reception, disable mobile data, and run the netalyzr and wifi analyzer apps to generate some traffic and see the BSSIDs go away. An affected build will typically lose the radios in less than 5 minutes with this test. After issuing wifi, the test is repeated. If the test does not make the radios go away, the build is further tested with normal usage (which in an affected build will also make the radios disappear within a few minutes to a few hours).

Tested versions (I will try to update this list):

Working: radios stay enabled and useable:

lede-r1212-20160801
lede-r1297-20160812
lede-r1398-20160822
lede-r1442-20160825-8859102
lede-r1462-20160829
lede-r1476-20160831
lede-r1481-20160902-1e72d1b-mac80211_add_a_powersave_handling_fix (no failure in 24h)
lede-r1481-20160902-1e72d1b-mac80211_add_a_powersave_handling_fix (no failure in 48 hours)

"Affected": radios disappear (seemingly partially triggered by moving into an area with bad reception)

lede-r1482-20160902-372d0fe-ath9k_add_a_bunch_of_powersave_handling (no failures for 12 hours, 4 failures in 7 hours)
lede-r1483-20160902-a894a53-mac80211_add_fixes_for_dealing_with_unex (failures in 5 minutes)
lede-r1491-20160902-dbc9ee5-ath9k_fix_regression_in_tx_queueing_patc
lede-r1497-20160904
lede-r1512-20160905
lede-r1535-20160908
lede-r1600-20160915

These tests pretty much indicate commit https://github.com/lede-project/source/commit/372d0fea29e60b02154fd7176ba32e7742f6640e as having introduced the regression. The next question is which of the 5 patches constituting that commit is the culprit. To find out, I will test the following r1630 builds:
lede-r1630-20160919-no339-343
lede-r1630-20160919-with339_no340-343
lede-r1630-20160919-with339-340_no341-343
lede-r1630-20160919-with339-341_no342-343
lede-r1630-20160919-with339-342_no343

Working: radios stay enabled and useable:

lede-r1630-20160919-no339-343 (uptime 3 days)
lede-r1630-20160919-with339-340_no341-343 (no loss event in 3 days, 15:39 hours:minutes)

Testing

lede-r1708-20160928-with-ath9k-remove-patch-causing-stability-issues-with-power

"Affected": radios disappear (seemingly partially triggered by moving into an area with bad reception)

lede-r1630-20160919-with339-341_no342-343 (2 loss events of both radios in 6 minutes uptime)
lede-r1630-20160919-with339-342_no343 (2 loss events of both radios in 18 minutes uptime)

aparcar commented 8 years ago

glycoknob:

@moeller0: Can you also trigger the issue when you download data with 2 wifi clients at the same time? On my 1043ndv1 this reliably triggers the disappearing AP. I'm also quite sure the issue exists longer than August 31. There are similar tickets here:

https://bugs.lede-project.org/index.php?do=details&task_id=34 https://bugs.lede-project.org/index.php?do=details&task_id=13 https://bugs.lede-project.org/index.php?do=details&task_id=166

aparcar commented 8 years ago

nbd:

Please try the latest version

aparcar commented 8 years ago

moeller0:

@Martin Tippmann, I have not tried that; in affected builds I typically lose the BSSIDs within a few minutes to a few hours of simple usage. The radios suddenly go off air, so on nexus5X devices the state goes instantaneously from connected to disconnected, with no BSSID visible anymore. Timing-wise I suspect the powersave fixes that went in around early September to be involved, but I have not yet tested that (E_OUT_OF_TIME yesterday).

@Felix: I will try r1600; will that be sufficiently recent?

aparcar commented 8 years ago

nbd:

Yes, r1600 is sufficiently recent.

aparcar commented 8 years ago

moeller0:

Okay, I tested r1600 and it is still broken: I lost both radios by simply moving my nexus5X into the next room.

The diagnostic messages are slightly different but the result is the same: the radios disappeared, and issuing "wifi" on the router's CLI will bring them back temporarily.

I assume that going back to figuring out which commit introduced that behavior is the best course of action right now.

root@nacktmulle:~# echo “cat /etc/banner”; cat /etc/banner ; echo ""; echo "uptime" ; uptime ; echo ""; echo "iwinfo" ; iwinfo ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset" ; cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset" ; cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset ; echo ""; echo "cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani" ; cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani

“cat /etc/banner”

[LEDE ASCII-art banner: lede-project.org, Reboot (HEAD, r1600)]

uptime 21:02:20 up 5 min, load average: 0.35, 0.52, 0.27

iwinfo wlan0 ESSID: "nacktmulle_2.4GHz" Access Point: A0:21:B7:B9:5C:22 Mode: Master Channel: 11 (2.462 GHz) Tx-Power: 26 dBm Link Quality: 49/70 Signal: -61 dBm Noise: -95 dBm Bit Rate: 130.0 MBit/s Encryption: WPA2 PSK (CCMP) Type: nl80211 HW Mode(s): 802.11bgn Hardware: 168C:0029 168C:A095 [Atheros AR9223] TX power offset: none Frequency offset: none Supports VAPs: yes PHY name: phy0

wlan1 ESSID: "nacktmulle_5GHz" Access Point: A0:21:B7:B9:5C:24 Mode: Master Channel: 44 (5.220 GHz) Tx-Power: 17 dBm Link Quality: 62/70 Signal: -48 dBm Noise: -95 dBm Bit Rate: 173.7 MBit/s Encryption: WPA2 PSK (CCMP) Type: nl80211 HW Mode(s): 802.11an Hardware: 168C:0029 168C:A094 [Atheros AR9220] TX power offset: none Frequency offset: none Supports VAPs: yes PHY name: phy1

cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 1
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 9
Stuck Beacon: 457
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 10
Rx DMA stop error: 0

cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 1
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 0
Stuck Beacon: 619
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 16
Rx DMA stop error: 0

cat /sys/kernel/debug/ieee80211/phy0/ath9k/ani ANI: ENABLED ANI RESET: 537 OFDM LEVEL: 3 CCK LEVEL: 2 SPUR UP: 47 SPUR DOWN: 47 OFDM WS-DET ON: 0 OFDM WS-DET OFF: 0 MRC-CCK ON: 0 MRC-CCK OFF: 0 FIR-STEP UP: 47 FIR-STEP DOWN: 49 INV LISTENTIME: 0 OFDM ERRORS: 87331 CCK ERRORS: 1773

root@nacktmulle:~# echo ""; echo "cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani" ; cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani ; echo ""; echo "iw dev wlan0 station dump" ; iw dev wlan0 station dump ; echo ""; echo "iw dev wlan1 station dump" ; iw dev wlan1 station dump

cat /sys/kernel/debug/ieee80211/phy1/ath9k/ani ANI: ENABLED ANI RESET: 655 OFDM LEVEL: 3 CCK LEVEL: 2 SPUR UP: 2 SPUR DOWN: 2 OFDM WS-DET ON: 0 OFDM WS-DET OFF: 0 MRC-CCK ON: 0 MRC-CCK OFF: 0 FIR-STEP UP: 2 FIR-STEP DOWN: 4 INV LISTENTIME: 0 OFDM ERRORS: 2567 CCK ERRORS: 0

iw dev wlan0 station dump Station 78:f8:82:9f:3c:47 (on wlan0) inactive time: 101230 ms rx bytes: 50212 rx packets: 478 tx bytes: 170661 tx packets: 345 tx retries: 10 tx failed: 0 signal: -61 [-62, -68] dBm signal avg: -58 [-60, -63] dBm tx bitrate: 130.0 MBit/s MCS 15 rx bitrate: 1.0 MBit/s expected throughput: 45.226Mbps authorized: yes authenticated: yes preamble: short WMM/WME: yes MFP: no TDLS peer: no connected time: 109 seconds

iw dev wlan1 station dump Station 10:68:3f:4b:0b:48 (on wlan1) inactive time: 143990 ms rx bytes: 109126 rx packets: 716 tx bytes: 334186 tx packets: 585 tx retries: 24 tx failed: 14 signal: -12 [-12, -42] dBm signal avg: -35 [-35, -47] dBm tx bitrate: 65.0 MBit/s MCS 7 rx bitrate: 6.0 MBit/s expected throughput: 32.42Mbps authorized: yes authenticated: yes preamble: short WMM/WME: yes MFP: no TDLS peer: no connected time: 238 seconds Station 78:f8:82:9f:3c:47 (on wlan1) inactive time: 132890 ms rx bytes: 341146 rx packets: 3023 tx bytes: 499559 tx packets: 1756 tx retries: 261 tx failed: 26 signal: -74 [-84, -75] dBm signal avg: -75 [-83, -77] dBm tx bitrate: 90.0 MBit/s MCS 4 40MHz short GI rx bitrate: 6.0 MBit/s expected throughput: 38.268Mbps authorized: yes authenticated: yes preamble: short WMM/WME: yes MFP: no TDLS peer: no connected time: 220 seconds Station a4:d1:8c:5e:12:20 (on wlan1) inactive time: 133200 ms rx bytes: 39578 rx packets: 223 tx bytes: 15300 tx packets: 46 tx retries: 1 tx failed: 2 signal: -54 [-62, -54] dBm signal avg: -54 [-62, -54] dBm tx bitrate: 270.0 MBit/s MCS 15 40MHz rx bitrate: 24.0 MBit/s expected throughput: 57.219Mbps authorized: yes authenticated: yes preamble: long WMM/WME: yes MFP: no TDLS peer: no connected time: 209 seconds root@nacktmulle:~#

aparcar commented 8 years ago

moeller0:

Currently testing: lede-r1481-20160902-1e72d1b-mac80211_add_a_powersave_handling_fix

Okay, this does survive the "move nexus5x to bad reception area" test repeatedly, so I will test this for ~24 hours before declaring it "working"

aparcar commented 8 years ago

moeller0:

Update for BSSID loss on netgear wndr3700v2 with Atheros AR9280 Rev:2 radios. (The changes are also included as updates to comment #1, but to re-trigger the email function, here is a new comment.)

Definitions:
working: the radios stay visible and useable for at least around 24 hours
affected: the radios disappear and do not reappear until the wifi command is issued at the router's CLI.

Testing: Move the nexus5X into an area of relatively bad reception, disable mobile data, and run the netalyzr and wifi analyzer apps to generate some traffic and see the BSSIDs go away. An affected build will typically lose the radios in less than 5 minutes with this test. After issuing wifi, the test is repeated. If the test does not make the radios go away, the build is further tested with normal usage (which in an affected build will also make the radios disappear within a few minutes to a few hours).

Tested versions (I will try to update this list):

Working: radios stay enabled and useable:

lede-r1212-20160801
lede-r1297-20160812
lede-r1398-20160822
lede-r1442-20160825-8859102
lede-r1462-20160829
lede-r1476-20160831
lede-r1481-20160902-1e72d1b-mac80211_add_a_powersave_handling_fix (no failure in 24h)

Currently testing:

lede-r1481-20160902-1e72d1b-mac80211_add_a_powersave_handling_fix (long term test)

"Affected": radios disappear (seemingly partially triggered by moving into an area with bad reception)

lede-r1482-20160902-372d0fe-ath9k_add_a_bunch_of_powersave_handling (no failures for 12 hours, 4 failures in 7 hours)
lede-r1483-20160902-a894a53-mac80211_add_fixes_for_dealing_with_unex (failures in 5 minutes)
lede-r1491-20160902-dbc9ee5-ath9k_fix_regression_in_tx_queueing_patc
lede-r1497-20160904
lede-r1512-20160905
lede-r1535-20160908
lede-r1600-20160915

Hypothesis: r1482 somehow introduced a new bug, or exposed an old and hidden one, in the wndr3700v2's radios; r1483 might have introduced an easier way to trigger the offending conditions, but that is conjecture...

aparcar commented 8 years ago

nbd:

Please try a current build with patches 340-343 deleted from package/kernel/mac80211/patches. If that works well, please re-add them one by one and let me know which one is introducing the regression. Thanks for your work in tracking this down!
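One way to carry out this request without deleting anything for good is to move the patches aside and restore them one at a time. The helpers below are a hypothetical sketch: the function names and the stash directory are made up, and the code assumes the patch files are named `<number>-<description>.patch`, which should be checked against the actual directory contents before use.

```shell
#!/bin/sh
# Move patches whose numeric prefix lies in [first, last] out of a patch
# directory so that the next build skips them.
# $1 = patch directory, $2 = stash directory, $3 = first number, $4 = last number
# Assumes files are named "<number>-<description>.patch".
stash_patches() {
    dir=$1; stash=$2; n=$3; last=$4
    mkdir -p "$stash"
    while [ "$n" -le "$last" ]; do
        for p in "$dir/$n"-*.patch; do
            [ -e "$p" ] || continue      # no file with this prefix
            mv "$p" "$stash/"
        done
        n=$((n + 1))
    done
}

# Put one stashed patch back, for re-adding the patches one by one.
# $1 = patch directory, $2 = stash directory, $3 = patch number
restore_patch() {
    for p in "$2/$3"-*.patch; do
        [ -e "$p" ] || continue
        mv "$p" "$1/"
    done
}
```

For example, `stash_patches package/kernel/mac80211/patches /tmp/patch-stash 340 343` followed by a rebuild gives the baseline, and `restore_patch package/kernel/mac80211/patches /tmp/patch-stash 340` re-adds the first patch for the next test build.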

aparcar commented 8 years ago

hnyman:

Well, looks like I need to compile new debug builds for moeller0...

@nbd Any guess about the possible culprit patch in 340-343? Which one to try first?

aparcar commented 8 years ago

nbd:

Let's try a build with all of them removed first, to check if we have remaining regressions in the other patches. Actually, I just noticed that 339 needs to be considered as well, since it was part of the commit that introduced the regression in the first place. I guess I would consider 341 the most likely candidate for causing this regression.

aparcar commented 8 years ago

moeller0:

@Hannu: Thanks for your help! (I will also try to set up your build environment on a local computer so that I do not need to bother you constantly.) @Felix: I will test the r1630 build without 339 to 343 that Hannu prepared once I am back at the router later today.

aparcar commented 8 years ago

hnyman:

@moeller0 I have so far built
lede-r1630-20160919-no339-343
lede-r1630-20160919-with339_no340-343
lede-r1630-20160919-with339-340_no341-343
lede-r1630-20160919-with339-341_no342-343
lede-r1630-20160919-with339-342_no343

They are found in the normal place at dropbox: https://www.dropbox.com/sh/t52c02rm20y8x9p/AADVAy3PjDxTN1U4TVMTrYrqa/lede-wifi-debug?dl=0

I have included also the kmod-mac80211*.ipk package separately for each incremental build, so you might try installing just that package with opkg and rebooting. (of course, only after first flashing the base build of no339-343).

If you try setting up a build environment similar to mine, use the advice from https://forum.openwrt.org/viewtopic.php?pid=127011#p127011 and apply the newBuildroot.sh script. It takes me only about 5 minutes to transfer my whole environment with all the patches etc. to a new Ubuntu x64 instance in Virtualbox.

aparcar commented 8 years ago

nbd:

That doesn't work. kmod-mac80211 is not what changes by messing with these patches. You need kmod-ath9k here.

aparcar commented 8 years ago

hnyman:

oops. then that doesn't work for the builds that I already uploaded. My bad.

aparcar commented 8 years ago

moeller0:

@Hannu: Thanks for all the support; I am currently testing lede-r1630-20160919-no339-343 and in 12 hours failed to make it lose its radios (even though I tried; interestingly, looking at the reset statistics, either that build is better than <= r1481 or last night there was less radio pollution around). But I will continue the test, probably until tomorrow, so ~36 hours, just to confirm that it really is a working build. Then I plan to process your list from the end (as failures typically manifest more quickly in affected builds than it takes to be reasonably convinced that a build truly is "good"). I will also set up the build environment; what version of Ubuntu are you using? I have a virtual machine with Ubuntu 16.04 LTS x86 with 28GB free disk space; will that suffice or do I need more space?

aparcar commented 8 years ago

hnyman:

I have Ubuntu 16.04.1 x64. Set the disk space of the virtual drive to get allocated dynamically so you can set the max size to something like 50 GB.

aparcar commented 8 years ago

moeller0:

Quick update: lede-r1630-20160919-no339-343 survived for 36 hours without losing its radios, so I assume it to be working. I will only be able to re-flash late tonight, so it will get a bit more testing exposure before I try the other builds. Oh, I have not managed to set up the VM yet, E_TOO_MUCHWORK@_WORK ;)

Best Regards

aparcar commented 8 years ago

Bluse:

Hi all,

Within our AP zoo I found a TPLink TL-WR941nd showing the same symptoms: the AP disappears after some minutes.

lede version: snapshot image (HEAD, r1619) - so all power save related patches are included

My observations I want to add are:

while true; do cat /sys/kernel/debug/ieee80211/phy0/total_ps_buffered; date; sleep 8; done

Sat Sep 17 22:44:03 UTC 2016 0
Sat Sep 17 22:44:11 UTC 2016 0
Sat Sep 17 22:44:19 UTC 2016 1
Sat Sep 17 22:44:27 UTC 2016 2
Sat Sep 17 22:44:37 UTC 2016 3
Sat Sep 17 22:44:45 UTC 2016 6
Sat Sep 17 22:44:53 UTC 2016 8
Sat Sep 17 22:45:01 UTC 2016 11
Sat Sep 17 22:45:09 UTC 2016 12
Sat Sep 17 22:45:17 UTC 2016 15
...
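In the samples above the `total_ps_buffered` value only ever goes up, from 0 to 15 in a little over a minute. A check like the following could flag that pattern automatically; it is a sketch only, the function name `ps_buffered_trend` and the two labels it prints are made up here, and it says nothing about why the counter grows.

```shell
#!/bin/sh
# Read one counter sample per line on stdin and classify the trend:
#   "growing" - the value never decreases and ends higher than it started
#   "stable"  - anything else (including fewer than two samples)
ps_buffered_trend() {
    awk '
        { v = $1 + 0 }                       # force numeric interpretation
        NR == 1 { first = v; prev = v; mono = 1; next }
        { if (v < prev) mono = 0; prev = v }
        END {
            if (NR >= 2 && mono && prev > first) print "growing"
            else print "stable"
        }
    '
}
```

Feeding it the ten values from the log above (`printf '0\n0\n1\n2\n3\n6\n8\n11\n12\n15\n' | ps_buffered_trend`) prints `growing`.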

Greetings from Berlin Bluse

aparcar commented 8 years ago

moeller0:

So here is the result of testing r1630 without patches 339-343 for ~3 days:

root@router:~# uptime
20:47:55 up 2 days, 23:59, load average: 0.00, 0.04, 0.01

root@router:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 1321
Stuck Beacon: 4564
MCI Reset: 0
Calibration error: 2
Tx DMA stop error: 501
Rx DMA stop error: 0

root@router:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 1676
Stuck Beacon: 0
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 1676
Rx DMA stop error: 0

As far as I can tell this is quite stable. Now onto testing with incrementally including those patches individually.

aparcar commented 8 years ago

moeller0:

The tests so far pretty much indicate commit https://github.com/lede-project/source/commit/372d0fea29e60b02154fd7176ba32e7742f6640e as having introduced the regression.

The next question is which of the 5 patches constituting that commit is the culprit? To test this, I will test the following r1630 builds:
lede-r1630-20160919-no339-343
lede-r1630-20160919-with339_no340-343
lede-r1630-20160919-with339-340_no341-343
lede-r1630-20160919-with339-341_no342-343
lede-r1630-20160919-with339-342_no343

Working: radios stay enabled and useable:

lede-r1630-20160919-no339-343 (uptime 3 days)
lede-r1630-20160919-with339-340_no341-343 (no loss event in 3 days, 15:39 hours:minutes)

Testing

lede-r1708-20160928-with-ath9k-remove-patch-causing-stability-issues-with-power

"Affected": radios disappear (seemingly partially triggered by moving into an area with bad reception)

lede-r1630-20160919-with339-341_no342-343 (2 loss events of both radios in 6 minutes uptime)
lede-r1630-20160919-with339-342_no343 (2 loss events of both radios in 18 minutes uptime)
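Because each test build applies the patches cumulatively, the results above can be read as an ordered series in which the first "bad" entry names the suspect. The sketch below makes that reading explicit; the input format and the function name `first_bad_patch` are invented for illustration, and `none` stands for the build with patches 339-343 all removed.

```shell
#!/bin/sh
# Input: one line per tested build, "<highest patch applied> <good|bad>",
# listed in ascending patch order. Prints the first patch whose build was
# bad, together with the last patch level that was still good.
first_bad_patch() {
    awk '
        $2 == "good" { last_good = $1 }
        $2 == "bad"  { print "first bad: " $1 ", last good: " last_good; exit }
    '
}
```

With the verdicts reported here (`none good`, `340 good`, `341 bad`, `342 bad`) the function prints `first bad: 341, last good: 340`. That only localizes the regression to the step that adds patch 341; it does not show whether 341 is itself buggy.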

aparcar commented 8 years ago

hnyman:

Looks like Felix's guess above, that 341 is the likely culprit, may have been spot on.

(Ps. I tried to make build just without 341, but as 343 touches the same code in xmit.c around line 1690, I was not sure what the end-result should be...)

aparcar commented 8 years ago

moeller0:

Hi Hannu,

yes, it looks like 341 is special, but I will continue testing the build without 341-343, as occasionally it took a while for the issue to manifest. And 341 might not actually contain a bug per se, but might simply make it (massively) easier to trigger an existing bug (soft- or hardware). Will report back after the weekend.

aparcar commented 8 years ago

joaochainho:

Hi,

I don't know if it's helpful, but I'm having similar problems with a WNDR3800 running recent builds (r1491, r1516, r1616). I have to restart the interface to get it working again.

ath: phy1: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42100020 DMADBG_7=0x000062c0

aparcar commented 8 years ago

moeller0:

Update: I tested

lede-r1630-20160919-with339-340_no341-343 (no loss event 3 days, 15:39 hours:minutes)

This seems to indicate that at least the old radio chips in the WNDR3700v2 do not like patch 341 too well. Anything else I should test?

aparcar commented 8 years ago

moeller0:

@https://bugs.lede-project.org/index.php?do=user&area=users&id=165 I get the same error messages when the radios disappear, which is interesting, as I believe the output of these errors was supposed to be suppressed; they are only supposed to increment one of the reset counters (see cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset as an example):

root@router:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 597
Stuck Beacon: 0
MCI Reset: 0
Calibration error: 2
Tx DMA stop error: 599
Rx DMA stop error: 0

The DMA failed errors should only increase the "Tx DMA stop error" counter and not appear in the log. Interestingly after the loss of the radios I typically get "Fatal HW Error" > 0, do you see the same?
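To check whether the messages really do reach the log, the number of matching kernel log lines can be counted and compared by hand with the "Tx DMA stop error" counter. A minimal sketch, assuming the message text matches the line quoted earlier in this thread; `count_dma_log_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# Count "DMA failed to stop" messages for one phy in kernel log text on stdin.
# count_dma_log_lines is a hypothetical helper name used only in this sketch.
count_dma_log_lines() {
    # $1: phy name as it appears in the log, e.g. "phy1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c "ath: $1: DMA failed to stop" || true
}

# Usage on the router:
#   dmesg | count_dma_log_lines phy1
```

If the log count stays far below the counter value, most DMA stop errors are being counted silently and only some of them are printed.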

aparcar commented 8 years ago

joaochainho:

Hi @moeller0, I don't remember having Fatal HW Error > 0, but I get lots of stuck beacons.

MAC Hang: 46
Stuck Beacon: 749162
Tx DMA stop error: 170
Rx DMA stop error: 3

Sometimes the interface hangs up silently (no dmesg output). I might be wrong, but I think this issue started around the time the mac80211 intermediate software queues were introduced.

aparcar commented 8 years ago

nbd:

Please try the latest version from my staging tree.

aparcar commented 8 years ago

hnyman:

@moeller0

I compiled a test build: lede-r1708-20160928-with-ath9k-remove-patch-causing-stability-issues-with-power

It is normal up-to-date LEDE r1708 with the patch from Felix (that removes 341 and modifies the others as needed).

aparcar commented 8 years ago

moeller0:

@hnyman

I will go and test that build starting tonight and will report as usual. lede-r1708-20160928-with-ath9k-remove-patch-causing-stability-issues-with-power

Question: I will use your r1698 build as a positive control, so I assume the failure to be present in r1698 but not in Felix's r1708?

Many thanks.

aparcar commented 8 years ago

moeller0:

Okay, lede-r1708-20160928-with-ath9k-remove-patch-causing-stability-issues-with-power seems to be working, at least I did not manage to trigger a failure in 34 hours....

root@router:~# cat /etc/banner
[LEDE ASCII-art banner, lede-project.org]
Reboot (HEAD, r1708)

root@router:~# uptime
10:02:05 up 1 day, 10:51, load average: 0.00, 0.01, 0.00

root@router:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 445
Stuck Beacon: 1551
MCI Reset: 0
Calibration error: 1
Tx DMA stop error: 167
Rx DMA stop error: 0

root@router:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 808
Stuck Beacon: 0
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 808
Rx DMA stop error: 0

aparcar commented 8 years ago

nbd:

Pushed to master in r1725

Thanks for testing!

aparcar commented 8 years ago

moeller0:

Gentlemen,

thanks to both of you; you did the heavy lifting, all I did was some "leg-work".

aparcar commented 8 years ago

fbettag:

Hey guys,

I've been running r1712 from nbd's staging tree and everything was fine until about 10 minutes ago:

root@wifi-food:~# uptime
18:26:06 up 2 days, 5:02, load average: 0.00, 0.00, 0.00

root@wifi-food:~# dmesg
...
[159489.201588] br-lan: port 3(wlan1) neighbor 7fff.00:0d:b9:1f:87:cc lost
[159489.208291] br-lan: topology change detected, propagating
[185348.587359] ath: phy1: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42100020 DMADBG_7=0x000062c0
[185365.647171] br-lan: port 3(wlan1) neighbor 7fff.00:0d:b9:1f:87:cc lost
...

root@wifi-food:~# cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 0
PLL RX Hang: 0
MAC Hang: 91
Stuck Beacon: 1
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 0
Rx DMA stop error: 0

root@wifi-food:~# cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 1
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 2037
PLL RX Hang: 0
MAC Hang: 3
Stuck Beacon: 983
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 3023
Rx DMA stop error: 1
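Since the counters are cumulative, a single dump does not show which ones moved when the radio dropped out. A minimal sketch for comparing two saved copies of the `reset` file follows; it assumes one `Name: value` pair per line, and `diff_reset_counters` and the snapshot file names are hypothetical.

```shell
#!/bin/sh
# Print every counter whose value differs between two saved snapshots of the
# ath9k "reset" debugfs file (assumed: one "Name: value" pair per line).
# diff_reset_counters is a hypothetical helper name used only in this sketch.
diff_reset_counters() {
    # $1: older snapshot file, $2: newer snapshot file
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }              # tolerate leading indentation
        NR == FNR { old[$1] = $2; next }        # first file: remember values
        ($1 in old) && old[$1] != $2 {          # second file: report changes
            printf "%s: %s -> %s\n", $1, old[$1], $2
        }
    ' "$1" "$2"
}

# Usage on the router (snapshot file names are placeholders):
#   cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset > /tmp/reset.before
#   ... wait for a loss event ...
#   cat /sys/kernel/debug/ieee80211/phy1/ath9k/reset > /tmp/reset.after
#   diff_reset_counters /tmp/reset.before /tmp/reset.after
```

Taking the first snapshot shortly after boot and the second right after a loss event narrows down which reset reasons coincide with the failure.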

aparcar commented 8 years ago

fbettag:

I also want to mention that it has improved a lot! Before, this was an hourly or even 10-minute occurrence!

aparcar commented 7 years ago

hnyman:

Why was this reopened by stintel? There is no explanation about that and there has been no discussion since September...

I thought that the specific issue was solved already.

aparcar commented 7 years ago

nbd:

Please test the latest version