2.4GHz WLAN dies after some time

jplitza commented 10 years ago

In Bremen we have multiple reports of the (2.4 GHz) WLAN dying after some time (6 days, give or take a few).

This concerns both meshing and clients. I tried to debug the problem with clients: tcpdump -i br-client on the node shows the DHCP requests and replies, while tcpdump -i wlan0 only shows the requests, not the replies (and they never reach the client). ebtables are empty, when inserting a logging rule into each possible chain the output was

[1235303.240000] eb_brt_brt IN=bat0 OUT= MAC source = c6:b6:a5:52:3a:9b MAC dest = 98:0c:82:c2:ad:0a proto = 0x0800
[1235303.250000] eb_nat_pre IN=bat0 OUT= MAC source = c6:b6:a5:52:3a:9b MAC dest = 98:0c:82:c2:ad:0a proto = 0x0800
[1235303.260000] eb_flt_fwd IN=bat0 OUT=wlan0 MAC source = c6:b6:a5:52:3a:9b MAC dest = 98:0c:82:c2:ad:0a proto = 0x0800
[1235303.270000] eb_nat_pst IN= OUT=wlan0 MAC source = c6:b6:a5:52:3a:9b MAC dest = 98:0c:82:c2:ad:0a proto = 0x0800
[1235303.280000] eb_flt_inp IN=bat0 OUT= MAC source = c6:b6:a5:52:3a:9b MAC dest = 98:0c:82:c2:ad:0a proto = 0x0800

As you can see, the bridge found the correct OUT interface (wlan0) but the packet never goes out that interface.

To make things even more confusing, when using 5 GHz (wlan1), everything worked fine. The ebtables log was the same.

I didn't debug the meshing interface wlan0-1 any further, but the meshing with my other (2.4 GHz-only) node was broken at that time as well.

Executing wifi on the failing node fixed both the client connection and the meshing connection.

Other people reported similar issues. They didn't come up during the nightly phase, presumably because the nodes were rebooting regularly (nearly each night after an update).

Any ideas on how to debug this any further when it strikes again? Relevant logs?

neocturne commented 10 years ago

What hardware are you on? We know this behavour primarily from TL-WR1043ND v1 (known hardware issue), most other devices seem to be much more stable.

jplitza commented 10 years ago

Indeed at least one of the nodes with this problem was a TL-WR1043ND. Not mine though, that's a TL-WDR4300, and two TL-WR841ND were reported to have similar problems. Can't guarantee it's exactly the same, but sounded like it.

jplitza commented 10 years ago

Currently having the issue again, so it lasts for only 5 days or so...

Any debugging-suggestions?

T-X commented 10 years ago

dmesg output could be helpful as well as a tcpdump of an interface in monitor mode including a timestamp when the issue occurs in the dump. Maybe we can simplify the setup somehow, does disabling the adhoc interface section in /etc/config/wireless make any difference, does decreasing the txpower (to reduce the number of associated clients) make any difference?

Have you checked the OpenWRT bug tracker? if none exists there, then it'd probably make sense to create a ticket there. nbd is usually the guy there for all ath9k related issues.

Does it last for about 5 days or does it happen at about the same time of the day? To know whether the amount of something causes the issue and/or the appearance of a client in a certain condition at a certain time.

T-X commented 10 years ago

Could you upload the tcpdumps for br-client and wlan0 with the DHCP packets, somewhere? That somehow sounds like a bridge and not a wifi bug. Or a general network bug in the kernel, something not setting some skb-fields properly, maybe related to netfilter, too

jplitza commented 10 years ago

The problem partially vanished over night: client connections are working alright again. The mesh connection is still broken, but maybe it's the other node's fault (after all, the problem could occur on both nodes).

neocturne commented 10 years ago

Upstream bug: https://dev.openwrt.org/ticket/15464

neocturne commented 10 years ago

Everyone who has seen this bug: Please add a comment with the models and revisions of the hardware it occured on.

ohrensessel commented 10 years ago

WDR3600.

My WR842 is meshing with the WDR3600, which has a VPN uplink. At 2 o'clock the WR842 was online again after several hours of downtime, exactly after the WDR3600 got the new experimental version and therefore rebooted. The WR842 did not reboot (visible from the uptime distributed via alfred) yet.

jplitza commented 10 years ago

WDR4300 v1.6. Having it right now again (I had hoped the mac80211 update in bcecec23 fixed it, but apparently it didn't).

dmesg says many times

net_ratelimit: N callbacks suppressed

and somewhere in between (with many ratelimits in between)

[213257.970000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[215985.390000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[387234.360000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[502072.630000] ath: phy0: Failed to stop TX DMA, queues=0x001!

Don't know if thats relevant, the timestamps don't seem to coincide with the disappearing of my other, normally mesh-associated node.

ohrensessel commented 10 years ago

I think the "Failed to stop TX DMA" messages are relatively normal. I have seen them on quite a number of nodes during normal operation, even a long time pre-gluon.

neocturne commented 10 years ago

Please test again with 5df63b7f65ba5ce746d65b67660dbc114837e470 (Lübeck: 0.4+0-exp20140501) or newer. Don't forget to run make cleanall so all kernel patches get updated.

neocturne commented 10 years ago

Hmm, new comments in the upstream bug report suggest the new patches don't help...

ohrensessel commented 10 years ago

Just experienced the bug again on my 3600. Reboot got all mesh connections up and running again and clients on a meshing 842 could use freifunk again.

deerwood commented 10 years ago

TL-WR842ND v1: have 2 of them, A with Gluon (0.4a ffhh) and VPN and B with with pre Gluon and no VPN, 2 rooms apart and meshing rather good normally. B has 1-7 clients and an uptime of several month. Before Gluon A also had an uptime of several month. A has 2-19 clients depending on day time. Since Gluon on A I experience problems:

a) seldom A reboots without my intervention b) more important: A dies as outlined above, no meshing to B (and another node) any more, at the same time the clients have no Internet access any more, though the SSID is still announced. I reboot then. This happens rather often, in the last days I had uptimes of only 1/2 hour to ~22 hours. I can't see any relation to external circumstances (temperature, power loss).

The only thing: the failure seems to be more often with many clients, but there is no direct relation. E.g. even with 2-15 clients it MAY work for 22 hours, but it may also fail after 2 hours etc.

Joachim-Wolff commented 10 years ago

I have observed this (or similar) issue on TL-WR841ND v8 running Hamburg FW before Gluon (Attitude Adjustment, r37768), but sitting via WLAN between two Gluon v2014.1a nodes, node has no VPN link.

"iw dev wlan0-1 station dump" and "batctl o" display no neighbors, also nothing in the log, "wifi" fix the problem. Occurred twice at an uptime of 11 and 47 hours.

jplitza commented 10 years ago

In Bremen we have fewer problems since the nightly from 2014-05-01, probably commit 5df63b7f65ba5ce746d65b67660dbc114837e470. So even though @NeoRaider said that "upstream comments suggest it doesn't help", it does for us. Hopefully. (Hardware still is WDR4300 for me, but others reported fewer problems as well)

sargon commented 10 years ago

Maybe related: I discovered that on my UBNT UniFi AP Outdoor, every time the wireless connection dies, the hostapd process has been vanished. So maybe this bug is related to some problem in hostapd.

jplitza commented 10 years ago

Just encountered the problem again. The hostapd processes were still running, but instead of executing wifi I tried restarting the 2.4GHz-one with the same command line as before and it worked. Maybe this is no news, but I didn't know what magic wifi does apart from restarting hostapd.

rubo77 commented 10 years ago

Please add a cronjob that checks every 5 minutes if there are no clients connected and eventually calls wifi then to restart eventually crashed services

neocturne commented 10 years ago

Using the latest version (4e6628cebf059310f6f5217243994bc41f38b213), I can't reproduce the problems anymore. Is everyone using the latest version and does make update and make clean after each update?

neocturne commented 10 years ago

Hmm, now I hit this bug again, but this time it took much longer that before (about 72 hours).

T-X commented 10 years ago

So, reverting the mac80211/ath9k backport, switching back to the one in AA? Or what was the issue with mac80211/ath9k from AA?

jplitza commented 10 years ago

Or what was the issue with mac80211/ath9k from AA?

At least the newer models like 841v9 and 1043v2 have no chance of being supported there, if I understood correctly.

rubo77 commented 10 years ago

I thought at least here in Kiel we have the 841 v9 running fine, don't they?

neocturne commented 10 years ago

@rubo77, the point is: We can't downgrade to an older (possibly more stable) mac80211 version, as we'd lose support for newer models like the 841 v9 then.

deerwood commented 10 years ago

Because I have a stable meshing neighbor node I've a cron job checking, if meshing is still there, essentially by batctl originators | grep 'wlan0-1]'. If not, I check 3 more times/minutes to avoid false alarms. Then I record dmesg and logread output and finally wifi to bring the node to life again. In my experience batctl originators still lists meshing neighbors ~ 5 minutes after the nodes wifi already went down and clients have no internet access any more.

Inspecting the dmesg and logread output around the time, the wifi probably broke (-8 minutes after recording the logs), there is nothing unusual ... EXCEPT for this message from logread, which is there in almost any problem case (~12 times since recording logs):

daemon.info hostapd: wlan0: STA 70:56:81:XX:XX:XX IEEE 802.11: disconnected due to excessive missing ACKs

In my case it's always the same MAC (some Apple device from the neighborhood) and I can't tell, if this is just a SYMPTOM of the wifi already broken or maybe the CAUSE of the problem? Other clients connected at the same time DON'T show similar behavior.

What else can I record to help in debugging?

Best regards, Georg

neocturne commented 10 years ago

AFAICT, the "disconnected due to excessive missing ACKs" messages are a consequence of the WLAN being broken and not the cause.

iw dev wlan0 station dump and iw dev wlan0-1 station dump is a better check for WLAN saneness than batctl, as batman-adv will take a while till the originators time out. Maybe additionally doing a scan for other networks would be even better, but I haven't testet that. If it works (i.e., the scan doesn't show any networks when the WLAN is broken), it would lead to less false positives when no clients are connected and no other nodes are in range.

Unfortunately, I don't think we can do anything to further debug the issue. Fortunately, comments in the upstream bug suggest it's easy to reproduce the issue with vanilla OpenWrt by causing enough traffic, so I hope we can expect a fix soon. I don't known anything near enough about the internals of ath9k to debug this myself, and familarizing with it would cost a lot of time that I can probably better invest elsewhere... so waiting for a fix from upstream it is.

deerwood commented 10 years ago

I'm still checking with batctl originators | grep 'wlan0-1] and record iw dev wlan0-1 station dump (for meshing neighbors) as well as iw dev wlan0 station dump (for clients). Additionally I record iwinfo wlan0-1 scan (for other networks) before restarting the wifi.

Thus I can tell:

the ONLY reliable way to check for the problem is using batctl o with stable neighbors. When batctl tells "no neighbors any more" then all other commands signal "all is well". E.g. iw dev wlan0-1 station dump shows all 3 potential "freifunk" nodes, and iwinfo ... is full blown, as usual.

Best regards, Georg

tcatm commented 10 years ago

As of Gluon 2014.2 the bug is quite easily detectable using the debugfs. Normally, cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues will show something like:

(VO):  qnum: 0 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(VI):  qnum: 1 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(BK):  qnum: 3 qdepth:  0 ampdu-depth:  0 pending:   0 stopped: 0
(CAB): qnum: 8 qdepth:  9 ampdu-depth:  0 pending:   0 stopped: 0

Once the bug is triggered, the (BE) queue will change to stopped: 1:

(BE):  qnum: 2 qdepth:  0 ampdu-depth:  0 pending: 125 stopped: 1

The BE (Best Effort) queue is the default queue for virtually all traffic. It is stopped due to being full.

According to nbd this is not caused by ANI.

Running wifi is sufficient to get a device running again.

T-X commented 10 years ago

We had a node with stalled wifi here with an experimental branch firmware of about the last stable release. Unfortunately I wasn't able to check whether the "stopped" was present, because attaching a network cable to the wan-port resulted into a short hiccup throwing this trace, giving an interesting pointer to a certain line in ath9k/recv.c:

[788499.920000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[788499.940000] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020 DMADBG_7=0x000084c0
[788499.950000] ath: phy0: Could not stop RX, we could be confusing the DMA engine when we start RX up
[788499.960000] ------------[ cut here ]------------
[788499.960000] WARNING: at /home/volatile/nils/gluon-nightly/gluon/openwrt/build_dir/linux-ar71xx_generic/compat-wireless-2014-05-22/drive
rs/net/wireless/ath/ath9k/recv.c:493 0x81be745c()
[788499.980000] Modules linked in: ath79_wdt macvlan ledtrig_usbdev ledtrig_netdev batman_adv(O) ip6t_REJECT ip6t_LOG ip6t_rt ip6t_hbh ip6t
_mh ip6t_ipv6header ip6t_frag ip6t_eui64 ip6t_ah ip6table_raw ip6_queue ip6table_mangle ip6table_filter ip6_tables nf_conntrack_ipv6 nf_def
rag_ipv6 ebt_ip6 ebt_snat ebt_dnat ebt_arpreply ebt_ip ebt_arp ebt_redirect ebt_mark ebt_vlan ebt_stp ebt_pkttype ebt_mark_m ebt_limit ebt_
among ebt_802_3 ebtable_nat ebtable_filter ebtable_broute ebtables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp ipt_MASQUERADE i
ptable_nat nf_nat xt_conntrack xt_CT xt_NOTRACK iptable_raw xt_state nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ehci_hcd ipt_REJECT xt_T
CPMSS ipt_LOG xt_comment xt_multiport xt_mac xt_limit iptable_mangle iptable_filter ip_tables xt_tcpudp x_tables tun act_police cls_u32 sch
_tbf sch_ingress ath9k(O) ath9k_common(O) ath9k_hw(O) ath(O) mac80211(O) usbcore usb_common nls_base libcrc32c crc16 ipv6 cfg80211(O) compa
t(O) arc4 aes_generic crc32c crypto_hash crypto_algapi ledtrig_timer ledtrig_default_on leds_gpio gpio_button_hotplug(O)
[788500.070000] Call Trace:[<80285a28>] 0x80285a28
[788500.080000] [<80285a28>] 0x80285a28
[788500.080000] [<80071aac>] 0x80071aac
[788500.090000] [<81be745c>] 0x81be745c
[788500.090000] [<81bf0000>] 0x81bf0000
[788500.090000] [<80071af0>] 0x80071af0
[788500.100000] [<81be745c>] 0x81be745c
[788500.100000] [<81be2fa0>] 0x81be2fa0
[788500.100000] [<81be4eb0>] 0x81be4eb0
[788500.110000] [<81be56f0>] 0x81be56f0
[788500.110000] [<81be3224>] 0x81be3224
[788500.120000] [<81be56f0>] 0x81be56f0
[788500.120000] [<81be54fc>] 0x81be54fc
[788500.120000] [<80085fa0>] 0x80085fa0
[788500.130000] [<81be16e0>] 0x81be16e0
[788500.130000] [<80086a50>] 0x80086a50
[788500.130000] [<80289124>] 0x80289124
[788500.140000] [<80086824>] 0x80086824
[788500.140000] [<80086824>] 0x80086824
[788500.140000] [<8008a394>] 0x8008a394
[788500.150000] [<80064b90>] 0x80064b90
[788500.150000] [<8008a30c>] 0x8008a30c
[788500.160000] [<80064b80>] 0x80064b80
[788500.160000] 
[788500.160000] ---[ end trace 2c07e113c3e2609f ]---

And after that, the wife of the same node worked again - without any reboot or "wifi" :D. Do others see the same behaviour, that giving the network stack a bump at some other place makes the wifi work again (like here simply, physically attaching a network cable)?

ohrensessel commented 10 years ago

I think (!) I noticed the same yesterday. My node had stopped: 1 yesterday so I tried to get a dump of the ath9k debugfs. While trying that multiple times I always got segmentation faults. At same point stopped read zero again. No reboot or call of wifi during that time.

tcatm commented 10 years ago

@ohrensessel Did the wifi work again after stopped got back to 0?

ohrensessel commented 10 years ago

yes

tcatm commented 10 years ago

That's interesting. If possible, can you get a timeline of the events? I.e. how long did it take to go from stopped: 1 to stopped: 0 and what commands did you issue in the meanwhile?

ohrensessel commented 10 years ago

sadly this is kind of hard to tell. while during some other things I tried to tar the ath9k debugfs directly or copy it to /tmp. While doing so I always got segmentation faults. As I then had other stuff to do, I wanted to try it again later. When I checked for the stopped condition at that later time, it was gone. Maybe stopped lasted for about an hour.

tcatm commented 10 years ago

Another stuck node. All queues running. Interrupts are triggering (both RX and TX).

[502096.420000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[507097.430000] ath: phy0: Timeout while waiting for nf to load: AR_PHY_AGC_CONTROL=0x1d0d5a

ohrensessel commented 10 years ago

I also had the second message on my node several times. Even when everything seemed to work. I will observe that now

tcatm commented 10 years ago

This issue is now tracked in #130.

freifunk-gluon / gluon

2.4GHz WLAN dies after some time #93