What hardware are you on? We know this behaviour primarily from the TL-WR1043ND v1 (known hardware issue); most other devices seem to be much more stable.
Indeed at least one of the nodes with this problem was a TL-WR1043ND. Not mine though, that's a TL-WDR4300, and two TL-WR841ND were reported to have similar problems. Can't guarantee it's exactly the same, but sounded like it.
Currently having the issue again, so it lasts for only 5 days or so...
Any debugging suggestions?
dmesg output could be helpful, as well as a tcpdump of an interface in monitor mode, together with a timestamp of when the issue occurs in the dump. Maybe we can simplify the setup somehow: does disabling the adhoc interface section in /etc/config/wireless make any difference? Does decreasing the txpower (to reduce the number of associated clients) make any difference?
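A minimal sketch of such a monitor-mode capture (assuming the affected radio is phy0; "mon0" is just a name chosen here for illustration):

iw phy phy0 interface add mon0 type monitor
ip link set mon0 up
tcpdump -i mon0 -s 0 -w /tmp/mon0.pcap
# when the issue occurs, note the time, e.g. with: date; dmesg | tail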
Have you checked the OpenWrt bug tracker? If no ticket exists there, it'd probably make sense to create one. nbd is usually the guy there for all ath9k-related issues.
Does it last for about 5 days, or does it happen at about the same time of day? That would tell us whether the accumulation of something causes the issue, or whether a client showing up in a certain condition at a certain time triggers it.
Could you upload the tcpdumps for br-client and wlan0 with the DHCP packets somewhere? That somehow sounds like a bridge bug rather than a wifi bug. Or a general network bug in the kernel, something not setting some skb fields properly, maybe related to netfilter, too.
The problem partially vanished overnight: client connections are working all right again. The mesh connection is still broken, but maybe it's the other node's fault (after all, the problem could occur on both nodes).
Upstream bug: https://dev.openwrt.org/ticket/15464
Everyone who has seen this bug: Please add a comment with the models and revisions of the hardware it occurred on.
WDR3600.
My WR842 is meshing with the WDR3600, which has a VPN uplink. At 2 o'clock the WR842 was online again after several hours of downtime, exactly after the WDR3600 got the new experimental version and therefore rebooted. The WR842 itself has not rebooted yet (visible from the uptime distributed via alfred).
WDR4300 v1.6. Experiencing it again right now (I had hoped the mac80211 update in bcecec23 had fixed it, but apparently it didn't).
dmesg says many times
net_ratelimit: N callbacks suppressed
and somewhere in between (with many ratelimits in between):
[213257.970000] ath: phy0: Failed to stop TX DMA, queues=0x005!
[215985.390000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[387234.360000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[502072.630000] ath: phy0: Failed to stop TX DMA, queues=0x001!
Don't know if that's relevant; the timestamps don't seem to coincide with the disappearance of my other, normally mesh-associated node.
I think the "Failed to stop TX DMA" messages are relatively normal. I have seen them on quite a number of nodes during normal operation, even a long time pre-gluon.
Please test again with 5df63b7f65ba5ce746d65b67660dbc114837e470 (Lübeck: 0.4+0-exp20140501) or newer. Don't forget to run "make cleanall" so all kernel patches get updated.
Hmm, new comments in the upstream bug report suggest the new patches don't help...
Just experienced the bug again on my 3600. Reboot got all mesh connections up and running again and clients on a meshing 842 could use freifunk again.
TL-WR842ND v1: I have 2 of them, A with Gluon (0.4a ffhh) and VPN, and B with a pre-Gluon firmware and no VPN, 2 rooms apart and normally meshing rather well. B has 1-7 clients and an uptime of several months. Before Gluon, A also had an uptime of several months. A has 2-19 clients depending on the time of day. Since Gluon on A I experience problems:
a) seldom, A reboots without my intervention
b) more importantly, A dies as outlined above: no meshing to B (and another node) any more, and at the same time the clients have no Internet access any more, though the SSID is still announced. I reboot then. This happens rather often; in the last days I had uptimes of only half an hour to ~22 hours. I can't see any relation to external circumstances (temperature, power loss).
The only thing: the failure seems to happen more often with many clients, but there is no direct relation. E.g. even with 2-15 clients it MAY work for 22 hours, but it may also fail after 2 hours, etc.
I have observed this (or a similar) issue on a TL-WR841ND v8 running Hamburg FW before Gluon (Attitude Adjustment, r37768), but sitting via WLAN between two Gluon v2014.1a nodes; the node has no VPN link.
"iw dev wlan0-1 station dump" and "batctl o" display no neighbors, and there is nothing in the log; "wifi" fixes the problem. Occurred twice, at uptimes of 11 and 47 hours.
In Bremen we have fewer problems since the nightly from 2014-05-01, probably commit 5df63b7f65ba5ce746d65b67660dbc114837e470. So even though @NeoRaider said that "upstream comments suggest it doesn't help", it does for us. Hopefully. (Hardware still is WDR4300 for me, but others reported fewer problems as well)
Maybe related: I discovered that on my UBNT UniFi AP Outdoor, every time the wireless connection dies, the hostapd process has vanished. So maybe this bug is related to some problem in hostapd.
Just encountered the problem again. The hostapd processes were still running, but instead of executing "wifi" I tried restarting the 2.4 GHz one with the same command line as before, and it worked. Maybe this is no news, but I didn't know what magic "wifi" does apart from restarting hostapd.
Please add a cronjob that checks every 5 minutes whether no clients are connected and, if so, calls "wifi" to restart possibly crashed services.
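A minimal sketch of such a check (interface name, script path and log messages are assumptions for illustration, not an existing Gluon feature):

#!/bin/sh
# hypothetical /usr/sbin/wifi-watchdog.sh: run "wifi" when no stations are associated
if [ "$(iw dev wlan0 station dump | grep -c '^Station')" -eq 0 ]; then
    logger -t wifi-watchdog "no associated stations, restarting wifi"
    wifi
fi

Called from /etc/crontabs/root every 5 minutes:

*/5 * * * * /usr/sbin/wifi-watchdog.sh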
Using the latest version (4e6628cebf059310f6f5217243994bc41f38b213), I can't reproduce the problems anymore. Is everyone using the latest version, and do you run "make update" and "make clean" after each update?
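For reference, the rebuild sequence referred to here would look roughly like this (a sketch; the exact invocation may differ per site):

make update     # refresh packages and patches
make clean      # or "make cleanall", so kernel patches get re-applied
make            # rebuild the images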
Hmm, now I hit this bug again, but this time it took much longer than before (about 72 hours).
So, reverting the mac80211/ath9k backport, switching back to the one in AA? Or what was the issue with mac80211/ath9k from AA?
At least the newer models like 841v9 and 1043v2 have no chance of being supported there, if I understood correctly.
I thought that at least here in Kiel the 841 v9 nodes are running fine, aren't they?
@rubo77, the point is: We can't downgrade to an older (possibly more stable) mac80211 version, as we'd lose support for newer models like the 841 v9 then.
Because I have a stable meshing neighbor node, I've set up a cron job that checks whether meshing is still there, essentially by "batctl originators | grep 'wlan0-1]'". If not, I check 3 more times/minutes to avoid false alarms. Then I record the "dmesg" and "logread" output and finally run "wifi" to bring the node back to life. In my experience, "batctl originators" still lists meshing neighbors ~5 minutes after the node's wifi has already gone down and clients have no internet access any more.
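A minimal sketch of such a watchdog (retry count, file names and paths are my own choices for illustration, not Georg's exact script):

#!/bin/sh
# hypothetical /usr/sbin/mesh-watchdog.sh: record logs and restart wifi when no mesh neighbors are seen
for i in 1 2 3 4; do
  if batctl originators 2>/dev/null | grep -q 'wlan0-1]'; then
    exit 0                       # at least one neighbor reachable via wlan0-1, all is well
  fi
  [ "$i" -lt 4 ] && sleep 60     # re-check a few times to avoid false alarms
done
# no neighbor seen in four checks: record the logs, then restart the wifi
dmesg   > "/tmp/watchdog-dmesg-$(date +%s)"
logread > "/tmp/watchdog-logread-$(date +%s)"
wifi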
Inspecting the "dmesg" and "logread" output around the time the wifi probably broke (up to ~8 minutes before the logs were recorded), there is nothing unusual ... EXCEPT for this message from "logread", which is there in almost every problem case (~12 times since I started recording logs):
daemon.info hostapd: wlan0: STA 70:56:81:XX:XX:XX IEEE 802.11: disconnected due to excessive missing ACKs
In my case it's always the same MAC (some Apple device from the neighborhood), and I can't tell whether this is just a SYMPTOM of the already broken wifi or maybe the CAUSE of the problem. Other clients connected at the same time DON'T show similar behavior.
What else can I record to help in debugging?
Best regards, Georg
AFAICT, the "disconnected due to excessive missing ACKs" messages are a consequence of the WLAN being broken and not the cause.
"iw dev wlan0 station dump" and "iw dev wlan0-1 station dump" are a better check for WLAN sanity than "batctl", as batman-adv will take a while until the originators time out. Maybe additionally doing a scan for other networks would be even better, but I haven't tested that. If it works (i.e., the scan doesn't show any networks when the WLAN is broken), it would lead to fewer false positives when no clients are connected and no other nodes are in range.
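A scan-based check could be as simple as counting the visible networks (a sketch using iwinfo, as mentioned further down; zero results would then mean either a broken WLAN or genuinely no networks in range):

iwinfo wlan0-1 scan | grep -c '^Cell'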
Unfortunately, I don't think we can do anything to debug the issue further. Fortunately, comments in the upstream bug suggest it's easy to reproduce the issue with vanilla OpenWrt by causing enough traffic, so I hope we can expect a fix soon. I don't know anywhere near enough about the internals of ath9k to debug this myself, and familiarizing myself with it would cost a lot of time that I can probably invest better elsewhere... so waiting for a fix from upstream it is.
I'm still checking with "batctl originators | grep 'wlan0-1]'" and recording "iw dev wlan0-1 station dump" (for meshing neighbors) as well as "iw dev wlan0 station dump" (for clients). Additionally I record "iwinfo wlan0-1 scan" (for other networks) before restarting the wifi.
Thus I can tell:
the ONLY reliable way to check for the problem is using "batctl o" with stable neighbors. When batctl reports "no neighbors any more", all the other commands still signal "all is well". E.g. "iw dev wlan0-1 station dump" shows all 3 potential "freifunk" nodes, and the "iwinfo ..." output is full-blown, as usual.
Best regards, Georg
As of Gluon 2014.2 the bug is quite easily detectable using the debugfs. Normally, "cat /sys/kernel/debug/ieee80211/phy0/ath9k/queues" will show something like:
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 9 ampdu-depth: 0 pending: 0 stopped: 0
Once the bug is triggered, the (BE) queue will change to "stopped: 1":
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 125 stopped: 1
The BE (Best Effort) queue is the default queue for virtually all traffic. It is stopped due to being full.
According to nbd this is not caused by ANI.
Running "wifi" is sufficient to get a device running again.
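A check for this condition could be scripted roughly like this (a sketch; phy0 and the automatic "wifi" restart are assumptions):

#!/bin/sh
# hypothetical check: restart the wifi if an ath9k queue reports "stopped: 1"
QUEUES=/sys/kernel/debug/ieee80211/phy0/ath9k/queues
if [ -r "$QUEUES" ] && grep -q 'stopped: 1' "$QUEUES"; then
  logger -t queue-check "ath9k queue stopped, restarting wifi"
  wifi
fi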
We had a node with stalled wifi here, running an experimental branch firmware from around the last stable release. Unfortunately I wasn't able to check whether the "stopped" flag was present, because attaching a network cable to the WAN port resulted in a short hiccup throwing this trace, giving an interesting pointer to a certain line in ath9k/recv.c:
[788499.920000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[788499.940000] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x42000020 DMADBG_7=0x000084c0
[788499.950000] ath: phy0: Could not stop RX, we could be confusing the DMA engine when we start RX up
[788499.960000] ------------[ cut here ]------------
[788499.960000] WARNING: at /home/volatile/nils/gluon-nightly/gluon/openwrt/build_dir/linux-ar71xx_generic/compat-wireless-2014-05-22/drivers/net/wireless/ath/ath9k/recv.c:493 0x81be745c()
[788499.980000] Modules linked in: ath79_wdt macvlan ledtrig_usbdev ledtrig_netdev batman_adv(O) ip6t_REJECT ip6t_LOG ip6t_rt ip6t_hbh ip6t_mh ip6t_ipv6header ip6t_frag ip6t_eui64 ip6t_ah ip6table_raw ip6_queue ip6table_mangle ip6table_filter ip6_tables nf_conntrack_ipv6 nf_defrag_ipv6 ebt_ip6 ebt_snat ebt_dnat ebt_arpreply ebt_ip ebt_arp ebt_redirect ebt_mark ebt_vlan ebt_stp ebt_pkttype ebt_mark_m ebt_limit ebt_among ebt_802_3 ebtable_nat ebtable_filter ebtable_broute ebtables nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp ipt_MASQUERADE iptable_nat nf_nat xt_conntrack xt_CT xt_NOTRACK iptable_raw xt_state nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ehci_hcd ipt_REJECT xt_TCPMSS ipt_LOG xt_comment xt_multiport xt_mac xt_limit iptable_mangle iptable_filter ip_tables xt_tcpudp x_tables tun act_police cls_u32 sch_tbf sch_ingress ath9k(O) ath9k_common(O) ath9k_hw(O) ath(O) mac80211(O) usbcore usb_common nls_base libcrc32c crc16 ipv6 cfg80211(O) compat(O) arc4 aes_generic crc32c crypto_hash crypto_algapi ledtrig_timer ledtrig_default_on leds_gpio gpio_button_hotplug(O)
[788500.070000] Call Trace:
[788500.070000] [<80285a28>] 0x80285a28
[788500.080000] [<80285a28>] 0x80285a28
[788500.080000] [<80071aac>] 0x80071aac
[788500.090000] [<81be745c>] 0x81be745c
[788500.090000] [<81bf0000>] 0x81bf0000
[788500.090000] [<80071af0>] 0x80071af0
[788500.100000] [<81be745c>] 0x81be745c
[788500.100000] [<81be2fa0>] 0x81be2fa0
[788500.100000] [<81be4eb0>] 0x81be4eb0
[788500.110000] [<81be56f0>] 0x81be56f0
[788500.110000] [<81be3224>] 0x81be3224
[788500.120000] [<81be56f0>] 0x81be56f0
[788500.120000] [<81be54fc>] 0x81be54fc
[788500.120000] [<80085fa0>] 0x80085fa0
[788500.130000] [<81be16e0>] 0x81be16e0
[788500.130000] [<80086a50>] 0x80086a50
[788500.130000] [<80289124>] 0x80289124
[788500.140000] [<80086824>] 0x80086824
[788500.140000] [<80086824>] 0x80086824
[788500.140000] [<8008a394>] 0x8008a394
[788500.150000] [<80064b90>] 0x80064b90
[788500.150000] [<8008a30c>] 0x8008a30c
[788500.160000] [<80064b80>] 0x80064b80
[788500.160000] ---[ end trace 2c07e113c3e2609f ]---
And after that, the wifi of the same node worked again, without any reboot or "wifi" :D. Do others see the same behaviour, i.e. that giving the network stack a bump at some other place makes the wifi work again (like here, simply physically attaching a network cable)?
I think (!) I noticed the same yesterday. My node had "stopped: 1" yesterday, so I tried to get a dump of the ath9k debugfs. While trying that multiple times I always got segmentation faults. At some point stopped read zero again. No reboot or call of "wifi" during that time.
@ohrensessel Did the wifi work again after stopped got back to 0?
yes
That's interesting. If possible, can you get a timeline of the events? I.e. how long did it take to go from "stopped: 1" to "stopped: 0", and what commands did you issue in the meanwhile?
Sadly this is kind of hard to tell. While doing some other things, I tried to tar the ath9k debugfs directly or copy it to /tmp; doing so I always got segmentation faults. As I then had other stuff to do, I wanted to try again later. When I checked for the stopped condition at that later time, it was gone. Maybe stopped lasted for about an hour.
Another stuck node. All queues running. Interrupts are triggering (both RX and TX).
[502096.420000] ath: phy0: Failed to stop TX DMA, queues=0x004!
[507097.430000] ath: phy0: Timeout while waiting for nf to load: AR_PHY_AGC_CONTROL=0x1d0d5a
I also had the second message on my node several times, even when everything seemed to work. I will keep an eye on that now.
This issue is now tracked in #130.
In Bremen we have multiple reports of the (2.4 GHz) WLAN dying after some time (6 days, give or take a few).
This concerns both meshing and clients. I tried to debug the problem with clients:
"tcpdump -i br-client" on the node shows the DHCP requests and replies, while "tcpdump -i wlan0" only shows the requests, not the replies (and they never reach the client). The ebtables chains are empty; when I inserted a logging rule into each possible chain, the log output showed that the bridge found the correct OUT interface (wlan0), but the packet never went out that interface.
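For reference, such captures restricted to DHCP traffic could be taken like this (a sketch; the port filter and the -e flag for link-layer headers are my additions):

tcpdump -e -n -i br-client 'udp port 67 or udp port 68'
tcpdump -e -n -i wlan0 'udp port 67 or udp port 68'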
To make things even more confusing, when using 5 GHz (wlan1), everything worked fine. The ebtables log was the same.
I didn't debug the meshing interface wlan0-1 any further, but the meshing with my other (2.4 GHz-only) node was broken at that time as well.
Executing "wifi" on the failing node fixed both the client connection and the meshing connection.
Other people reported similar issues. They didn't come up during the nightly phase, presumably because the nodes were rebooting regularly (nearly each night after an update).
Any ideas on how to debug this any further when it strikes again? Relevant logs?