cyoung / stratux

Aviation weather and traffic receiver based on RTL-SDR.
BSD 3-Clause "New" or "Revised" License
1.06k stars 362 forks source link

Wi-Fi does not reconnect when Edimax unplugged/plugged #246

Closed Ergonomicmike closed 8 years ago

Ergonomicmike commented 8 years ago

Some users (including me) report random (seemingly) Wi-Fi disconnects. We've been brainstorming on the reddit about what could be causing it.

I don't know if what I found here is ultimately responsible for the disconnect. But if you unplug and then replug the Edimax, the Wi-Fi connection never comes back.

Now, it seems to me that, just like when the GPS is unplugged and replugged, the Edimax should gracefully terminate when unplugged and then restart when re-plugged. At least then we could eliminate a small scale physical disconnect (perhaps by a low power transient or an intermittent connection at a micro-level?) causing a large scale permanent disconnect.

Here's an excerpt from syslog when I unplugged the Edimax:

Feb 13 00:25:57 raspberrypi dhcpd: receive_packet failed on wlan0: Network is down
Feb 13 00:25:57 raspberrypi kernel: [  178.607381] usb 1-1.2: USB disconnect, device number 4
Feb 13 00:25:57 raspberrypi kernel: [  178.607972] rtw_cmd_thread: DriverStopped(0) SurpriseRemoved(1) break at line 482
Feb 13 00:25:58 raspberrypi ifplugd(wlan0)[1601]: Exiting.

And here's an excerpt when I plugged the Edimax back in. (It never reconnects.)

Feb 13 00:27:01 raspberrypi kernel: [  241.826018] usb 1-1.2: new high-speed USB device number 8 using dwc_otg
Feb 13 00:27:01 raspberrypi kernel: [  241.928046] usb 1-1.2: New USB device found, idVendor=7392, idProduct=7811
Feb 13 00:27:01 raspberrypi kernel: [  241.928079] usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Feb 13 00:27:01 raspberrypi kernel: [  241.928097] usb 1-1.2: Product: 802.11n WLAN Adapter
Feb 13 00:27:01 raspberrypi kernel: [  241.928113] usb 1-1.2: Manufacturer: Realtek
Feb 13 00:27:01 raspberrypi kernel: [  241.928129] usb 1-1.2: SerialNumber: 00e04c000001
Feb 13 00:27:01 raspberrypi ifplugd(wlan0)[2288]: ifplugd 0.28 initializing.
Feb 13 00:27:02 raspberrypi ifplugd(wlan0)[2288]: Using interface wlan0/74:DA:38:5B:E0:CD with driver <rtl8192cu> (version: )
Feb 13 00:27:02 raspberrypi ifplugd(wlan0)[2288]: Using detection mode: wireless extension
Feb 13 00:27:02 raspberrypi ifplugd(wlan0)[2288]: Initialization complete, link beat not detected.

I also posted a silent video of htop during this process. (The white flashes on screen show where I unplug and replug the Edimax.)

Ergonomicmike commented 8 years ago

Update: Disregard. After a little more reading, seems like NetworkManager is for client side connections.


Some report that ifplugd has problems. (At least in Jessie.) Say it is not the default in Debian. They suggest NetworkManager instead. I've got it installed but afraid to hack at interfaces and don't know how to create a script for nmcli.

ghost commented 8 years ago

@Ergonomicmike : "But if you unplug and then replug the Edimax, the Wi-Fi connection never comes back."

Confirmed. However, unloading and restarting both hostapd and isc-dhcp-server will restore the connection. You can't SSH into the Pi from a tablet if the Wi-Fi connection fails, but it raises the possibility of putting in a watchdog to restart both processes if / when the device disconnects. This might be an interim way of avoiding power cycles until the root cause of the connection drops is found.

peepsnet commented 8 years ago

I don't know if this is a step in the right direction or what. It doesnt seem very elegant but it is used with our same setup https://flightaware.com/adsb/piaware/build/optional

ghost commented 8 years ago

@peepsnet -- I'd call that a step in the right direction.

Here's the output from consecutive runs of 'ifconfig wlan0' before, during, and after an intentional hardware disconnect:

root@raspberrypi:/home/pi# ifconfig wlan0
wlan0     Link encap:Ethernet  HWaddr 74:da:38:5a:58:df
          inet addr:192.168.10.1  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:351285 errors:0 dropped:666 overruns:0 frame:0
          TX packets:192023 errors:0 dropped:217 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:977111 (954.2 KiB)  TX bytes:12260623 (11.6 MiB)

root@raspberrypi:/home/pi# ifconfig wlan0
wlan0: error fetching interface information: Device not found
root@raspberrypi:/home/pi# ifconfig wlan0
wlan0     Link encap:Ethernet  HWaddr 74:da:38:5a:58:df
          inet addr:192.168.10.1  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

root@raspberrypi:/home/pi# ifconfig wlan0
wlan0     Link encap:Ethernet  HWaddr 74:da:38:5a:58:df
          inet addr:192.168.10.1  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

root@raspberrypi:/home/pi# ifconfig wlan0
wlan0     Link encap:Ethernet  HWaddr 74:da:38:5a:58:df
          inet addr:192.168.10.1  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

There are two obvious changes after plugging the adapter back in. First, the RX and TX packet counts both drop to zero. Second, the link status changes from "UP BROADCAST RUNNING MULTICAST" to "UP BROADCAST MULTICAST".

Restarting hostapd and isc-dhcp-server changes status back to "UP BROADCAST RUNNING MULTICAST" and the packets begin flowing again.

root@raspberrypi:/home/pi# ifconfig wlan0
wlan0     Link encap:Ethernet  HWaddr 74:da:38:5a:58:df
          inet addr:192.168.10.1  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

root@raspberrypi:/home/pi# service isc-dhcp-server stop
[ ok ] Stopping ISC DHCP server: dhcpd.
root@raspberrypi:/home/pi# service hostapd stop
[ ok ] Stopping advanced IEEE 802.11 management: hostapd.
root@raspberrypi:/home/pi# service hostapd start
[ ok ] Starting advanced IEEE 802.11 management: hostapdioctl[RTL_IOCTL_HOSTAPD]: Invalid argument
root@raspberrypi:/home/pi# service isc-dhcp-server start
[ ok ] Starting ISC DHCP server: dhcpd.
root@raspberrypi:/home/pi# ifconfig wlan0
wlan0     Link encap:Ethernet  HWaddr 74:da:38:5a:58:df
          inet addr:192.168.10.1  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:120 errors:0 dropped:0 overruns:0 frame:0
          TX packets:90 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:5416 (5.2 KiB)

So... maybe something like this?

while true ; do
   if ifconfig wlan0 | grep -q "UP BROADCAST RUNNING MULTICAST" ; then
      sleep 30
   else
      echo "Wi-Fi connection down! Attempting reconnection."
      service hostapd restart
      service isc-dhcp-server restart
      sleep 10
   fi
done
Ergonomicmike commented 8 years ago

I'm glad the gurus are on this. From my purely layman's perspective (after looking at the sleep code and trying to get the sense of it), I was wondering if a Wi-Fi disconnect (for whatever reason) puts the Stratux Wi-Fi into a deep sleep from which it cannot wake up? (What they call the "sleep of death" in the Android world.) If I'm understanding correctly, Mr. AvSquirrel is saying, "Yes."

Also, I found some pointers to ifplugd in the code. There is an option -d (delay down), which is set for 5 seconds for the Ethernet connection. But I don't know if it is also set for 5 seconds for the wlan0 connection. There is some pointing going on from lib/udev/ifplugd.again for hotplug devices. But I can't figure out where the arguments there come from or what they are. The default for -d is 5 seconds. If the delay is not already 5 seconds (or even 1 second) for the hotplug wlan0, would a delay help to skip over potential dropouts?

In any event, I plan to go flying in about 6 hours. If someone posts a pull request before then, I can merge it and test it. Otherwise, if I get a disconnect, at least I'll know now to look at syslog.

ghost commented 8 years ago

I pushed a new commit to my fork [ https://github.com/AvSquirrel/stratux/commit/ec8c4b899be2c92cd69a05194f1351647d2c171c ] that logs the rate of Stratux Wi-Fi output, in terms of both bytes and of number of messages.

This will write message statistics to stratux.log once per second. Log format will be similar to this, which is with GPS and 1090ES enabled without any local traffic. WiFi connection was to a single iPhone running ForeFlight. AHRS was enabled at 03:30:05.

2016/02/14 03:30:04 Network messages sent: 2185 total, 5 last second.  Network bytes sent: 39778 total, 91 last second.
2016/02/14 03:30:05 Network messages sent: 2190 total, 5 last second.  Network bytes sent: 39869 total, 91 last second.
2016/02/14 03:30:05 wrote settings.
2016/02/14 03:30:06 Network messages sent: 2195 total, 5 last second.  Network bytes sent: 39960 total, 91 last second.
2016/02/14 03:30:07 Network messages sent: 2202 total, 7 last second.  Network bytes sent: 40109 total, 149 last second.
2016/02/14 03:30:08 Network messages sent: 2227 total, 25 last second.  Network bytes sent: 40789 total, 680 last second.
2016/02/14 03:30:09 Network messages sent: 2252 total, 25 last second.  Network bytes sent: 41470 total, 681 last second.
2016/02/14 03:30:10 Network messages sent: 2277 total, 25 last second.  Network bytes sent: 42151 total, 681 last second.
2016/02/14 03:30:11 Network messages sent: 2302 total, 25 last second.  Network bytes sent: 42832 total, 681 last second.

This won't solve any connectivity issues in and of itself, but could help determine if high message rates are causing disconnects.

@Ergonomicmike - would you like to be a guinea pig and build this to collect some data next time you fly? I'd give it a shot tomorrow, except my TAFs are nothing but -SN OVC015 all day.

Ergonomicmike commented 8 years ago

Yep, I'll do it. We flew today (two short legs of a half hour each) and I left the Stratux on while we were on the ground (two hours). No disconnects this time. I might try to sneak in an afternoon flight on President's Day. But if I can't, it won't be until next Friday or Saturday. (It's not due to wx. Our TAF's are pretty boring.)

Ergonomicmike commented 8 years ago

@Axtel4 should get in on this test. He reports 9K+ 1090 messages/min from home. (Using his superduper antennas, I presume.)

ghost commented 8 years ago

That test might not be necessary -- I found a few things today as I ran a few stress tests on the software, and made some fixes with #249.

I previously hypothesized that a major part, if not root cause, of the networking stability problem is the sheer number of messages that Stratux is throwing at the network adapter. To test this, I started by intentionally saturating sendMsg() with overwhelming numbers of traffic reports:

img_0058

webui - new method 1000 targets on two devices - revised

That's 1001 airplanes: 1000 synthesized targets circling my position, plus one actual DAL flight. With two instances of ForeFlight connected, 1000 targets killed the WiFi connection in 14 seconds. Remember that code I asked you to try? The same status report showed over 2000 messages per second:

2016/02/14 16:29:12 Using /dev/ttyAMA0 for GPS
2016/02/14 16:29:12 Network messages sent: 0 total, 0 last second.  Network bytes sent: 0 total, 0 last second.
2016/02/14 16:29:13 GPS configuration complete
2016/02/14 16:29:14 Network messages sent: 2035 total, 2035 last second.  Network bytes sent: 65085 total, 65085 last second.
2016/02/14 16:29:15 Network messages sent: 4075 total, 2040 last second.  Network bytes sent: 130223 total, 65138 last second.
2016/02/14 16:29:16 Network messages sent: 6105 total, 2030 last second.  Network bytes sent: 195185 total, 64962 last second.
2016/02/14 16:29:17 Network messages sent: 8152 total, 2043 last second.  Network bytes sent: 260435 total, 65216 last second.
2016/02/14 16:29:18 Network messages sent: 10195 total, 2043 last second.  Network bytes sent: 325637 total, 65202 last second.
2016/02/14 16:29:19 Network messages sent: 12301 total, 2106 last second.  Network bytes sent: 392937 total, 67300 last second.
2016/02/14 16:29:20 Network messages sent: 14348 total, 2047 last second.  Network bytes sent: 458352 total, 65415 last second.
2016/02/14 16:29:21 Network messages sent: 16388 total, 2040 last second.  Network bytes sent: 523492 total, 65140 last second.
2016/02/14 16:29:22 Network messages sent: 18428 total, 2040 last second.  Network bytes sent: 588660 total, 65168 last second.
2016/02/14 16:29:23 Network messages sent: 20468 total, 2040 last second.  Network bytes sent: 653812 total, 65152 last second.
2016/02/14 16:29:24 Network messages sent: 22510 total, 2042 last second.  Network bytes sent: 719006 total, 65194 last second.
2016/02/14 16:29:25 Network messages sent: 24555 total, 2045 last second.  Network bytes sent: 784345 total, 65339 last second.
2016/02/14 16:29:26 Network messages sent: 25444 total, 889 last second.  Network bytes sent: 812827 total, 28482 last second.
2016/02/14 16:29:27 Network messages sent: 25444 total, 0 last second.  Network bytes sent: 812827 total, 0 last second.

So, I updated sendTrafficUpdates() so it batched the traffic updates into one long message, rather than hundreds of short messages. This doesn't affect timing or age of the traffic data, since sendTrafficUpdates() used to just iterate over the traffic database once per second, but it makes sending the data much more efficient:

2016/02/14 20:07:01 Network messages sent: 87163 total, 32 last second.  Network bytes sent: 176440816 total, 64754 last second.
2016/02/14 20:07:02 Network messages sent: 87195 total, 32 last second.  Network bytes sent: 176505566 total, 64750 last second.
2016/02/14 20:07:03 Network messages sent: 87227 total, 32 last second.  Network bytes sent: 176570316 total, 64750 last second.

What used to be 2000 messages per second was reduced to 32 per second. It also didn't crash the connection: I ran for over five hours before pulling the plug. ForeFlight was actually the weak link here -- it crashed at least once on each of the iDevices that I was using.

The other half of the improvements was doing something similar to the message queueing routine used to collect FIS-B weather / radar / NOTAM data. For that fix, I broke up the message queue into 256-message chunks, and strung those together before sending them to the clients. This is most likely to help when bringing a tablet out of standby, or switching to/from your EFB. Testing on the original method showed message rates of over 800 per second. With these improvements, I brought it down to several dozen per second.

Ergonomicmike commented 8 years ago

Cool! So in layman's terms, you are cutting down the overhead involved in passing one message at a time? (Among other things.)

JohnOCFII commented 8 years ago

Looks like some excellent improvements in efficiency. Keeping the WiFi network from becoming saturated is a good thing.

Ergonomicmike commented 8 years ago

Fantastic progress here. Should this issue be closed now? I don't know. I unplugged the Edimax on the v0.7b1 build and plugged it back in. It doesn't do the restart thing that @peepsnet & @AvSquirrel were kicking around above. Perhaps it doesn't need to now. But does it hurt to put restart in? Kinda like a belt & suspenders if ever the Edimax disconnects? My uninformed opinion is that since piaware has done it (and therefore tested it), the Stratux should do it too. I defer to the experts. (Update: I noticed after the fact that Chris put the "Bug" label on this. So it's still in play?")

cyoung commented 8 years ago

Still in play, the issue as the title states hasn't been addressed - but the situation w.r.t. disconnects is likely improved (needs more testing).

pp4life commented 8 years ago

I ran my 2 SDR no GPS/AHRS Version 0.7b1 Stratux on the 5V 2A wall plug that came with the Vilros kit to replicate issue #254. Hours later I come back and I'm getting nothing on the the iPad Mini webUI. I walk over to the Stratux and there's no blue LED on the Edimax. This is after running all day on the Anker battery with an iPod touch and iPad Mini connected to it at work. Log file attached of the session where the the shutdown happened. I can see a shutdown call to UAT after about 40 minutes and some other wierdness. I rebooted and everything came up just fine. I'll leave it running over night and see what happens. Maybe a WiFi restart isn't such a bad idea. stratuxlog.zip

Update: This morning everything was fine after running all night. However, I didn't have the iPod and the iPad connected all night.

cyoung commented 8 years ago

@Ergonomicmike - Give http://updates.stratux.me/builds/update-v0.7b3-857b71b716.sh a try - in testing (pulling the dongle and re-inserting) it will take ~30-45 secs to come back up.

Ergonomicmike commented 8 years ago

It didn't work at first. And then I remembered that I had done a git clone, which prevents the .sh update from working correctly. I reimaged back to v0.7b1 and updated via the .sh above. Edimax reconnects now when physically disconnected / unplugged. Yea!

It'll be interesting to see if this fix can recover when the blue light goes out. (Since bwkny got a BLOD today when he tried to connect to his iPad, and since he has measured 4.95v at his USB, the BLOD is looking like it's some kind of client negotiation problem. If I get a BLOD with these latest builds, I'm going to try Performance governor to give the kernel more oomph for the controller.)