Matthew-Hsu / PiPass

Nintendo 3DS Homepass for the Raspberry Pi
149 stars 20 forks source link

Hostapd Recovery - Invalid MAC Addresses #22

Closed nagledb closed 9 years ago

nagledb commented 9 years ago

After I recently updated to revision 9117f06b2d605483046f4db22ef5612ac8eac665, PiPass began failing for me with this message at the end of the logs:

2015-05-21 05:46:38,407 - __main__ - CRITICAL - Unable to start hostapd.

Sometimes it would happen immediately after I start PiPass. Other times, PiPass would run happily for various lengths of time (from a few minutes to about a full day) before terminating with that message.

To help rule out the possibility it was as hardware failure, I set up PiPass on a fresh SD card on a different Raspberry Pi with a different WiFi dongle. The issue happened on that setup as well.

I then rolled back to version 86ae9ec38f63a97fe4cde88e2c2db865b8f6c868 which I knew had been stable for me before. It's been running now for the past two days without issue.

Any idea what might have changed between 86ae9ec38f63a97fe4cde88e2c2db865b8f6c868 and 9117f06b2d605483046f4db22ef5612ac8eac665 that might cause intermittent issues with starting hostapd? If not, I can try bisecting to see if I can narrow down where the problem might have been introduced. But given the highly intermittent nature of the failure, that kind of exploration could take days.

Matthew-Hsu commented 9 years ago

Hmm... I believe I may have a guess to what is happening. I'm just wondering, are you using a Pi 1 or a Pi 2? I have PiPass running on a Pi 2 for about three days without issue.

I've switched to using subprocess.call() instead as I understood that it would wait for the process to finish before moving on. I would then check to see if hostapd was running and if it is not, then exit PiPass. I'm assuming there is a timing issue and I am wondering if it is only showing on the Pi 1. If you look at line 331 to 341 on master, you can see what I am talking about.

I'm wondering if the check I am doing is too unforgiving and does not account for all timing issues as I am guessing that subprocess.call() does not always guarantee this. If this is the case, I can relax that check.

If you can, could you comment out the exit(1) on line 341 and see if this issue persists? I am hoping that hostapd just needed some extra time to start up in these cases.

nagledb commented 9 years ago

Both of my Pis are Pi 2. I'm using the CanaKit you have referenced on the instructions: http://smile.amazon.com/gp/product/B008XVAVAW/

I updated to master and commented out exit(1) as you suggest. I'll try that and see how it goes.

nagledb commented 9 years ago

I've been running PiPass with that modification (commented out exit(1)) for the past approximately 40 hours with a 5 minute cycle time. Looking at the logs, I see 5 times where I received the "Unable to start hostapd" message. However, it then proceeds to work normally after that and I am continuing to receive StreetPasses. So it looks like you were correct in the cause of the issue. :)

Matthew-Hsu commented 9 years ago

That is good to hear. I've been trying to replicate the issue on my end, but I haven't yet. I've tried some aggressive settings with 1 minute intervals for about 40 hours as well and so far it seems that subprocess.call() does wait until hostapd fully starts up.

I am not sure why it is behaving this way for you, but I've implemented some code that will address this issue and it should allocate extra time for hostapd to fully start in those cases. I will push those changes later today.

Matthew-Hsu commented 9 years ago

I've pushed some changes that should account for the cases where subprocess.call() prematurely executes more code while hostapd is not fully ready.

It is interesting that it happens rarely for you and I am betting that in previous versions, hostapd just needed extra time too, but we just never noticed it without the logging system. With these checks, the dashboard should be quite accurate when there is a MAC change since a MAC change technically happens when hostapd has started.

If you have some time, could you test these changes out and see if it accounts for these cases? I have tried these changes locally for about 40 hours, but I could not get the logging system to record the case where it needed to retry starting hostapd up again.

Thanks!

nagledb commented 9 years ago

I just updated to the latest version and restarted. I'll let you know in a day or two how it goes. :)

Matthew-Hsu commented 9 years ago

Thanks a bunch! I really appreciate it and I hope it fixes the issue for you.

nagledb commented 9 years ago

Oops, it didn't work. :(

2015-05-27 05:58:54,655 - __main__ - WARNING - Unable to start hostapd. Attempting to start hostapd again.
2015-05-27 05:59:17,887 - __main__ - WARNING - Unable to start hostapd. Attempting to start hostapd again.
2015-05-27 05:59:41,144 - __main__ - WARNING - Unable to start hostapd. Attempting to start hostapd again.
2015-05-27 05:59:41,146 - __main__ - CRITICAL - Cannot start hostapd.

I just put in a 30 second sleep after failure to give it more time to recover. I figure since it was able to keep going before on a 5 minute interval, maybe it just needs a bit more time when it fails. I'll test it this way a bit and will follow up once I see how it does.

nagledb commented 9 years ago

Apparently 30 seconds wasn't enough... :( I'll try a higher time out.

It also just occurred to me: I'm using a custom list. Could a bad mac address cause this symptom? I'll update mine to output the mac to the log. Maybe a more appropriate response on failure would be to wait a little, then skip to the next address?

2015-05-28 02:29:14,739 - __main__ - WARNING - Unable to start hostapd. Will try again in 30 seconds.
2015-05-28 02:30:08,029 - __main__ - WARNING - Unable to start hostapd. Will try again in 30 seconds.
2015-05-28 02:31:01,280 - __main__ - WARNING - Unable to start hostapd. Will try again in 30 seconds.
2015-05-28 02:31:31,311 - __main__ - CRITICAL - Cannot start hostapd.
Matthew-Hsu commented 9 years ago

I tried this out myself and you are correct. An incorrect MAC address will fail hostapd, so it seems your list has a malformed MAC address and it explains why I am not seeing this issue on my end.

subprocess.call() should protect against timing issues, so I guess we have to think about an appropriate response. We could do something where if it detects a failure, then move onto the next zone and if a combo of failures occur (e.g., 3 in a row), then exit out of PiPass since we want to best guess that it is a driver issue VS a malformed MAC address. I probably should print out the MAC addresses too, so it would be beneficial to detect and remove these malformed MAC addresses.

Matthew-Hsu commented 9 years ago

I am not sure if I can be 100% that there is a driver issue VS an invalid MAC address. Counting consecutive failures won't work in the cases where someone will be using a particularly small list (e.g., 1). When hostapd is failing because of a driver issue, I cannot rely on consecutive failures in the particularly small list case. I could, however, keep a count of total maximum failures and just exit out when this threshold has been met.

I have pushed changes that log and post helpful warnings for people to remedy the invalid MAC address issue in the meantime.

nagledb commented 9 years ago

I left it running overnight with my change and it got tripped up over the mac address CB:8E:19:6B:D1:F2 with an SSID of attwifi. The mac address doesn't look invalid to me, so I'm a bit puzzled. (It comes from the "Personal MACs" tab of the "Remote StreetPass Relay MAC's" google doc.)

I just upgraded to the latest version and ran it again with a very short zone list and with your changes, it does indeed correctly stop and report when it encounters that mac address:

2015-05-28 14:04:24,727 - __main__ - ERROR - Unable to start hostapd with the MAC address: CB:8E:19:6B:D1:F2.
2015-05-28 14:04:24,730 - __main__ - WARNING - A possible WiFi driver issue or invalid MAC address has been detected.
2015-05-28 14:04:24,731 - __main__ - INFO - PiPass has been shutdown with an error.

So I consider this resolved now (though I'm still very curious as to how/why that mac address is invalid). Thanks for working to troubleshoot with me. :)

Matthew-Hsu commented 9 years ago

It is quite interesting. The MAC address where it is failing at seems to be valid. I will look into it later, but I am guessing that maybe hostapd has some type of valid MAC range?

I think I have a workaround in mind that should be able to distinguish between driver issues VS invalid MACs and provide some type of recovery. Ideally, it would be great to log all the invalid MACs in a go, so it is easier to clean up later.