FaradayRF / Faraday-Software

Faraday node software
https://www.faradayrf.com
Other
48 stars 19 forks source link

Update APRS Application to Handle Network Errors Gracefully #119

Closed kb1lqc closed 7 years ago

kb1lqc commented 7 years ago

Summary

APRS application will fail with an ERRNO 10054 when network connections are lost and it tries to send telemetry packets to APRS-IS. This should fail gracefully and automatically reconnect when the connection is reestablished.

Problem Explanation

N/A

Environment

Software

Python

Hardware

N/A

Supporting Information

N/A

kb1lqc commented 7 years ago

Ran a long duration test today and received the errors below. Test was started at 2017-07-21 07:48:29,348 and 2017-07-21 15:49:44,364 which is about 8 hours in duration before failure occurred.

2017-07-21 15:49:44,364 - APRS - ERROR - SendPosition
2017-07-21 15:49:44,364 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
2017-07-21 15:49:44,365 - APRS - ERROR - Sendtelemetry
2017-07-21 15:49:44,365 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
2017-07-21 15:49:44,365 - APRS - ERROR - SendTelemLabels
2017-07-21 15:49:44,365 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
2017-07-21 15:49:44,365 - APRS - ERROR - SendParameters
2017-07-21 15:49:44,365 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
2017-07-21 15:49:44,365 - APRS - ERROR - SendEquations
2017-07-21 15:49:44,365 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host

APRS just kept sending once every minute and always failed after this time. Restarting APRS fixes the problem, no need to restart Proxy or Telemetry.

herrold commented 7 years ago

There are timeouts in UDP and TCP as to connections not 'kept alive. When was the last successful transmission, PRECEDING: 2017-07-21 15:49:44,364 - APRS - ERROR - SendPosition

if more than a few minutes -- I think the TCP expiration is 15 minutes under the RFC;s, for example, it may be necessary to tear down and re-start a connection. Some proxies make this even shorter for performance of capacity reasons

kb1lqc commented 7 years ago

Yeah we send once per minute so it should be OK. I think the solution to this is to detect the failure and simply reconnect. APRS as an applications needs a shuffle in its architecture to do this. Pretty simple but also not the highest priority for me at the moment.

Bryce

From: Russ Herrold [mailto:notifications@github.com] Sent: Saturday, July 22, 2017 3:02 PM To: FaradayRF/Faraday-Software Faraday-Software@noreply.github.com Cc: Bryce Salmi bryce@faradayrf.com; Author author@noreply.github.com Subject: Re: [FaradayRF/Faraday-Software] Update APRS Application to Handle Network Errors Gracefully (#119)

There are timeouts in UDP and TCP as to connections not 'kept alive. When was the last successful transmission, PRECEDING: 2017-07-21 15:49:44,364 - APRS - ERROR - SendPosition

if more than a few minutes -- I think the TCP expiration is 15 minutes under the RFC;s, for example, it may be necessary to tear down and re-start a connection. Some proxies make this even shorter for performance of capacity reasons

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/FaradayRF/Faraday-Software/issues/119#issuecomment-317213302 , or mute the thread https://github.com/notifications/unsubscribe-auth/AAUPFBfFIrrfg2wEe3oNq27xE87sEXf9ks5sQnFbgaJpZM4MOERN . https://github.com/notifications/beacon/AAUPFBSicnz77e0d5CWK7DIJslEfp-VKks5sQnFbgaJpZM4MOERN.gif

kb1lqc commented 7 years ago

OK picking this one back up. Per PR #266 I documented some work I did with aprsc to setup a local APRS-IS server I can connect to. This allows me to start and stop the server on demand such that I should be able to quickly test whether faraday-aprs is robust enough to try to reconnect.

Shout out to @hessu & @PhirePhly for the help getting aprsc up and running!

I have a local aprsc server running on virtualbox which will be crucial to timely debugging of faraday-aprs image

kb1lqc commented 7 years ago

Alright, when I connect to aprsc and then stop the server with sudo service aprsc stop I get an ERRNO 10053. Not what I saw before but close enough to know that it is related.

image

faraday-aprs should be sitting there trying to reconnect. This is goal number 1!

kb1lqc commented 7 years ago

I believe I figured it out in 5c47607!

To fix this issue I reorganized the connection scheme such that any IOError exceptions would cause the socket to close and return False. If the data was sent then the functions return True. This true/False is checked one layer up and if false then it will trigger a reconnect right from aprs.py.

The use of aprsc was instrumental to letting me quickly play around with this reconnect issue and not have intervals of hours but instead seconds!

Leaving this on for a long duration test! The test is connecting to ROTATE.APRS.NET tro force internet use.

kb1lqc commented 7 years ago

Hmm I left it on all day expecting it to work only to have it stop again! Running commit https://github.com/kb1lqc/Faraday-Software/commit/5c476074e46c152b45019d40114e36309d05d51f

⚠️ Looks like broken pipe ERRNO 10054, an ERRNO 11004, and an ERRNO 9 from the Python Socket library?

image

image

Seems to restart OK? image

kb1lqc commented 7 years ago

Well look at that. I had my own aprsc server running in the background on virtualhost (faraday-aprs was not connected to it) and it indicates at 20:38 UTC the bit-rate dropped to essentially zero. This is aprsc so it's a pretty darn good program! It recovered well, however, that's and ironic time because 20:39 UTC is 1:38PM in Los Angeles. This is EXACTLY when faraday-aprs lost connection in the logs above...

image

kb1lqc commented 7 years ago

YES YES YES!!! @kb1lqd @reillyeon @el-iso @jbemenheiser

While I was out doing errands today leaving faraday-aprs running with the latest commit on Issue #119 I have an unexpected network loss over the internet to APRS-IS. faraday-aprs attempted to connect and then when network connnectivity returned it kept on plugging away!

image

You can see that aprs.fi indicates packets not making it in during that time: image

It's still going strong sending position reports and telemetry back. I think this is going to finally be nice to setup as a base station!

el-iso commented 7 years ago

Sweet! Let me know what kind of setup you workout and I may try to mimic it!

Additionally, I was thinking about W5KMP and my range test a few monthes back and I think we experienced an aprs dropout and didn't realize. When you look at our aprs path from the test it cutout long before we were done and we didn't notice at the time because we had started to head back and weren't watching the console closely. I bet that wouldn't be an issue anymore!

On Sep 9, 2017 3:11 PM, "Bryce Salmi" notifications@github.com wrote:

YES YES YES!!! @kb1lqd https://github.com/kb1lqd @reillyeon https://github.com/reillyeon @el-iso https://github.com/el-iso @jbemenheiser https://github.com/jbemenheiser

While I was out doing errands today leaving faraday-aprs running with the latest commit on Issue #119 https://github.com/FaradayRF/Faraday-Software/issues/119 I have an unexpected network loss over the internet to APRS-IS. faraday-aprs attempted to connect and then when network connnectivity returned it kept on plugging away!

[image: image] https://user-images.githubusercontent.com/331540/30243525-16e34b5e-9560-11e7-9f3a-ee84a6b3e982.png

You can see that aprs.fi indicates packets not making it in during that time: [image: image] https://user-images.githubusercontent.com/331540/30243539-516c3312-9560-11e7-8068-b33f20049131.png

It's still going strong sending position reports and telemetry back. I think this is going to finally be nice to setup as a base station!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/FaradayRF/Faraday-Software/issues/119#issuecomment-328300753, or mute the thread https://github.com/notifications/unsubscribe-auth/AXZdHkQBZuCP5aXKHZisxWhregLHM1Kvks5sgvDvgaJpZM4MOERN .

kb1lqc commented 7 years ago

You are probably correct about the distance test. Sorry, this was a bug I probably should have fixed a long time ago. Very glad we're focusing on using it for projects now and forced to make Faraday actually nice to use 😆 .

This is a standard Faraday hardware setup running 83ae953c7e2df8de302d937b9ff86e3ffd47e630. I have faraday-proxy, faraday-telemetry, and faraday-aprs configured and running to send APRS data to APRS-IS. That's it.

kb1lqc commented 7 years ago

A few more network errors and it's still going! I think this is working 😀 !

About an hour after the first network error image

Around 9:30PM another network error occurred... image

This one happened at 9:05AM in the morning before my first report a few comments ago which occurred around 12:35PM! image

kb1lqc commented 7 years ago

I put an RF connected Faraday radio on in the area and it automatically started tracking both of them :)

image

That RF unit is also throwing out some bad data at times and Faraday Software is rolling on past it! image

kb1lqc commented 7 years ago

OK I ended the test. Here are the results @kb1lqd @el-iso @reillyeon 😃

Start: 2017-09-08 22:19:19,063 End: 2017-09-10 09:40:15,236 Duration: 1 Day, 11 Hours, 20 minutes, and 56 seconds (> 35 hours)

Number of Stations: 2 Number of network error events: 7

This is a total success. I've never seen a faraday-aprs connection stay on this long.

24 Hour aprs.fi Telemetry Plots Below

image

image

Conclusion

I think this fix will address issues related to the following tickets Faraday-Software #119, Faraday-Software #190, Faraday-Software #199, Faraday-Software #201, Faraday-Software #262. A few of these tickets can likely be completely closed as a result of this testing and eventual PR!

kb1lqc commented 7 years ago

Closed with #271 merge!