Closed kb1lqc closed 7 years ago
Ran a long duration test today and received the errors below. Test was started at 2017-07-21 07:48:29,348 and 2017-07-21 15:49:44,364 which is about 8 hours in duration before failure occurred.
2017-07-21 15:49:44,364 - APRS - ERROR - SendPosition
2017-07-21 15:49:44,364 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
2017-07-21 15:49:44,365 - APRS - ERROR - Sendtelemetry
2017-07-21 15:49:44,365 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
2017-07-21 15:49:44,365 - APRS - ERROR - SendTelemLabels
2017-07-21 15:49:44,365 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
2017-07-21 15:49:44,365 - APRS - ERROR - SendParameters
2017-07-21 15:49:44,365 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
2017-07-21 15:49:44,365 - APRS - ERROR - SendEquations
2017-07-21 15:49:44,365 - APRS - ERROR - [Errno 10054] An existing connection was forcibly closed by the remote host
APRS just kept sending once every minute and always failed after this time. Restarting APRS fixes the problem, no need to restart Proxy or Telemetry.
There are timeouts in UDP and TCP as to connections not 'kept alive. When was the last successful transmission, PRECEDING: 2017-07-21 15:49:44,364 - APRS - ERROR - SendPosition
if more than a few minutes -- I think the TCP expiration is 15 minutes under the RFC;s, for example, it may be necessary to tear down and re-start a connection. Some proxies make this even shorter for performance of capacity reasons
Yeah we send once per minute so it should be OK. I think the solution to this is to detect the failure and simply reconnect. APRS as an applications needs a shuffle in its architecture to do this. Pretty simple but also not the highest priority for me at the moment.
Bryce
From: Russ Herrold [mailto:notifications@github.com] Sent: Saturday, July 22, 2017 3:02 PM To: FaradayRF/Faraday-Software Faraday-Software@noreply.github.com Cc: Bryce Salmi bryce@faradayrf.com; Author author@noreply.github.com Subject: Re: [FaradayRF/Faraday-Software] Update APRS Application to Handle Network Errors Gracefully (#119)
There are timeouts in UDP and TCP as to connections not 'kept alive. When was the last successful transmission, PRECEDING: 2017-07-21 15:49:44,364 - APRS - ERROR - SendPosition
if more than a few minutes -- I think the TCP expiration is 15 minutes under the RFC;s, for example, it may be necessary to tear down and re-start a connection. Some proxies make this even shorter for performance of capacity reasons
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/FaradayRF/Faraday-Software/issues/119#issuecomment-317213302 , or mute the thread https://github.com/notifications/unsubscribe-auth/AAUPFBfFIrrfg2wEe3oNq27xE87sEXf9ks5sQnFbgaJpZM4MOERN . https://github.com/notifications/beacon/AAUPFBSicnz77e0d5CWK7DIJslEfp-VKks5sQnFbgaJpZM4MOERN.gif
OK picking this one back up. Per PR #266 I documented some work I did with aprsc to setup a local APRS-IS server I can connect to. This allows me to start and stop the server on demand such that I should be able to quickly test whether faraday-aprs
is robust enough to try to reconnect.
Shout out to @hessu & @PhirePhly for the help getting aprsc up and running!
I have a local aprsc server running on virtualbox which will be crucial to timely debugging of faraday-aprs
Alright, when I connect to aprsc and then stop the server with sudo service aprsc stop
I get an ERRNO 10053. Not what I saw before but close enough to know that it is related.
faraday-aprs
should be sitting there trying to reconnect. This is goal number 1!
I believe I figured it out in 5c47607!
To fix this issue I reorganized the connection scheme such that any IOError exceptions would cause the socket to close and return False. If the data was sent then the functions return True. This true/False is checked one layer up and if false then it will trigger a reconnect right from aprs.py
.
The use of aprsc was instrumental to letting me quickly play around with this reconnect issue and not have intervals of hours but instead seconds!
Leaving this on for a long duration test! The test is connecting to ROTATE.APRS.NET tro force internet use.
Hmm I left it on all day expecting it to work only to have it stop again! Running commit https://github.com/kb1lqc/Faraday-Software/commit/5c476074e46c152b45019d40114e36309d05d51f
⚠️ Looks like broken pipe ERRNO 10054, an ERRNO 11004, and an ERRNO 9 from the Python Socket library?
Seems to restart OK?
Well look at that. I had my own aprsc server running in the background on virtualhost (faraday-aprs
was not connected to it) and it indicates at 20:38 UTC the bit-rate dropped to essentially zero. This is aprsc so it's a pretty darn good program! It recovered well, however, that's and ironic time because 20:39 UTC is 1:38PM in Los Angeles. This is EXACTLY when faraday-aprs
lost connection in the logs above...
YES YES YES!!! @kb1lqd @reillyeon @el-iso @jbemenheiser
While I was out doing errands today leaving faraday-aprs
running with the latest commit on Issue #119 I have an unexpected network loss over the internet to APRS-IS. faraday-aprs
attempted to connect and then when network connnectivity returned it kept on plugging away!
You can see that aprs.fi indicates packets not making it in during that time:
It's still going strong sending position reports and telemetry back. I think this is going to finally be nice to setup as a base station!
Sweet! Let me know what kind of setup you workout and I may try to mimic it!
Additionally, I was thinking about W5KMP and my range test a few monthes back and I think we experienced an aprs dropout and didn't realize. When you look at our aprs path from the test it cutout long before we were done and we didn't notice at the time because we had started to head back and weren't watching the console closely. I bet that wouldn't be an issue anymore!
On Sep 9, 2017 3:11 PM, "Bryce Salmi" notifications@github.com wrote:
YES YES YES!!! @kb1lqd https://github.com/kb1lqd @reillyeon https://github.com/reillyeon @el-iso https://github.com/el-iso @jbemenheiser https://github.com/jbemenheiser
While I was out doing errands today leaving faraday-aprs running with the latest commit on Issue #119 https://github.com/FaradayRF/Faraday-Software/issues/119 I have an unexpected network loss over the internet to APRS-IS. faraday-aprs attempted to connect and then when network connnectivity returned it kept on plugging away!
[image: image] https://user-images.githubusercontent.com/331540/30243525-16e34b5e-9560-11e7-9f3a-ee84a6b3e982.png
You can see that aprs.fi indicates packets not making it in during that time: [image: image] https://user-images.githubusercontent.com/331540/30243539-516c3312-9560-11e7-8068-b33f20049131.png
It's still going strong sending position reports and telemetry back. I think this is going to finally be nice to setup as a base station!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/FaradayRF/Faraday-Software/issues/119#issuecomment-328300753, or mute the thread https://github.com/notifications/unsubscribe-auth/AXZdHkQBZuCP5aXKHZisxWhregLHM1Kvks5sgvDvgaJpZM4MOERN .
You are probably correct about the distance test. Sorry, this was a bug I probably should have fixed a long time ago. Very glad we're focusing on using it for projects now and forced to make Faraday actually nice to use 😆 .
This is a standard Faraday hardware setup running 83ae953c7e2df8de302d937b9ff86e3ffd47e630. I have faraday-proxy
, faraday-telemetry
, and faraday-aprs
configured and running to send APRS data to APRS-IS. That's it.
A few more network errors and it's still going! I think this is working 😀 !
About an hour after the first network error
Around 9:30PM another network error occurred...
This one happened at 9:05AM in the morning before my first report a few comments ago which occurred around 12:35PM!
I put an RF connected Faraday radio on in the area and it automatically started tracking both of them :)
That RF unit is also throwing out some bad data at times and Faraday Software is rolling on past it!
OK I ended the test. Here are the results @kb1lqd @el-iso @reillyeon 😃
Start: 2017-09-08 22:19:19,063 End: 2017-09-10 09:40:15,236 Duration: 1 Day, 11 Hours, 20 minutes, and 56 seconds (> 35 hours)
Number of Stations: 2 Number of network error events: 7
This is a total success. I've never seen a faraday-aprs
connection stay on this long.
24 Hour aprs.fi Telemetry Plots Below
I think this fix will address issues related to the following tickets Faraday-Software #119, Faraday-Software #190, Faraday-Software #199, Faraday-Software #201, Faraday-Software #262. A few of these tickets can likely be completely closed as a result of this testing and eventual PR!
Closed with #271 merge!
Summary
APRS application will fail with an ERRNO 10054 when network connections are lost and it tries to send telemetry packets to APRS-IS. This should fail gracefully and automatically reconnect when the connection is reestablished.
Problem Explanation
N/A
Environment
Software
Python
Hardware
N/A
Supporting Information
N/A