LX3JL / xlxd

HAM radio multiprotocol dstar reflector server
GNU General Public License v3.0
163 stars 111 forks source link

make ambed more robust to handle device errors #226

Closed narspt closed 1 year ago

narspt commented 1 year ago

This patch improves ambed to be able to handle and recover from device communication errors/timeouts.

Before this patch, when ambed failed to get reply from a device for some packet (1 or 2 depending of device fifo size...) then such device would not be usable anymore until restarting ambed (because ​m_iSpeechFifolLevel or m_iChannelFifolLevel vars would never decrement), the patch implements a timeout to restart these and a check to verify if the device is unresponsive, if yes then close and reopen/reinitialize the device. These problems are very frequent on some systems like Raspberry Pi requiring ambed to be restarted from time to time.

I did also include code to drop lagged packets on device queue after channel is closed, these lagged packets can accumulate if device have performance issues (like simultaneously using all channels full-duplex on DVStick33) or if some "weird" client on XLX sends excessive dv frames for some reason... it makes no sense to continue processing them after channel closed, and actually if same channel gets open again for other stream we need it free and not still processing lagged packets... This also solves the frequent "Unexpected transcoded packet received from ambed" seen on XLX log ( https://github.com/LX3JL/xlxd/issues/207 ), these unexpected packets actually belong to previous streams, these are usually retained (1 or 2 packets) waiting enough packets to fill fifo before restarting communication with device or lagged packets.

n5qm commented 1 year ago

Thank you for this!

I merged this into my local tree to see if it makes ambed more reliable. It needs frequent restarts on my VM in it's current state. I will report back in a few days on the differences.

narspt commented 1 year ago

@n5qm: I'm still curious if it did help with your problem? Btw, some of my patchs (incl. this one) are already merged on master.

jranma commented 1 year ago

@narspt @LX3JL I applied this patch. After a few weeks, my experience shows that it helps, but I still have to restart the system frequently, as the connection keeps getting lost and cannot come back automatically.

I use a transcoding key on a raspberry pi, with dynamic IP address. I think the problem is not even the change of IP, because this IP remains very stable.

Do you see other ways to improve? 73s.

narspt commented 1 year ago

@jranma: You need to restart the whole operating system on your raspberry pi !? What happens exactly? It's hard to say but I guess you have some other problem... a completely "blind" suggestion but always valid: make sure you have a good power supply and a good charging capable usb power cable to power your raspberry pi.

jranma commented 1 year ago

@narspt not the whole system. I need to restart xlxd and ambed (not sure if both need to be restarted after connexion in lost, and not sure which one needs to be restarted first...) Raspberry has been very stable so I would not blame the power supply.

my experience is that after some time the connection is lost between xlxd and ambed and fails to restore automatically. It seems to me that the latest patches help (at least, they don't hurt) but the problem is still not fully solved here. It would need further investigation...

narspt commented 1 year ago

@jranma: Try to identify if you need to restart xlxd or ambed to cure it, it's very unlikely that you need to restart both. Also you may post xlxd log of when it happens maybe it helps to identify what happened. (and surely make sure that you updated and recompiled both xlxd or ambed as both include fixes)

Btw, you did mention dynamic ip above... note that it would be a problem if ambed side ip changes, even if you use a dynamic dns host, please note that xlxd will not re-resolve the host name, it will resolve it only once at startup and use that ip forever, only when you restart xlxd it will resolve it to the new ip, maybe this is your problem? You may note down your ip and check if it is different when the problem happens.

jranma commented 1 year ago

@narspt We are making progress. The communication was not working between ambed (dynamic IP, raspberry) and XLXd (in the cloud). A restart of xlxd (only xlxd) has restored the communication.

On the other hand, the dynamic IP doesn't seem to have changed for a long time, so it seems unlikely that the problem comes from there.

something must stop the communication after a while (a few days?) but the logs are not very meaningful to understand the reason. The only line I found in the log is: "Apr 25 20:08:14 saturne xlxd: ambed openstream timeout" some seconds after this, the ambed log would be: "activity timeout"

Now I will try this workarround (cron job)

#!/bin/bash
if systemctl status xlxd | grep "ambed openstream timeout"; then
    systemctl restart xlxd
    echo "$(date) JS restart xlxd perte connexion ambed" >> /var/log/messages
else
    exit 0
fi
phl0 commented 1 year ago

xlxd logs to syslog bit with an easy config you can redirect that to a separate file. Config snippet from my rsyslog config (/etc/rsyslog.d/29-xlxd.conf):

if $programname == 'xlxd' then -/var/log/xlxd.log
if $programname == 'xlxd' then stop
narspt commented 1 year ago

@jranma: if you get "ambed openstream timeout" error and after that all streams crash (with high number of ambed timedout packets on log) until you restart xlxd then you may need this patch: https://github.com/LX3JL/xlxd/pull/222 (currently included on master)

You said above on your 1st post here "I applied this patch. After a few weeks, ..." then I guess you are using non-latest version from git master, with just this patch #226 manually applied? I suggest you instead get fresh and updated xlxd and ambed from master and recompile them both, they will include both patchs as they are now both merged on master, then test again. Btw, anyway note down your ip because when it changes you will loose communication (it will show "Transcoder keepalive timeout" on log), but that is another issue we may discuss later if relevant.

jranma commented 1 year ago

@narspt both patches seem to have been merged on march 1. I downloaded and recompiled on march 6 so I guess both patches are running on my systems now.

First time I saw "ambed openstream timeout" was on April 25th, almost two month after recompilation. Not that bad. So I can say that the system is quite stable, even if a disconneted happened and it didn't work anymore from the 25th.

I will test my "crontab" workarround. Should do the trick.

narspt commented 1 year ago

Out of ideas for your problem, sorry :(

jranma commented 1 year ago

@narspt Under what circumstances can the "ambed openstream timeout" message appear, after the patch is applied? we can exclude the change of IP address, which did not happen. Can we dig a little further?

narspt commented 1 year ago

(note: this is completely off-topic here, this PR is related to improve ambed to be able to deal with hardware device communication errors and we are discussing some issue apparently happening at xlxd side)

@jranma Every time a stream is started on reflector then xlxd sends an openstream packet to ambed to open a transcoding channel dedicated to transcode that new reflector stream, ambed should them reply with an ack (or reject) packet, xlxd waits for this ambed reply for 200ms* and if no reply comes back from ambed within this time it prints the "ambed openstream timeout" message and the reflector stream continues without transcoding. This happens independently of https://github.com/LX3JL/xlxd/pull/222 patch applied or not. If you read that PR description you can see that the patch is intended to solve a problem/bug that did happen when the ambed openstream reply is delayed (due to network latency) and still comes back to xlxd after the 200ms timeout, when this occurs it would leave xlxd in a crashed state and all subsequent streams would crash and show lots of timeout packets on xlxd log, requiring xlxd restart to fix.

In sum, actually the "ambed openstream timeout" message doesn't exactly mean a fatal error itself, it just means that reflector was unable to start a transcoding channel for a specific reflector stream, due to network latency or packet loss between xlxd and ambed, however without the https://github.com/LX3JL/xlxd/pull/222 patch it was very likely that after this happens the xlxd could be left in a crashed state (if ambed reply still comes to xlxd after timeout else if completely lost no problem) due to a bug, with the patch this should not happen and xlx continues to operate normally for next streams despite the message printed to log just as information.

*note: concerning this 200ms timeout, surely if we get such "ambed openstream timeout" messages frequently then it means that we have high latency between xlxd and ambed, in that case it may be convenient to tweak such timeout at ctranscoder.cpp file, changing the value of the AMBED_OPENSTREAM_TIMEOUT constant. I can tell you that I'm in the Europe and I allow an US guy to connect his xlx to my ambed with great success for almost one year now, but he had to tweak this timeout to 400ms because with 200ms timeout would occur from time to time, anyway this setup of xlx on US and ambed on EU is probably a rare case, but surprisingly it works very well apparently.

Concerning your problem, I'm out of ideas really... having the https://github.com/LX3JL/xlxd/pull/222 patch actually the "ambed openstream timeout" message on your log should not mean a problem but just an occasional timeout for that stream as I said above... you may be looking at that message but it may be unrelated to when your problem starts? hard to say... if you could post (or use pastebin) a whole log maybe I have some idea from something else...

jranma commented 1 year ago

@narspt thanks for all the info. I restarted xlxd yesterday and will watch for the "ambed openstream timeout" message. When I catch it, xlxd will simply be restarted. If it happens only 1-2 times per month, I'm fine with it. Not the most academic workarround, but my goal is to have a working solution. I'll let you know in some weeks.