g4klx / MMDVMHost

The host program for the MMDVM
GNU General Public License v2.0
369 stars 269 forks source link

Mmdvmhost stop works if lan goes down or disconnect ethernet cable #181

Closed iw9grl closed 7 years ago

iw9grl commented 7 years ago

Hi Recently with some Om friends, we noticed that if some trouble poweroff the Lan router connected to raspberry or disconnect manually the ethernet cable, mmdvmhost stop working and tg9 doesn't work. Is this issue known? 73

borjam commented 7 years ago

Yes. There is some problem with the reconnection logic.

It enters an endless loop. In my case I have observed it when resetting the wireless access point. And it may be related to DHCP, as a DHCP failure will make the interface acquire a self-assigned IP address.

E: 2016-10-06 15:26:57.668 Error returned from sendto, err: 101 E: 2016-10-06 15:26:57.680 Socket has failed when writing data to the master, re trying connection M: 2016-10-06 15:26:57.680 Closing DMR Network E: 2016-10-06 15:26:57.680 Error returned from sendto, err: 101 E: 2016-10-06 15:26:57.680 Socket has failed when writing data to the master, re trying connection M: 2016-10-06 15:26:57.680 Closing DMR Network E: 2016-10-06 15:26:57.681 Error returned from sendto, err: 101 E: 2016-10-06 15:26:57.681 Socket has failed when writing data to the master, re trying connection M: 2016-10-06 15:26:57.681 Closing DMR Network E: 2016-10-06 15:26:57.681 Error returned from sendto, err: 101 E: 2016-10-06 15:26:57.681 Socket has failed when writing data to the master, re trying connection M: 2016-10-06 15:26:57.681 Closing DMR Network E: 2016-10-06 15:26:57.681 Error returned from sendto, err: 101 E: 2016-10-06 15:26:57.681 Socket has failed when writing data to the master, re trying connection

g4klx commented 7 years ago

The MMDVM was never designed to run on an unreliable network connection. If your connection is unreliable expect problems.

borjam commented 7 years ago

Not criticizing :)

Anyway I'll have a look at it this weekend if possible. Some retry logic to just reconnect after a delay of several seconds should solve this.

g4klx commented 7 years ago

I think there is already retry logic in there, so I don't know why it isn't working. Git pulls are gladly accepted.

borjam commented 7 years ago

Sure! I'll pull your Git, not your leg :)

iw9grl commented 7 years ago

Hi Jonathan, thank you for the answer . I know that many updates ago this problem was not present, the MMDVMHost wrote in the log that the connection to the master was fail, ok I didn't worry about this, but the repeater locally was up ! Now something changes inside the code ..... if the network fails, the repeater stop to work and in some cases, when the network goes up mmdvmhost is not reachable remotely ; if the repeater is placed on the summit of a mountain it is a big problem !

thank you for your attention 73

g4klx commented 7 years ago

I think Tony G0WFV did that work. You should mention it on the MMDVM group and see if he responds. I don't remember anything changing in the DMR networking code for ages now.

iw9grl commented 7 years ago

OK , I ask to Tony ..... GRAZIE !!!!!!

73

g0wfv commented 7 years ago

For some reason my emails didn't make it through to here (So I'll C&P my reply manually!)


It's OK, I'm here too ...

Looking at the commits, I made the original change to the code back on 14 May to alleviate the errors when the host went into a endless loop of being unable to communicate with the modem when in fact it was communicating. The network connection would timeout and you would then lose the network connection until a restart so logic was added to retry the connection. This was further refined by Jon on 16 May to add a delay when attempting to reconnect.

The original files (DMRIPSC.cpp / h) were renamed to DMRNetwork.cpp / h on 14 Sep, but this section of code has remained unchanged since mid-May and continues to work as long as the network becomes available again. log.txt

I am unable to alter this any further as I do not have a repeater system to test against as I only run a DVMega hotspot - It's impossible for me to reproduce the symptoms locally!

That said (and this is a separate issue), I just went into my system to reproduce the network side of the issue and found the host had lost communications with the modem. A restart has failed to re-establish modem communications and this will require a full power-cycle of the RPi/DVmega to solve. This is happening at random intervals on both my DVMega systems and is a recent occurrence. I initially thought it was to do with the new YSF capable DVMega firmware so downgraded back to the last DMR/DStar only firmware (V2.29) but this is still occuring leading me to suspect a change to the modem code has created an issue?! As I say, something to keep an eye on separately!

Anyhoo, I shal re-cycle the power and investigate a means to alleviate the problem - obviously we need the network, else you're left with an unconnected repeater, but on the the other hand if the network is down for a protracted period, it would be nice to have some functionality, even if it's local only! I'll have a look.


Since writing the above email (which didn't make it to github!) I've had a little play with what I have available here. Here are my notes and assumptions ...

This isn't being generated by the section of code I wrote to solve the modem communication issues and master re-connect thereafter (unless Jon propagated the logic to other parts of the code too?)

Socket error 101 - Network unreachable

When you pull the LAN cable or restart your router, eth0 will go down (in my case unplug the WiFi dongle and wlan0 goes down) and as a consequence, it will alter the routing tables. Subsequently, the socket cannot send data to the master (it has not route!) and tells you.

For some reason, it is retrying immediately (as can be seen from your log excerpt - every 1ms!!) even though it uses the same retry timer - m_retryTimer(1000U, 10U) i can't recall how how Jon's timer code works, but I think this is every 10 seconds? - defined in the code as the other socket disconnection catching routines.

This will no doubt be hogging all the host's processing in an endless loop (there is an end - more in a bit!) therefore there is no other processing able to take place (which will include the processing and repeating of local RF and communication with the modem).

I can't see why this is not working, it doesn't seem to be working as intended in this case?

As I said there is an end to the loop - when the network is restored, it recovers as designed and the error stops ...

E: 2016-10-21 13:07:02.227 Socket has failed when writing data to the master, retrying connection
M: 2016-10-21 13:07:02.227 Closing DMR IPSC
E: 2016-10-21 13:07:02.227 Error returned from sendto, err: 101
E: 2016-10-21 13:07:02.227 Socket has failed when writing data to the master, retrying connection
M: 2016-10-21 13:07:02.227 Closing DMR IPSC
E: 2016-10-21 13:07:02.227 Error returned from sendto, err: 101
E: 2016-10-21 13:07:02.227 Socket has failed when writing data to the master, retrying connection
M: 2016-10-21 13:07:02.227 Closing DMR IPSC
E: 2016-10-21 13:07:02.228 Error returned from sendto, err: 101
E: 2016-10-21 13:07:02.228 Socket has failed when writing data to the master, retrying connection
M: 2016-10-21 13:07:02.228 Closing DMR IPSC
E: 2016-10-21 13:07:02.228 Error returned from sendto, err: 101
E: 2016-10-21 13:07:45.091 No reply from the modem for some time, resetting it
M: 2017-10-21 13:07:45.092 Closing the MMDVM
M: 2016-10-21 13:07:47.094 Opening the MMDVM
I: 2016-10-21 13:07:49.105 MMDVM protocol version: 1, description: DVMEGA HR2.29
E: 2016-10-21 13:07:49.126 Connection to the master has timed out, retrying connection
M: 2016-10-21 13:07:49.126 Closing DMR IPSC
M: 2016-10-21 13:07:55.096 Opening DMR IPSC
M: 2016-10-21 13:08:05.521 Logged into the master successfully
M: 2016-10-21 13:08:26.014 DMR Slot 2, received network voice header from G3UEQ to TG 9
M: 2016-10-21 13:08:27.590 DMR Slot 2, received network end of voice transmission, 1.6 seconds, 0% packet loss, BER: 0.0%

This isn't exactly helpful as whilst the network is down, I can see that it would be advantageous to still have local repeater communication available which is currently being prevented by the reconnect attempts every millisecond!

Personally, I can't see the quick fix, but Jon might ....

Jon - It's the CDMRNetwork::write function at line 527 in DMRNetwork.cpp that's the offender.

Tony

iw9grl commented 7 years ago

Hi Tony

very detailed explanation, thank you . I'm really sorry I can not help you, but I'm not able to understand in detail the code, but I can test everything you need on my system .

I think this is a big problem on the system considering that many repeaters are located in a very far location not easy to reach ; a dtmf remote control in this case is a must to reboot the rpi ( I'm working on that ) .

Anyway, I hope that a fast solution will be find .... Have a nice day Daniele IW9GRL

g0wfv commented 7 years ago

It's OK, I'm here too ...

Looking at the commits, I made the original change to the code back on 14 May to alleviate the errors when the host went into a endless loop of being unable to communicate with the modem when in fact it was communicating. The network connection would timeout and you would then lose the network connection until a restart so logic was added to retry the connection. This was further refined by Jon on 16 May to add a delay when attempting to reconnect.

The original files (DMRIPSC.cpp / h) were renamed to DMRNetwork.cpp / h on 14 Sep, but this section of code has remained unchanged since mid-May and continues to work as long as the network becomes available again.

I am unable to alter this any further as I do not have a repeater system to test against as I only run a DVMega hotspot - It's impossible for me to reproduce the symptoms locally!

That said (and this is a separate issue), I just went into my system to reproduce the network side of the issue and found the host had lost communications with the modem. A restart has failed to re-establish modem communications and this will require a full power-cycle of the RPi/DVmega to solve. This is happening at random intervals on both my DVMega systems and is a recent occurrence. I initially thought it was to do with the new YSF capable DVMega firmware so downgraded back to the last DMR/DStar only firmware (V2.29) but this is still occuring leading me to suspect a change to the modem code has created an issue?! As I say, something to keep an eye on separately!

Anyhoo, I shal re-cycle the power and investigate a means to alleviate the problem - obviously we need the network, else you're left with an unconnected repeater, but on the the other hand if the network is down for a protracted period, it would be nice to have some functionality, even if it's local only! I'll have a look.

Tony

On Fri, Oct 21, 2016 at 2:09 PM Jonathan Naylor notifications@github.com wrote:

I think Tony G0WFV did that work. You should mention it on the MMDVM group and see if he responds. I don't remember anything changing in the DMR networking code for ages now.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/g4klx/MMDVMHost/issues/181#issuecomment-255354976, or mute the thread https://github.com/notifications/unsubscribe-auth/AQf_4n3PbL53ewtHZEXduxAMP52zTUH1ks5q2J1-gaJpZM4KclTs .

A J Corbett

G0WFV

borjam commented 7 years ago

On Fri, Oct 21, 2016 at 3:57 PM, Tony Corbett notifications@github.com wrote:

For some reason my emails didn't make it through to here (So I'll C&P my

reply manually!)

Anyhoo, I shal re-cycle the power and investigate a means to alleviate the problem - obviously we need the network, else you're left with an unconnected repeater, but on the the other hand if the network is down for a protracted period, it would be nice to have some functionality, even if it's local only! I'll have a look.

When I have stumbled upon this issue, it didn´t require a power cycle. Just stopping and restarting mmdvmhost was enough.


This isn't being generated by the section of code I wrote to solve the modem communication issues and master re-connect thereafter (unless Jon propagated the logic to other parts of the code too?)

Socket error 101 - Network unreachable

When you pull the LAN cable or restart your router, eth0 will go down (in my case unplug the WiFi dongle and wlan0 goes down) and as a consequence, it will alter the routing tables. Subsequently, the socket cannot send data to the master (it has not route!) and tells you.

For some reason, it is retrying immediately (as can be seen from your log excerpt - every 1ms!!) even though it uses the same retry timer - m_retryTimer(1000U, 10U) i can't recall how how Jon's timer code works, but I think this is every 10 seconds? - defined in the code as the other socket disconnection catching routines.

This will no doubt be hogging all the host's processing in an endless loop (there is an end - more in a bit!) therefore there is no other processing able to take place (which will include the processing and repeating of local RF and communication with the modem).

I can't see why this is not working, it doesn't seem to be working as intended in this case?

As I said there is an end to the loop - when the network is restored, it recovers as designed and the error stops ... This isn't exactly helpful as whilst the network is down, I can see that it would be advantageous to still have local repeater communication available which is currently being prevented by the reconnect attempts every millisecond!

I've attached a full log from my system for analysis, but the pertinent part is reproduced above.

Personally, I can't see the quick fix, but Jon might ....

I'm looking at it. My gut feeling is that the problem might be related to the state changes of an interface when it loses the network.

Imagine this situation:

With the system connected (and remember that many installations rely on DHCP) the interface goes down. You have a socket bound to an IP address and connected to the master.

The network interface comes up again. However, the interface comes back with no network address until DHCP kicks in. Moreover, in case there is some network trouble (like, you are rebooting everything after a power cut) the interface might even get a temporary local "autoconfiguration" IP address, which surely won´t allow it to reach the Internet.

So, the proper recovery option in such a case is to destroy the socket, create a new one, and bind each time.

In my case the problem appeared when I rebooted an access point. My Raspberry is connected via WiFi. And there is a delay between the interface going "up" (associating with an AP) and having a valid IP address (DHCP completed).

I've been looking at it, but I'm not familiar with how the timer logic works in this program and, while I've programmed in C for more than 20 years, my C++ experience is limited to a small project 10 years ago.

Borja.

phl0 commented 7 years ago

Tony,

I could help you out if you need someone to test. Got an MMDVM in full duplex setup here on the bench ;-)

vy73 de Florian DF2ET

borjam commented 7 years ago

I've observed two different behaviors related to connectivity problems.

When the network interface goes down (in my case, a wireless adapter) the program gets crazy and enters an endless loop of retries.

However, the Spanish master for the Brandmeister network was down for an hour yesterday and it reconnected perfectly. It retried in 30 second invervals.

g0wfv commented 7 years ago

It's dependant on which error loop is entered in the code.....

On Fri, 28 Oct 2016, 13:41 Borja Marcos EA2EKH, notifications@github.com wrote:

I've observed two different behaviors related to connectivity problems.

When the network interface goes down (in my case, a wireless adapter) the program gets crazy and enters an endless loop of retries.

However, the Spanish master for the Brandmeister network was down for an hour yesterday and it reconnected perfectly. It retried in 30 second invervals.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/g4klx/MMDVMHost/issues/181#issuecomment-256890386, or mute the thread https://github.com/notifications/unsubscribe-auth/AQf_4mXE1ji87smljHwt-fKDBmqni4R-ks5q4dFHgaJpZM4KclTs .

A J Corbett

G0WFV

shawnchain commented 6 years ago

same issue here with the latest snapshot code. E: 2018-01-14 11:16:42.845 Error returned from sendto, err: 13 E: 2018-01-14 11:16:42.846 DMR, Socket has failed when writing data to the master, retrying connection M: 2018-01-14 11:16:42.846 DMR, Closing DMR Network M: 2018-01-14 11:16:42.846 DMR, Opening DMR Network E: 2018-01-14 11:16:52.853 Error returned from sendto, err: 13 E: 2018-01-14 11:16:52.853 DMR, Socket has failed when writing data to the master, retrying connection M: 2018-01-14 11:16:52.853 DMR, Closing DMR Network

g0wfv commented 6 years ago

This isn't an issue .....

As soon as you plug the cable in again it will (or should) recover. Error 13 is basically trying you that it can't access the network interface because it is inactive having been unplugged!

--

A J Corbett G0WFV

yo6nam commented 6 years ago

Workaround for this issue. Add to your cron : /10 * root /usr/bin/tail -4 /path_to/MMDVM-date -I.log | grep "err:" && systemctl restart mmdvmhost.service