Beckhoff / ADS

Beckhoff protocol to communicate with TwinCAT devices.
MIT License
502 stars 194 forks source link

Hot-plugging a PLC/ automatic re-connection #113

Closed jstiefel closed 4 years ago

jstiefel commented 4 years ago

I'm wondering if there's a way to proper handle "hot-pluggable" connections. If we start the PLC (CX9020) before the Ads library script, everything works fine. However, I'm interested in running a script on an embedded system and connect to the PLC as soon as it is available. The PLC can be turned on and off while the script is still running. I'm thinking about a loop, which lets us connect and disconnect the PLC.

Shortened pseudo code:

while true {
  if (!isConnected()) {
    AdsPortCloseEx();
    AdsDelRoute();
    AdsAddRoute();
    AdsPortOpenEx();
  }
}

If there's no connection, I'm basically trying to do a reset, since I don't know if there was a connection before. However, using AdsPortCloseEx() and AdsDelRoute() does not seem to bring me to the initial state.

I'm often getting two errors, depending on the IPs I choose:

Connect TCP socket failed with: 110

and

Connect TCP socket failed with: 113

While these errors are not alarming since I don't have a PLC connected, I'm wondering why I only get them on the initial connection. This makes me think that the "reset" alone is not working (Restarting the script works fine). Furthermore, I'm wondering how I can avoid 110 (time out), which blocks the code.

Any help would be appreciated.

pbruenn commented 4 years ago

Good question. I would have done it exactly like you did. AdsPortClose() is supposed to unregister any ADS notifications registered with this port and AdsDelRoute() should close the TCP connection. So a second call to AddRoute() should renew this TCP connection. I have to setup an example myself to check why this doesn't seem to work. I think without modification to AdsLib you cannot avoid 110. We have AdsSetTimeout() to control the amount of time waiting for read/write request. But that timeout currently has no effect on the initial TcpSocket::Connect

pbruenn commented 4 years ago

I am currently validating this patch:

diff --git a/AdsLib/Sockets.cpp b/AdsLib/Sockets.cpp
index 55371f4..f11a7db 100644
--- a/AdsLib/Sockets.cpp
+++ b/AdsLib/Sockets.cpp
@@ -193,7 +193,7 @@ uint32_t TcpSocket::Connect() const

     if (::connect(m_Socket, reinterpret_cast<const sockaddr*>(&m_SockAddress), sizeof(m_SockAddress))) {
         LOG_ERROR("Connect TCP socket failed with: " << WSAGetLastError());
-        return 0;
+        throw std::runtime_error("Connect TCP socket failed with: " + WSAGetLastError());
     }

     struct sockaddr_in source;
@@ -201,7 +201,7 @@ uint32_t TcpSocket::Connect() const

     if (getsockname(m_Socket, reinterpret_cast<sockaddr*>(&source), &len)) {
         LOG_ERROR("Read local tcp/ip address failed");
-        return 0;
+        throw std::runtime_error("Read local tcp/ip address failed");
     }
     LOG_INFO("Connected to " << ((addr & 0xff000000) >> 24) << '.' << ((addr & 0xff0000) >> 16) << '.' <<
              ((addr & 0xff00) >> 8) << '.' << (addr & 0xff));
-- 
jstiefel commented 4 years ago

Thanks for your quick response @pbruenn.

So a second call to AddRoute() should renew this TCP connection.

I think this is not happening right now. Is your patch addressing this issue? I'm not sure if I understand correctly, but the unsuccessful connection itself does not seem to be a problem.

I think without modification to AdsLib you cannot avoid 110. We have AdsSetTimeout() to control the amount of time waiting for read/write request. But that timeout currently has no effect on the initial TcpSocket::Connect

Thinking about it again, this blocking does not really seem to be a problem since during a timeout, I would still get a connection (I assume). What I'm curious about is why I sometimes end up with 110 and sometimes 113. For example, using the IP range of my second network card (192.168.2.130) and trying to connect to 192.168.2.136, I'm getting 110. Trying some random IPs all end up in 110. However, if I just change the first two digits ending up with 169.254.2.136, I'm getting 113.

pbruenn commented 4 years ago

Yes, that patch should fix your problem. Previously the connection timeout was not recognized by the AmsRouter and a new AmsConnection was established even if it hadn't a TCP connection associated. On your second AddRoute() the router would find the old (non working) AmsConnection and would just use it. Now, with this exception. You can catch it(or get an error code from AddRoute) and can retry until the connection succeeds. However the problem to detect a TCP disconnect during a running router still remains. Currently I have no resources to work on long standing problem. 110 and 113 is an error from your operating systems TCP/IP stack. AdsLib just forwards it. 110 = Connection timed out 113 = No route to host

jstiefel commented 4 years ago

Previously the connection timeout was not recognized by the AmsRouter and a new AmsConnection was established even if it hadn't a TCP connection associated.

Thanks for the explanation. I understand. Since I'm usually importing this repo directly from my CMake file, are you going to change this, too?

On your second AddRoute() the router would find the old (non working) AmsConnection and would just use it.

I don't have too much in-depth knowledge about your library yet since I just started working with it. But I think the problem with what I want to do is that there's currently no concept of re-initializing the objects responsible for the connection, right? (i.e. the static AmsRouter& GetRouter in the try of AdsAddRoute) So once this AdsAddRoute succeeds, the AmsConnection is set with the associated TCP connection. And AdsDelRoute does just close the connection, but not completely revert what AdsAddRoute has done.

Now, with this exception. You can catch it(or get an error code from AddRoute) and can retry until the connection succeeds.

Now, on initial connection, this works as expected. I'm looping until the PLC connects as soon as it is plugged in.

However the problem to detect a TCP disconnect during a running router still remains.

I see. The script keeps telling me that adding the route succeeds and the port is open still after unplugging the PLC. But I'm not sure if this is a problem. Once the PLC is connected for the first time, the TCP connection is properly set up. Now, if we physically disconnect it, I can set a "connected"-flag to false during a read error. Since the TCP connection is somehow approved now, it should just reconnect again once plugged-in as long as we don't switch the IP and AmsNetId during operation. Or am I missing something? Is it wrong to let the script add a route and open a port which are "non-existent"?

pbruenn commented 4 years ago

Yes, I just pushed the fix. So if you build from latest master, you should have it, now.

When the connection is lost and you get a read or write error. You would have to reinitialize the TCP connection manually. What should work is your strategy of AdsDelRoute; AdsAddRoute; that should renew the connection.

jstiefel commented 4 years ago

But,

What should work is your strategy of AdsDelRoute; AdsAddRoute; that should renew the connection.

does not fit:

On your second AddRoute() the router would find the old AmsConnection and would just use it.

And that's also what I'm seeing. Once connected and physically disconnected, AdsAddRoute() and AdsPortOpenEx() still succeed.

pbruenn commented 4 years ago

just to make sure we are talking about the same thing here:

jstiefel commented 4 years ago

Nearly. Since I'm using

while true {
  if (!isConnected()) {
    AdsPortCloseEx();
    AdsDelRoute();
    AdsAddRoute();
    AdsPortOpenEx();
  }
}

the scenario is as follows (adding the port opening and changes in bold):

Seems like I'm already doing what you propose. I'm just concerned about the two lines marked with "XX". Why do these report success without any connection? They won't do this on the initial connection. I would expect a timeout again.

pbruenn commented 4 years ago

Is this still happening after the patch?

jstiefel commented 4 years ago

I was wrong. AdsPortOpen now successfully throws the exception in both cases. Thank you for your help.