UniversalRobots / Universal_Robots_Client_Library

A C++ library for accessing the UR interfaces that facilitate the use of UR robotic manipulators by external applications.
Apache License 2.0
126 stars 94 forks source link

TCP server implementation #46

Closed fmauch closed 3 years ago

fmauch commented 3 years ago

This implements an event-based tcp server for easier use. This could be a solution for #45

In this commit I only implemented it for the ScriptSender, but it could also be used (probably with modifications) inside the ReverseInterface.

urmahp commented 3 years ago

This implements an event-based tcp server for easier use. This could be a solution for #45

In general I think this is a really good solution and I think the event-based tcp is a good approach. I have tested that it works with the ScriptSender, it works very well and it solves #45.

In this commit I only implemented it for the ScriptSender, but it could also be used (probably with modifications) inside the ReverseInterface.

As you mention I also think that it should be used inside the ReverseInterface. I could look into implementing it in the reverse interface if needed?

fmauch commented 3 years ago

As you mention I also think that it should be used inside the ReverseInterface. I could look into implementing it in the reverse interface if needed?

I went ahead and implemented it, but you could test it a bit more, if you'd like.

Currently, stopping the ROS driver with CTRL-C results in a RTDE pipeline producer overflow that I still need to hunt down

urmahp commented 3 years ago

I went ahead and implemented it, but you could test it a bit more, if you'd like.

I will test that it works with the reverse interface as well.

Currently, stopping the ROS driver with CTRL-C results in a RTDE pipeline producer overflow that I still need to hunt down

It is related to the pipeline running in the RTDEClient stopping the pipeline, when the object is destroyed, should fix that issue. Below example could be a solution:

RTDEClient::~RTDEClient()
{
  pipeline_.stop();
}
fmauch commented 3 years ago

It is related to the pipeline running in the RTDEClient stopping the pipeline, when the object is destroyed, should fix that issue. Below example could be a solution:

RTDEClient::~RTDEClient()
{
  pipeline_.stop();
}

Yep, that does the trick :-)

fmauch commented 3 years ago

@urmahp I'm done with implementing this ATM, please go ahead and test it as much as you can :-)

Before merging, I'd like to also refactor the tcp_client a bit more so it is more understandable and introduce unit tests for the server and client. Let me know, if you'd like to participate in one of these.

urmahp commented 3 years ago

@urmahp I'm done with implementing this ATM, please go ahead and test it as much as you can :-)

I will be happy to test it.

Before merging, I'd like to also refactor the tcp_client a bit more so it is more understandable and introduce unit tests for the server and client. Let me know, if you'd like to participate in one of these.

I can work on some unit test for the server and client, while testing it.

urmahp commented 3 years ago

I have tested that it works with both the ROS driver and the Isaac driver and overall it is working very well.

However I see one issue once in a while, it is not occuring every time, only once in a while. When stopping the urcap program running on the robot, it seems like the client is not correctly disconnected from the server, because we keep sending data, but there is no client connected. Also the handle_program_state function isn't called. When playing the program again I get the error that the client is already connected, so that also points towards the client not being disconnected correctly. I have inserted the log output below:

INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 73: New client connected at FD 12. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 85: Robot requested program 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 123: Sent program to robot 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 78: Client at FD 12 disconnected. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/reverse_interface.h 155: Robot connected to reverse interface. Ready to receive control commands. 
Program running: true

ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 305: Sending data through socket failed. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/reverse_interface.h 168: Connection to reverse interface dropped. 
Program running: false

INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 73: New client connected at FD 12. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 85: Robot requested program 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 123: Sent program to robot 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/script_sender.h 78: Client at FD 12 disconnected. 
INFO /home/mahp/git/github/forks/Universal_Robots_Client_Library/include/ur_client_library/comm/reverse_interface.h 155: Robot connected to reverse interface. Ready to receive control commands. 
Program running: true

ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 267: recv() failed 
ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 305: Sending data through socket failed. 
ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 305: Sending data through socket failed. 
ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 305: Sending data through socket failed. 

Unfortunately I haven't been able to track down the issue completely, but I believe it might have something to do with that the disconnect callback functions isn't called, because the server maybe is in a specific state at the point of disconnect. The error message ERROR /home/mahp/git/github/forks/Universal_Robots_Client_Library/src/comm/tcp_server.cpp 267: recv() failed is only printed once the error occurs.

fmauch commented 3 years ago

I'll try to look into it tonight.

fmauch commented 3 years ago

Hm, I once hit

[ INFO] [1615925511.466963882]: Client at FD 23 disconnected.
[ INFO] [1615925511.491965322]: Robot connected to reverse interface. Ready to receive control commands.
[ WARN] [1615925514.072904746]: Connection attempt on port 50001 while maximum number of clients (1) is already connected. Closing connection.
[ERROR] [1615925514.225187393]: recv() failed
[ERROR] [1615925514.225742741]: Sending data through socket failed.
[ERROR] [1615925514.233970460]: Sending data through socket failed.
[... more of those]

I finally found a way to reproduce this:

  1. Play the program on the TP -> Wireshark shows me that the machines are exchanging [PSH, ACK] messages
  2. Pause the program on the TP -> Communication is stopped on a TCP level, where the robot sends a last [ACK] package.
  3. Stop the program on the TP -> ´recv() failed` on the reverse interface. -> This will close the socket connection. Wireshark shows me that the robot sends a [RST, ACK] package on tcp level.
  4. Play the program on the TP -> Sending data fails. The connection attempt from the robot is declined (closed again immediately) leaving us with no active socket on both ends. I guess, the script hangs in this line as we didn't see the log message "ExternalControl: External control active" on the TP)

Open questions as for now:

  1. Is this different with the current master branch (I assume, the answer is "yes") Edit: -> Yes. The master branch works fine. On a TCP package level things behave the same (Why wouldn't they?) and the connection gets closed properly when pressing "stop" on the TP.
  2. Why is it that way? Edit2: -> This line returns false, as UrServer.readLine() fails due to ::read fails here. All this will end [this loop] reconstructing the complete ReverseInterface. I was quite happy to get rid of this loop, destruct, reset structure... Anyway, in my new implementation we do close the socket in case of a read failure, but we don't call the disconnectCallback here. Therefore, the ReverseInterface doesn't know it's not connected anymore.
  3. Where does that [RST, ACK] message come from? Edit2: -> Honestly I don't care, as it should be working with that, anyway. Edit3: -> Actually, I do care, as this produces a read failure on our side of the socket.
  4. How do we handle this? Edit3: -> 341b34f should fix it. However, in that case we still get an error output that reading from the socket failed which is related to question 3 above. Edit 4: -> By excluding errno ECONNRESET we can suppress the output in that case.
urmahp commented 3 years ago

Thank you very much @fmauch for looking into the issue and fixing it.

I apologize that I haven't time to look into the unit test yet, but I can look into it today if needed. What did you have in mind with the unit test? Was it something similar to the test_rtde_client test, testing that a UrDriver object can be created and testing that connection can be established between a robot and the client library or was it to test the tcp_server class you have implemented?

fmauch commented 3 years ago

I apologize that I haven't time to look into the unit test yet, but I can look into it today if needed. What did you have in mind with the unit test?

Nothing to apologize. I'd like to test the complete server structure. Do connects / disconnects get handled correctly? Do the correct callbacks get called? Are messages correctly transmitted? Are all error cases handled properly?

Additionally, the current keepalive implementation is not really satisfying. If we connect to the reverse interface, it will be happy as long as we don't disconnect. We should check whether we actually receive a keepalive signal. We should get a keepalive signal (basically any message currently) from the robot before we send the next message. I'd like to implement this using a unit test that should trigger a disconnect event when the keepalive signal is missing. @urmahp would you like to implement this?

urmahp commented 3 years ago

I will start developing some unit test to the server now, I will create a pr to your branch once I have some test. I will look into the keepalive implementation as well.

fmauch commented 3 years ago

@urmahp I'll have some time to work on this today and tomorrow. If you didn't start with the keepalive implementation, I'll start doing that.

urmahp commented 3 years ago

@urmahp I'll have some time to work on this today and tomorrow. If you didn't start with the keepalive implementation, I'll start doing that.

I haven't started the keepalive implementation.

I have added a PR to your branch on some unit test for the tcp server.

fmauch commented 3 years ago

I've created a draft implementation of the keepalive handling here. Let's keep discussions on this there.

fmauch commented 3 years ago

@urmahp If you could take another review over the PR that would be really nice.

urmahp commented 3 years ago

I will test how it is working and I will also look into the sigpipe error which made one of the tcp server test fail on this build.

fmauch commented 3 years ago

I will test how it is working and I will also look into the sigpipe error which made one of the tcp server test fail on this build.

I think, I found the error:

When running the test standalone with output in rare cases (I needed to run it up to 2500 times to trigger) I get this failure:

[ RUN      ] TCPServerTest.client_connections
/home/mauch/robot_folders/checkout/ur_robot_driver_split/catkin_ws/src/Universal_Robots_Client_Library/tests/test_tcp_server.cpp:319: Failure
Value of: server.write(client1_fd, data, len, written)
  Actual: true
Expected: false

I think, that this is the problematic code block:

void TCPServer::handleDisconnect(const int fd)
{
  URCL_LOG_DEBUG("%d disconnected.", fd);
  close(fd);
  if (disconnect_callback_)
  {
    disconnect_callback_(fd);
  }
  FD_CLR(fd, &masterfds_);

 [...]
}

Calling the disconnect_callback before closing the socket might lead to a situation where an application thinks that the socket is closed when it is actually not. Therefore, this check returns true, while it should return false. The actual send command seems to be happily feeding things in and returning success, but this actually results in a SIGPIPE event.

I'll leave the test looping over night and it should not have failed tomorrow morning.

fmauch commented 3 years ago

95000 successful tests later I would say that this was indeed the problem.

urmahp commented 3 years ago

95000 successful tests later I would say that this was indeed the problem.

Nice thanks for solving the issue.

I have tested that it works, and it is working very well :smiley:

I believe these three variable declaration could be deleted as they aren't used anymore in the UrDriver.

I think that when merging this line should be removed from the ROS driver, else it would log a warning every control cycle, because the ROS driver is still using the script from the resource folder in the ROS driver.

fmauch commented 3 years ago

95000 successful tests later I would say that this was indeed the problem.

Nice thanks for solving the issue.

It was me introducing the issue, anyway ;-) Thanks for writing the tests that showed that :-)

I believe these three variable declaration could be deleted as they aren't used anymore in the UrDriver.

Addressed in 587a584

I think that when merging this line should be removed from the ROS driver, else it would log a warning every control cycle, because the ROS driver is still using the script from the resource folder in the ROS driver.

Addressed in https://github.com/UniversalRobots/Universal_Robots_ROS_Driver/pull/346

fmauch commented 3 years ago

I cleaned up the branch, from my side this is ready to merge.

urmahp commented 3 years ago

I cleaned up the branch, from my side this is ready to merge.

I agree.