UniversalRobots / Universal_Robots_ROS_Driver

Universal Robots ROS driver supporting CB3 and e-Series
Apache License 2.0
765 stars 405 forks source link

Connection problems with UR5 CB3 robot #182

Open stefanz-tri opened 4 years ago

stefanz-tri commented 4 years ago

Hi!

I experience the following problems while moving a UR5 robot with MoveIt using the ur_robot_driver:

The system operates normally at first. After about 4 minutes (sometimes earlier) the robot starts reconnecting showing the following in the ROS log:

[ INFO] [1589881314.565348304]: Connection to robot dropped, waiting for new connection.
[ INFO] [1589881314.581968699]: Robot ready to receive control commands.

On the teach pendant the log shows "Stopping communication and control".

After the first reconnect the connection is unstable and more reconnects follow. While the system continues to operate after the first reconnect, it is much slower and more unreliable than before (e.g. positions are just not approached without any error message in the log). After a short while it is not really usable anymore.

This behaviour always occurs when running the system and always has the same timing more or less. I do not use the teach pendant during the operation or do any kind of manual robot move.

Does anyone have an Idea what could cause this? Do you think the hardware limitations described below are enough to cause these problems? (I have my doubts because the timing is always the same, the controller does not work normal after the reconnect and RTDE works normally.)

Setup

Robot: UR5 CB3 3.9.1.64192 ROS: Kinetic System: Ubuntu 16.04, Native (No VM) Hardware: Standard Laptop, no Realtime Network: USB-LAN Adaptor connected to Switch in Robot Cabinet URCap: ExternalControl 1.0.1 UR Driver: bf26ea5eca4f6c586da029abc91774481c470d28 (but I had the same problem with a version some weeks older)

Details

I analyzed the connectivity in Wireshark. It looks like the robot is stopping to send data and triggers a reset shortly after. RTDE (port 30004) is running normally and does not seem to be influenced. In the following Log 192.168.10.30 is the laptop and 192.168.10.11 is the robot:

No.     Source                Destination           Protocol Length Time           Info

Normal communication 
 115483 192.168.10.30         192.168.10.11         TCP      98     173.024550873  50001 → 43187 [PSH, ACK] 
 115484 192.168.10.11         192.168.10.30         TCP      450    173.032448174  30004 → 55408 [PSH, ACK] 
 115485 192.168.10.11         192.168.10.30         TCP      68     173.032523016  43187 → 50001 [PSH, ACK] 
 115486 192.168.10.30         192.168.10.11         TCP      66     173.040315475  55408 → 30004 [ACK] 

No more answers from robot, RTDE works normally
 115487 192.168.10.11         192.168.10.30         TCP      450    173.040398206  30004 → 55408 [PSH, ACK] 
 115488 192.168.10.30         192.168.10.11         TCP      66     173.047722898  55408 → 30004 [ACK] 
 115489 192.168.10.30         192.168.10.11         TCP      98     173.048168851  50001 → 43187 [PSH, ACK] 
 115490 192.168.10.30         192.168.10.11         TCP      98     173.048344938  50001 → 43187 [PSH, ACK] 

...

After ca. 0.04s acknowledge from robot but no new data
 115506 192.168.10.11         192.168.10.30         TCP      66     173.080883745  43187 → 50001 [ACK] 
...
 115522 192.168.10.11         192.168.10.30         TCP      66     173.121450032  43187 → 50001 [ACK] 

...
Reconnect 
 115524 192.168.10.11         192.168.10.30         TCP      66     173.128585759  43187 → 50001 [RST, ACK] 
 115525 192.168.10.11         192.168.10.30         TCP      74     173.128627142  43560 → 50001 [SYN] 
 115526 192.168.10.30         192.168.10.11         TCP      74     173.128670919  50001 → 43560 [SYN, ACK] 
...
 115528 192.168.10.11         192.168.10.30         TCP      66     173.128988553  43560 → 50001 [ACK] 
 115529 192.168.10.30         192.168.10.11         TCP      66     173.129014930  50001 → 43560 [RST, ACK] 
...
 115533 192.168.10.11         192.168.10.30         TCP      74     173.144398006  43561 → 50001 [SYN] 
 115534 192.168.10.30         192.168.10.11         TCP      74     173.144413118  50001 → 43561 [SYN, ACK] 
...
 115536 192.168.10.11         192.168.10.30         TCP      66     173.144840463  43561 → 50001 [ACK] 
...
 115539 192.168.10.30         192.168.10.11         TCP      98     173.153537823  50001 → 43561 [PSH, ACK] 
 115540 192.168.10.11         192.168.10.30         TCP      66     173.153921119  43561 → 50001 [ACK] 
...
 115542 192.168.10.11         192.168.10.30         TCP      68     173.160577332  43561 → 50001 [PSH, ACK] 
 115543 192.168.10.30         192.168.10.11         TCP      66     173.160611398  50001 → 43561 [ACK] 
...

Operation continues with different port
 115545 192.168.10.30         192.168.10.11         TCP      98     173.163634497  50001 → 43561 [PSH, ACK] 
...
 115547 192.168.10.11         192.168.10.30         TCP      68     173.168558669  43561 → 50001 [PSH, ACK] 
gavanderhoorn commented 4 years ago

Network: USB-LAN Adaptor [..]

please try and test things with a direct cable connection to the physical ethernet port in the controller cabinet.

If that works, I would heavily suspect the USB adapter.

connected to Switch in Robot Cabinet

do you mean the USB hub in the controller?

fmauch commented 4 years ago

At first, I would also suspect the USB adapter...

The log entry "Stopping communication and control" is written when the URScript running on the controller stops. This only happens, when no keepalive signal is received from the ROS pc for a certain amount of time (Currently, this is set to 20 ms here).

So, this could in fact be either a networking issue or an issue with too much load on the ROS machine.

stefanz-tri commented 4 years ago

Thanks for your feedback so far!

This morning I tested with a different laptop and an internal LAN adaptor and was running into the same issues. Also I tested different cables with no changes. Probably this hardware was also not really up to the task but it is all I have available right now (in the future the system will run on more appropriate hardware ;-) )

Having to much load is quite possible as I am doing a lot of RViz visualizations on the same machine. I will try to play around with the keepalive timeout and check if I can observe an effect. The script in the recources linked above is the one sent and executed on the UR right? So the changes in that file should have an effect immediately without rebuilding the workspace? Do I need to restart extcontrol?

fmauch commented 4 years ago

Yes, the linked script is being executed directly on the robot. It is sent to the robot, each time the program enters the "External Control" program node. In order to reload the altered file, simply restart the driver, as it is read initially.

A high load on the ROS machine can indeed be a huge problem, especially, when no realtime kernel is in place as recommended.

gavanderhoorn commented 4 years ago

This morning I tested with a different laptop and an internal LAN adaptor and was running into the same issues. Also I tested different cables with no changes. Probably this hardware was also not really up to the task but it is all I have available right now (in the future the system will run on more appropriate hardware ;-) )

A test to see whether the load is the cause here could be to run just the driver, and then try to use rqt_joint_trajectory_controller to move the robot.

If that does work, it would be a good indication that the load could have something to do with it.

If it is load, instead of changing the timeout values, you could also consider increasing the process priority of the driver node. This would seem to be a good idea anyway, as typically responsiveness (or liveness almost) of drivers would seem important to me.

You could try to see whether using a launch-prefix on the driver node to renice it works. See How to launch nodes with realtime priority on ROS Answers for an example.

stefanz-tri commented 4 years ago

I tested some of your ideas:

So far I did not get around to test the driver with increased process priority but will hopefully be able to do it tomorrow.

It seems like everything indeed boils down to the load on the PC (which is considerable using RViz). The current plan for the hardware setup is to have one machine only for the core functionality (driver, planner) and activate the visualization on a different device on demand.

Thanks again for your input.

gavanderhoorn commented 4 years ago

@fmauch: the driver currently tries to enable an RT priority at start, but if it fails, does not try to then request the next-highest priority iirc.

It may be worth considering adding that behaviour. Or at least make it possible to configure the driver to do that.