UniversalRobots / Universal_Robots_ROS_Driver

Universal Robots ROS driver supporting CB3 and e-Series
Apache License 2.0

Main program on UR robot using ExternalControl URCap stops executing #196

Closed guiolpei closed 1 year ago

guiolpei commented 4 years ago

Summary

Every once in a while, communication between the UR robot and the robot driver is closed. The main program execution on the robot side is stopped (without any notification) and it must be run again (press "Play" button on robot panel).

The program defined in the robot contains only the ExternalControl URCap.

On the PC side, the command executed is the following:

roslaunch ur_robot_driver ur10e_bringup.launch robot_ip:=XXX.XXX.XXX.XXX kinematics_config:=/Y/ur10e-Z.yaml

Versions

Issue details

We are running a non-realtime Linux kernel, which might be related. We have a single PC connected to the robot via Ethernet through a switch. The Linux box is running RViz and processes images obtained from other nodes in the network.

Will update with logs and messages the next time it happens.

Related issues

#182

gavanderhoorn commented 4 years ago

Since you mention #182 yourself: have you tried increasing the priority of the driver process and has that had any effect?

guiolpei commented 4 years ago

Can you explain exactly what the correct way to do this is?

gavanderhoorn commented 4 years ago

Does the link to the ROS Answers Q&A in https://github.com/UniversalRobots/Universal_Robots_ROS_Driver/issues/182#issuecomment-631329522 provide sufficient detail?

guiolpei commented 4 years ago

Can this be done from the command line when executing the roslaunch command?

roslaunch ur_robot_driver ur10e_bringup.launch robot_ip:=XXX.XXX.XXX.XXX kinematics_config:=/Y/ur10e-Z.yaml

gavanderhoorn commented 4 years ago

No, you'll need to change the ur_control.launch file. Specifically, this line:

https://github.com/UniversalRobots/Universal_Robots_ROS_Driver/blob/638e92f543755b25b9cea0dfa86d21e05a5766ae/ur_robot_driver/launch/ur_control.launch#L32

and add the launch-prefix to the node element.


Edit: oh wait, I see that launch-prefix is actually an arg of that .launch file. @fmauch: can we use that to pass the required nice command?

Edit2: oh wait again: that is already used for the debug arg. So that won't work.

@guiolpei: you'll want to remove what is there in launch-prefix currently and use the nice command described in the ROS Answers Q&A.

But this is all just a test.

fmauch commented 4 years ago

However, it would be very interesting whether this test leads to an improvement.

gavanderhoorn commented 4 years ago

Ok, sorry, I was too quick.

@guiolpei: as the ROS Answers Q&A also mentions, you'd need to run nice with sudo to give a process a higher priority.

So unless you've enabled passwordless sudo for your user and/or that command (ie: nice) this won't work.

I'm also not sure whether process priorities are inherited by child processes spawned by roslaunch, so running the entire .launch file with a higher priority might also not work.


Edit: a quick Google for "nice without sudo" directs me to "How can I allow a user to prioritize a process to negative niceness?". That should not be too difficult to configure (essentially editing the mentioned configuration file in /etc), and would allow nice to set higher priorities without sudo for a specific user only.

guiolpei commented 4 years ago

Thank you both for your answers.

I have modified /etc/security/limits.conf to allow my user to assign higher priorities, and added a launch-prefix of nice -n -20 in the ur_control.launch file.

Now htop shows a value of PRI=0 and NI=-20 for the ur_driver node.

I will check operation using this configuration to see if the problem persists.
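As a cross-check of the htop reading, the nice value can also be queried programmatically. The following is only a minimal sketch, assuming a Linux machine; the helper names `niceness` and `nice_floor_from_rlimit` are made up for illustration and are not part of the driver. The second helper reads the RLIMIT_NICE resource limit, which is what the "nice" item in /etc/security/limits.conf controls.

```python
import os
import resource
from typing import Optional


def niceness(pid: int = 0) -> int:
    """Return the nice value of a process (pid 0 means the calling process)."""
    return os.getpriority(os.PRIO_PROCESS, pid)


def nice_floor_from_rlimit() -> Optional[int]:
    """Lowest nice value an unprivileged process may request, or None if unknown.

    Linux stores the limit as (20 - nice), so a soft limit of 40 permits -20
    and the usual default of 0 permits no lowering at all.
    """
    limit = getattr(resource, "RLIMIT_NICE", None)  # Linux-specific constant
    if limit is None:
        return None
    soft, _hard = resource.getrlimit(limit)
    if soft == resource.RLIM_INFINITY:
        return -20
    return max(-20, min(19, 20 - soft))


if __name__ == "__main__":
    print("nice value of this process:", niceness())
    print("lowest nice value allowed by RLIMIT_NICE:", nice_floor_from_rlimit())
```

Calling `niceness(pid)` with the PID that htop shows for the driver node should return -20 if the launch-prefix took effect.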

gavanderhoorn commented 4 years ago

@fmauch: if this works, we could consider making the hardware interface node request the next highest priority if configuring the RT priority fails. Renicing is nice, but not the same as a proper priority.

It would probably be a good idea to do that anyway.

@guiolpei: I've also updated the ROS Answers Q&A with this information.

fmauch commented 4 years ago

Yes, I agree.

guiolpei commented 4 years ago

Still timing out:

[ INFO] [1592308489.822836696]: Robot requested program
[ INFO] [1592308489.822982410]: Sent program to robot
[ INFO] [1592308489.993261828]: Robot ready to receive control commands.
[ INFO] [1592308609.428015318]: Connection to robot dropped, waiting for new connection.
[ERROR] [1592308610.146706968]: Can't accept new action goals. Controller is not running.

Maybe roscore should also run with higher priority?
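To see how long the connection survived before dropping, the timestamps in the log above can be subtracted. This is a rough sketch, not a driver tool: the regular expression only assumes the "[LEVEL] [seconds.nanoseconds]: message" layout seen in the pasted output, and the function names are hypothetical.

```python
import re
from decimal import Decimal
from typing import List, Optional, Tuple

# Matches lines such as "[ INFO] [1592308489.822836696]: Robot requested program".
LOG_LINE = re.compile(r"\[\s*(\w+)\]\s*\[(\d+\.\d+)\]:\s*(.*)")


def parse(lines: List[str]) -> List[Tuple[str, Decimal, str]]:
    """Split each log line into (level, timestamp, message); skip non-matching lines."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            level, stamp, message = match.groups()
            # Decimal keeps the nanosecond digits that a float would round away.
            entries.append((level, Decimal(stamp), message))
    return entries


def seconds_until_drop(lines: List[str]) -> Optional[Decimal]:
    """Time between the last 'ready' message and the following connection drop."""
    ready = None
    for _level, stamp, message in parse(lines):
        if "ready to receive control commands" in message:
            ready = stamp
        elif "Connection to robot dropped" in message and ready is not None:
            return stamp - ready
    return None


LOG = [
    "[ INFO] [1592308489.822836696]: Robot requested program",
    "[ INFO] [1592308489.822982410]: Sent program to robot",
    "[ INFO] [1592308489.993261828]: Robot ready to receive control commands.",
    "[ INFO] [1592308609.428015318]: Connection to robot dropped, waiting for new connection.",
    "[ERROR] [1592308610.146706968]: Can't accept new action goals. Controller is not running.",
]

if __name__ == "__main__":
    print("seconds between ready and drop:", seconds_until_drop(LOG))
```

For the lines pasted here, the connection lasted roughly two minutes between "Robot ready" and "Connection to robot dropped".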

gavanderhoorn commented 4 years ago

roscore has nothing to do with this.

fmauch commented 4 years ago

The roscore is not part of the communication between the robot and the driver.

Another way to find the source of package drop would be to

guiolpei commented 4 years ago

At the moment, we can't run a dedicated PC only for the driver and direct connection is not possible because the PC has to communicate with other nodes in the network.

Is there a way to control the value of this communication timeout (if it is indeed a timeout)?

gavanderhoorn commented 4 years ago

> At the moment, we can't run a dedicated PC only for the driver and direct connection is not possible because the PC has to communicate with other nodes in the network.

I would say what @fmauch suggests are ways to diagnose what the cause is.

Not suggestions for system configuration in a final/production environment.

fmauch commented 4 years ago

I meant temporarily. If the timeout always occurs rather quickly (after 5 minutes of runtime or so), you could just let the driver run for 15 minutes without the rest of your application. If you encounter the problem there as well, it's likely a network issue (though in the other case it could also be a network issue). If I understand correctly, you have different PCs taking part in the application? Are they going over the same switch? This could be even worse than the control PC's load.

guiolpei commented 4 years ago

There are several PCs in the application, all over the same switch. Maybe the best thing would be connecting the robot to the control PC directly with a dedicated network card.

We will continue testing and I will report back with any news.

Thank you both for your help!

guiolpei commented 4 years ago

After some testing, it seems it is not a network issue. It is due to high load on the control PC: reducing the number of tasks on it significantly reduces the frequency of dropped connections.

fmauch commented 4 years ago

Because you asked earlier: you could increase the number of missed packages here, but I would not recommend it. Robot motions will change, because linear extrapolation takes place in those cases. Increasing this will only hide the resource problem on your control PC, at the cost of non-deterministic robot motions.
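To illustrate why extrapolated cycles change the motion, here is a generic sketch of linear extrapolation over joint targets. It is not the driver's actual script code; the function names and the two-joint example are assumptions made purely for illustration.

```python
from typing import List


def extrapolate(previous: List[float], latest: List[float]) -> List[float]:
    """Continue each joint by its last observed step: q_next = q + (q - q_prev)."""
    return [q + (q - q_prev) for q_prev, q in zip(previous, latest)]


def fill_missed_cycles(previous: List[float], latest: List[float],
                       missed: int) -> List[List[float]]:
    """Targets that would be used for `missed` consecutive cycles without a command."""
    targets = []
    for _ in range(missed):
        guess = extrapolate(previous, latest)
        targets.append(guess)
        # The guess becomes the newest sample for the next missed cycle.
        previous, latest = latest, guess
    return targets


if __name__ == "__main__":
    # Joint 0 was moving by 0.5 per cycle, joint 1 was standing still.
    print(fill_missed_cycles([0.0, 1.0], [0.5, 1.0], missed=2))
```

Each missed cycle simply repeats the last step, so if the commanded trajectory was about to slow down or change direction, the robot keeps going at the old rate until a real command arrives. The more misses are tolerated, the further the executed motion can drift from the planned one.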

gavanderhoorn commented 4 years ago

> It is due to high load on the control PC

So switching to an RT kernel would fix this (as the driver requests a sufficiently high priority, other processes should not be able to interfere any more), but it might introduce other issues (some drivers, for instance, don't work with RT kernels).

If it's really load, I would expect an increased priority for the driver process to help. I'm not sure I understand how the approach with nice doesn't work. Unless you have other processes which have an equal or higher priority.

fmessmer commented 4 years ago

I did not really read through the whole discussion (yet), but the problem seems similar to what we experienced in a similar setup.

Our solution (for now) was to increase the timeout in https://github.com/UniversalRobots/Universal_Robots_ROS_Driver/blob/master/ur_robot_driver/resources/ros_control.urscript#L107 from 0.02 to 0.04

I don't know all the details about it, but just want to mention it here...


Should have read the threads before posting things... :facepalm: I see this suggestion has been proposed in https://github.com/UniversalRobots/Universal_Robots_ROS_Driver/issues/182#issuecomment-631237843 already

guiolpei commented 4 years ago

@fmessmer Thank you for your comment, I have not actually tried this.

I don't know if this can mitigate the problem even with a high load on the control PC. Maybe @fmauch or @gavanderhoorn could explain its implications and I could try it out to see if it makes a difference.

fmauch commented 4 years ago

Basically, what I suggested and what @fmessmer suggested have the same implications.

While my change increases the number of allowed timeout reads, changing the timeout increases the maximum time a cycle could take.
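The two knobs can be compared with a little arithmetic. This is a simplified model only: it assumes the robot program stops after a fixed number of consecutive read timeouts, and the 2 ms control cycle and the count of 2 allowed timeouts in the example are illustrative placeholders, not values taken from the driver. Only the 0.02 and 0.04 timeouts come from this thread.

```python
def cycles_covered(read_timeout_s: float, cycle_s: float) -> int:
    """Number of control cycles that fit into one read timeout."""
    return round(read_timeout_s / cycle_s)


def worst_case_silence(read_timeout_s: float, allowed_timeouts: int) -> float:
    """Longest gap without a command before the program would stop (simplified model)."""
    return read_timeout_s * allowed_timeouts


if __name__ == "__main__":
    cycle = 0.002  # assumed control cycle in seconds, for illustration only
    for timeout in (0.02, 0.04):  # the two timeout values mentioned in this thread
        print(timeout, "s timeout spans", cycles_covered(timeout, cycle), "cycles;",
              "silence tolerated with 2 allowed timeouts:",
              worst_case_silence(timeout, 2), "s")
```

Under this model, doubling the timeout and doubling the number of allowed timeouts tolerate the same total silence, which is why both suggestions have the same implications for a loaded control PC.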

github-actions[bot] commented 1 year ago

This issue has not been updated for a long time. If no further updates are added, this will be closed automatically. Comment on the issue to prevent automatic closing.

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity. Feel free to comment or reopen if this is still relevant.