DavidPL1 / assembly_example

Example code for interacting with our assembly simulation (ICRA 2023 challenge)

mujoco_ros_control getting reset while executing a plan causes execution to fail #3

Closed abhishek47kashyap closed 1 year ago

abhishek47kashyap commented 1 year ago

Successfully computed plans sometimes result in failed executions: execution begins but stops short of the goal with these logs:

[ INFO] [1677603355.509089828, 14.138000000]: Execution request received
[ INFO] [1677603361.961577291, 15.421000000] [ros.mujoco_ros_control.mujoco_ros_control] [/mujoco_server]: Resetting mujoco_ros_control due to time reset
[ INFO] [1677603361.964809774, 15.423000000]: Controller 'effort_joint_trajectory_controller' successfully finished
[ WARN] [1677603361.965065657, 15.424000000]: Controller handle effort_joint_trajectory_controller reports status PREEMPTED
[ INFO] [1677603361.965187640, 15.424000000]: Completed trajectory execution with status PREEMPTED ...
[ INFO] [1677603361.965381800, 15.424000000]: Execution completed: PREEMPTED

When execution does proceed to completion, the logs are:

[ INFO] [1677603379.451140313, 18.963000000]: Execution request received
[ INFO] [1677603382.033941919, 19.475000000]: Controller 'effort_joint_trajectory_controller' successfully finished
[ INFO] [1677603382.314629639, 19.503000000]: Completed trajectory execution with status SUCCEEDED ...
[ INFO] [1677603382.314818036, 19.503000000]: Execution completed: SUCCEEDED

So mujoco_ros_control getting reset before execution completes is responsible for the trajectory execution status being reported as PREEMPTED, which makes sense. Why on some occasions does mujoco_ros_control get reset while an execution is underway? What is the "time reset" as stated in the log and what causes it?

I've got eval_mode set to false (and difficulty_level as 1, in case relevant but I'd be surprised if it is). I'm trying to move to a pose from the assembly_screw example:

    pose_goal = move_group.get_current_pose().pose    # robot EE pose when s4dx/assembly_server starts up
    pose_goal.position.x = 0.5
    pose_goal.position.y = -0.2
    pose_goal.position.z = 0.6

Any suggestions on how to overcome this, or pointers to possible mistakes on my part, would be much appreciated.

DavidPL1 commented 1 year ago

Thanks for reporting this issue.

What is the "time reset" as stated in the log and what causes it?

Our simulation can be reset by the user, which causes the ROS time to be reset to 0. This requires some special handling, e.g. clearing message buffers, because otherwise the involved ROS components would ignore new messages with timestamps they have already seen. The message in your log is printed when mujoco_ros_control detects that the current ROS time is older than the ROS time of its last control update and concludes that a time reset must have happened. This leads to preemption of the currently running trajectory, as you noted.
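
To illustrate the check conceptually (the actual implementation lives in the C++ mujoco_ros_control plugin; the names below are illustrative only, not the real API):

    # Conceptual sketch of the time-reset detection, not the real plugin code.
    import rospy

    class ControlLoopSketch:
        def __init__(self):
            self.last_update_time = rospy.Time(0)

        def update(self, sim_time_now):
            # Current ROS time older than the time of the last control update
            # -> a time reset must have happened.
            if sim_time_now < self.last_update_time:
                self.handle_time_reset()
            self.last_update_time = sim_time_now
            # ... regular controller update would run here ...

        def handle_time_reset(self):
            # Clear message buffers and preempt currently running goals.
            pass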

Why on some occasions does mujoco_ros_control get reset while an execution is underway?

This should not happen unless a simulation reset is triggered manually and ROS time is reset. In both of your logs the ROS time keeps incrementing, so no actual reset was triggered.

We'll have to reproduce the issue on our side and track down the cause. You do seem to have a very low real-time factor of about 0.19, which might be an indicator that this issue is related to high CPU load. I can update the docker server image so you can configure a target real-time factor; you could then try setting it below your apparent upper bound of 0.19 and report whether you can still reproduce the error.
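
For reference, the factor can be read off the two clocks in your first log (wall time in the first bracket, simulated ROS time in the second); a quick back-of-the-envelope check with the timestamps copied from your log:

    # Rough real-time factor estimate: elapsed simulated time / elapsed wall time.
    wall_elapsed = 1677603361.961577291 - 1677603355.509089828  # seconds, wall clock
    sim_elapsed = 15.421 - 14.138                                # seconds, ROS/sim time
    print(f"approx. real-time factor: {sim_elapsed / wall_elapsed:.2f}")  # ~0.20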

I've got eval_mode set to false (and difficulty_level as 1, in case relevant but I'd be surprised if it is). I'm trying to move to a pose from the assembly_screw example

It seems the issue is rooted somewhere in mujoco_ros_control or mujoco_ros. The loaded scenario and difficulty setting should not make a difference here.

We'll keep you posted. In the meantime, if you are able to pinpoint more details about when this error happens, please report them in this issue.

abhishek47kashyap commented 1 year ago

.. a very low real-time factor of 0.19. This might be an indicator that this issue is related to high CPU load.

This helped, thanks! I might have gone overboard with the number of visualizations I had active.

With only RViz open, plans got executed to completion every single time I tried. Interestingly, the real-time factor stayed about the same. My CPU is an Intel Xeon E5-2630 v3 (32) @ 3.200GHz.

So I guess my follow-up question is: other than moving to another computer with more hardware resources, is there anything else I can do to minimize or eliminate the chances of execution failing? It'd be a bummer for executions to get preempted when, say, training a policy.
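
For now my only stop-gap idea is to simply retry whenever execution does not report success; a rough sketch of what I mean (this reuses move_group_arm and plan from the script below and is not an official workaround):

    # Stop-gap idea: retry execution a few times if it fails, e.g. after a preempt.
    for attempt in range(3):
        if move_group_arm.execute(plan, wait=True):
            rospy.loginfo("Execution succeeded")
            break
        rospy.logwarn(f"Execution attempt {attempt + 1} failed, retrying..")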


To reproduce the problem I was facing:

  1. Fire up s4dx/assembly_server.
  2. Run stress --cpu N, where N is the number of cores to stress (assuming stress is installed; if not, any tool that loads the CPU will do). I set N to 20. Make sure to Ctrl + C after step 3 completes.
  3. rosrun assembly mre_arm_actuation.py, where assembly is the name of the ROS package and mre_arm_actuation.py is the script shown below:
    
    #!/usr/bin/env python3

    """ https://github.com/DavidPL1/assembly_example/blob/main/docker/assembly_example_ros/nodes/assembly_screw """

    import rospy
    from moveit_commander import MoveGroupCommander, RobotCommander
    from copy import deepcopy

    if __name__ == "__main__":
        rospy.init_node('MRE_arm_actuation')

        move_group_arm = MoveGroupCommander('panda_arm')
        robot = RobotCommander()

        waypoints = []
        pose_goal = move_group_arm.get_current_pose().pose
        rospy.loginfo(f"Currently at pos: {pose_goal.position.x:.3f}, {pose_goal.position.y:.3f}, {pose_goal.position.z:.3f}, quat [xyzw]: {pose_goal.orientation.x:.6f}, {pose_goal.orientation.y:.6f}, {pose_goal.orientation.z:.6f}, {pose_goal.orientation.w:.6f}")
        pose_goal.position.x = 0.5
        pose_goal.position.y = -0.2
        pose_goal.position.z = 0.6
        waypoints.append(deepcopy(pose_goal))

        rospy.loginfo("Waypoints:")
        for wp in waypoints:
            rospy.loginfo(f"\t pos: {wp.position.x:.3f}, {wp.position.y:.3f}, {wp.position.z:.3f}, quat [xyzw]: {wp.orientation.x:.6f}, {wp.orientation.y:.6f}, {wp.orientation.z:.6f}, {wp.orientation.w:.6f}")

        # Do some planning
        rospy.loginfo("Creating plan..")
        plan, fraction = move_group_arm.compute_cartesian_path(waypoints, 0.01, 0)
        rospy.loginfo(f"\t created plan, fraction = {fraction}")

        ref_state = robot.get_current_state()
        plan = move_group_arm.retime_trajectory(
            ref_state,
            plan,
            velocity_scaling_factor=0.4,
            acceleration_scaling_factor=0.4,
            algorithm="time_optimal_trajectory_generation"
        )

        # Execute the plan
        rospy.loginfo("Executing plan..")
        success = move_group_arm.execute(plan, wait=True)
        if success:
            rospy.loginfo("\t ..executed")
        else:
            rospy.logerr("\t ..execution failed")

        print("Exit")

DavidPL1 commented 1 year ago

As explained in ubi-agni/mujoco_ros_pkgs#20, this issue should be fixed by commit https://github.com/ubi-agni/mujoco_ros_pkgs/commit/24a61540a7e9dec1f84d45169550316a80fbd614. I'm about to build an updated docker image and will notify you once it is uploaded.

I ran some tests and am rather sure you should not experience any more preempts because of this particular error, but I'll leave the issue open until you have tested the new version and confirmed that the fix works.

DavidPL1 commented 1 year ago

Version 1.1 of the docker image has now been added to the cloud storage you have been provided access to. Please let me know whether everything works for you now or if you experience other issues. Heavy load should not trigger this error anymore, so you should also be fine with all your visualizations active.

abhishek47kashyap commented 1 year ago

I can confirm version 1.1 does not have this problem. I verified with all my visualizations from before active and with 20 out of 32 cores stressed. Thank you for providing a new image with the hotfix.