DavidPL1 / assembly_example

Example code for interaction with our assembly simulation ICRA 2023 challenge
7 stars 1 forks source link

'effort_joint_trajectory_controller' randomly fails to load when resetting scene several times in a row #15

Closed abhishek47kashyap closed 1 year ago

abhishek47kashyap commented 1 year ago

I've found effort_joint_trajectory_controller failing to load on some occasions when the service /reload_with_new_scene is called several times in a row. This very well might be related to what's described in a previous comment:

sometimes tearing down the controller manager causes the simulation to hang. I'm not sure where this comes from (might be related to not destroying the controller manager interface during the call to the reload service),

And every time this failure occurs, this collision log shows up:

[ WARN] [1680870631.261040172, 0.698000000] [ros.assembly_score_plugin.assembly_score_plugin] [/mujoco_server]: Assembly failed due to collision with the robot or an object on the floor!
Full logs ```[ INFO] [1680870630.526342784, 5.673000000] [ros.assembly_score_plugin.assembly_score_plugin] [/mujoco_server]: Start triggered! [DEBUG] [1680870630.529058187, 5.673000000] [ros.mujoco_ros] [/mujoco_server]: PauseCB called with: 0 [DEBUG] [1680870630.562749807, 5.673000000] [ros.mujoco_ros.mujoco_env] [/mujoco_server]: Applied model to context [ INFO] [1680870630.563941100, 0.001000000] [ros.mujoco_ros_control.mujoco_ros_control] [/mujoco_server]: Resetting mujoco_ros_control due to time reset [DEBUG] [1680870630.564009187, 0.001000000] [ros.mujoco_ros_control.mujoco_ros_control] [/mujoco_server]: sim time is 0.001000000 while last time was 5.673000000 [ WARN] [1680870630.564421033, 0.001000000]: Moved backwards in time (probably because ROS clock was reset), re-publishing joint transforms! [ WARN] [1680870630.601552212, 0.038000000]: Detected jump back in time of 5.635s. Clearing TF buffer. [ WARN] [1680870630.601928514, 0.039000000] [ros.tf2_ros] [/mujoco_server]: Detected jump back in time of 5.634s. Clearing TF buffer. [ INFO] [1680870630.651657236, 0.089000000] [ros.franka_mujoco.FrankaGripperMujoco] [/mujoco_server]: Successfully initialized Franka Gripper Controller for joints 'panda_finger_joint1' and 'panda_finger_joint2' [ WARN] [1680870631.261040172, 0.698000000] [ros.assembly_score_plugin.assembly_score_plugin] [/mujoco_server]: Assembly failed due to collision with the robot or an object on the floor! [INFO] [1680870631.271580]: Assembly failed [INFO] [1680870631.271889]: (-1.0, -1.0, -1.0, -1.0, -1.0) WARNING: Nan, Inf or huge value in QACC at DOF 15. The simulation is unstable. Time = 0.7120. ```

On the terminal where I'm running my own node, these are the logs after which it freezes:

Loaded 'franka_gripper'
Loaded 'franka_state_controller'

These are when controller loading goes through successfully:

Loaded 'franka_gripper'
Loaded 'franka_state_controller'
Loaded 'effort_joint_trajectory_controller'
Started ['franka_gripper', 'franka_state_controller', 'effort_joint_trajectory_controller'] successfully

I've not been able to predict when might this failure occur. It doesn't occur after a fixed number of reloads either. The highest number of back-to-back reloads I could get is 30, and I believe I just got lucky because other times the first few reloads would be successful and then randomly effort_joint_trajectory_controller fails to load.

As mentioned in #6 my objective is to collect data and create a dataset.

What's the recommended workaround for effort_joint_trajectory_controller failing to load on scene resets?

DavidPL1 commented 1 year ago

I've found effort_joint_trajectory_controller failing to load on some occasions when the service /reload_with_new_scene is called several times in a row. This very well might be related to what's described in a previous comment:

sometimes tearing down the controller manager causes the simulation to hang. I'm not sure where this comes from (might be related to not destroying the controller manager interface during the call to the reload service),

Are you sure the controller is not loaded? The timestamps in your log suggest that the reload is successful. If the above mentioned bug was the cause, the simulation would hang before the time is reset to 0.

[ WARN] [1680870631.261040172, 0.698000000] [ros.assembly_score_plugin.assembly_score_plugin] [/mujoco_server]: Assembly failed due to collision with the robot or an object on the floor! as well as WARNING: Nan, Inf or huge value in QACC at DOF 15. The simulation is unstable. Time = 0.7120. suggest that some collision occurred and very high speed and/or accelerations are applied.

Do the failure cases by any chance only happen if reloads are triggered after the align screw into nut goal is met?

abhishek47kashyap commented 1 year ago

Sorry, coming back to this after a while.

Are you sure the controller is not loaded? The timestamps in your log suggest that the reload is successful. If the above mentioned bug was the cause, the simulation would hang before the time is reset to 0.

I see. It's possible I'm mistaken. Since I noticed the logs not going past

Loaded 'franka_gripper'
Loaded 'franka_state_controller'

on failures I thought it's failing to load effort_joint_trajectory_controller which happens to be the next log when scene reloads successfully.

Do the failure cases by any chance only happen if reloads are triggered after the align screw into nut goal is met?

I was reloading over and over again to collect RGBD data and the robot was not at all being moved. So no sub-tasks were being completed.

DavidPL1 commented 1 year ago

Alright, I'll try to replicate the error by just reloading. How far are your reloads usually apart?

abhishek47kashyap commented 1 year ago

Reloads are usually 0.3 - 0.4 seconds apart. That's how long it takes to collect the data I want out of a scene without ever moving the robot.

Another time this problem occurred, the last log was Loaded 'franka_gripper' but there was no Loaded 'franka_state_controller'. So whatever the cause of this issue is, it's seemingly not related to effort_joint_trajectory_controller.

Regarding a previous statement I had made:

highest number of back-to-back reloads I could get is 30

I could do 50 reloads in a row (and very likely could keep going), maybe something changed related to this problem in the newer versions .. not sure ..


Also, seems like whenever this problem occurs, there is a jump back in time as the controllers are reloading. Here are some logs from the terminal where I run my data-collection script:

Example logs 1 ``` [INFO] [1682248706.423287, 0.469000]: Reloading scene by calling /reload_with_new_scene.. [INFO] [1682248706.434905, 0.469000]: Scene reload successful [INFO] [1682248706.437081, 0.469000]: Waiting for service /user_ready [ WARN] [1682248709.544328116]: Detected jump back in time of 0.469s. Clearing TF buffer. [WARN] [1682248709.544812, 0.000000]: Detected jump back in time of 0.463000s. Clearing TF buffer. [INFO] [1682248709.764885, 0.000000]: .. found, firing request [INFO] [1682248709.771001, 0.000000]: good to go! [WARN] [1682248709.775255, 0.000000]: Reloading controllers .. Loaded 'franka_gripper' [ WARN] [1682248710.200324209, 0.037000000]: Detected jump back in time of 0.03s. Clearing TF buffer. [WARN] [1682248710.207731, 0.032000]: Detected jump back in time of 0.035000s. Clearing TF buffer. ```
Example logs 2 ``` [INFO] [1682250690.035657]: Reloading scene by calling /reload_with_new_scene.. [INFO] [1682250690.044197]: Scene reload successful [INFO] [1682250690.046120]: Waiting for service /user_ready [ WARN] [1682250693.055056185]: Detected jump back in time of 0.536s. Clearing TF buffer. [INFO] [1682250693.076581]: .. found, firing request [INFO] [1682250693.087494]: good to go! [WARN] [1682250693.094033]: Reloading controllers .. Loaded 'franka_gripper' Loaded 'franka_state_controller' [ WARN] [1682250693.592703780, 0.037000000]: Detected jump back in time of 0.03s. Clearing TF buffer. [ WARN] [1682250693.657635675, 0.040000000]: Detected jump back in time of 0.031s. Clearing TF buffer. ```

For comparison, here are logs when there are is no jump back in time as controllers are reloaded:

Example logs ``` [INFO] [1682252701.676782, 6.901000]: Reloading scene by calling /reload_with_new_scene.. [INFO] [1682252701.684548, 6.901000]: Scene reload successful [INFO] [1682252701.686489, 6.901000]: Waiting for service /user_ready Exception in thread Thread-55: Traceback (most recent call last): File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/opt/ros/noetic/lib/python3/dist-packages/rospy/timer.py", line 228, in run r.sleep() File "/opt/ros/noetic/lib/python3/dist-packages/rospy/timer.py", line 103, in sleep sleep(self._remaining(curr_time)) File "/opt/ros/noetic/lib/python3/dist-packages/rospy/timer.py", line 163, in sleep raise rospy.exceptions.ROSTimeMovedBackwardsException(time_jump) rospy.exceptions.ROSTimeMovedBackwardsException: ROS time moved backwards [ WARN] [1682252704.928957681]: Detected jump back in time of 6.901s. Clearing TF buffer. [WARN] [1682252704.929457, 0.000000]: Detected jump back in time of 6.893000s. Clearing TF buffer. [INFO] [1682252705.018850, 0.000000]: .. found, firing request [INFO] [1682252705.040416, 0.000000]: good to go! [WARN] [1682252705.046488, 0.000000]: Reloading controllers .. Loaded 'franka_gripper' Loaded 'franka_state_controller' Loaded 'effort_joint_trajectory_controller' Started ['franka_gripper', 'franka_state_controller', 'effort_joint_trajectory_controller'] successfully [INFO] [1682252706.154190, 0.201000]: reload_controllers() succeeded ```

Later edit: the traceback isn't always there.

Hope the information helps.