Polymetis server fails after two hours

Type of Issue

Select the type of issue:

[x] Bug report (to report a bug)
[ ] Feature request (to request an additional feature)
[ ] Tracker (I am just using this as a tracker)
[ ] Refactor request
[ ] Documentation Ask

Description

Hi all,

I'm currently running polymetis with a Franka robot and the Robotiq gripper. Everything is working as intended, except for when we run polymetis for longer periods of time (~2 hrs). We'd like to be able to run it continuously for many hours or even days as we are testing reinforcement learning algorithms

The issue seems to occur within the server / robot client with the following command launch_robot.py robot_client=franka_hardware robot_client.executable_cfg.robot_ip=172.16.0.2 And everything will work as intended for two hours but eventually crash

Current Behavior

When my training script crashes, I see the following output from the polymetis server

[2022-10-06 12:19:12.650] [error] Robot is unable to be controlled: libfranka: Move command aborted: motion aborted by reflex! ["cartesian_reflex"]
control_command_success_rate: 0.9
[2022-10-06 12:19:12.650] [warning] Performing automatic error recovery. This calls franka::Robot::automaticErrorRecovery, which is equivalent to pressing and releasing the external activation device.
[2022-10-06 12:19:12.650] [warning] Automatic error recovery attempt 1/3...
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767                              
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767                                                                                                          
Get called, buffer size 73764767                                                                
Get called, buffer size 73764767                                                                    
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
[2022-10-06 12:19:13.652] [warning] Robot operation recovered.
Get called, buffer size 73764767
Get called, buffer size 73764767
Get called, buffer size 73764767
[2022-10-06 12:19:13.655] [warning] Interrupted control update greater than threshold of 1000000000 ns. Reverting to default controller...
[2022-10-06 12:19:13.655] [info] Terminating custom controller, switching to default controller.
[2022-10-06 12:19:13.659] [warning] Tried to perform a controller update with no controller running.

Expected Behavior

Not entirely sure what's to be expected, but when the experiment is running well, and the server is able to start the controller / switch to the default controller. For example, all I had to do with restart my training script and the server (with the same log as above) was able to continue working as follows:

[2022-10-06 12:19:13.659] [warning] Tried to perform a controller update with no controller running.
Get called, buffer size 77448624
[2022-10-06 13:20:47.203] [info] Loaded new controller.
Setting Torch policy to terminated.
[2022-10-06 13:20:49.181] [info] Terminating custom controller, switching to default controller.
Get called, buffer size 77450816
Get called, buffer size 77450821
Get called, buffer size 77450840

Steps to reproduce

Not sure how to reproduce this other than maybe running a random policy for ~2hrs?

In the libfranka error message, cartesian_reflex refers to the robot end-effector exerting more force than the safety limits allow for. Common causes are the gripper bumping too hard into objects, or the user commanding the arm to move too fast. I would need some more context of your experiment to know exactly what's happening, but in large-scale data collection processes this is quite hard to prevent, especially if the arm is interacting with objects.

A common workaround is to use is_running_policy every time before sending pose updates. This checks if the robot has terminated the controller due to safety reasons, and if so you can recover by resetting and calling start_cartesian_impedance or start_joint_impedance again.

Hi,

Thanks for responding! I actually found a way around this and it was hidden within polymetis the whole time.

If anyone else runs into this issue instead of the launch robot command above, you should run the script at ~/fairo/polymetis/polymetis/python/scripts/persis_server.sh. This will make it so the server will be restarted if it gets killed, I ran this and was able to get the franka running for 14+ hours without issues!

Hello,

Sorry to re-open this issue, but it turns out I was wrong. I switched to a new task where the robot must pick up a cube through reinforcement learning (before we were doing goal reaching so no need for contact), and the error came up again.

Maybe for more context, the error gets thrown in the following function in my library. What's strange is that we are checking if the controller is not running and then starting impedance control but I'm guessing this is too naive an implementation. Do you have any thoughts on how to make this more robust, such that it never tries the update joint command without the cartesian controller running?

def update_pose(self, pos, angle):
        '''Expect [x,y,z], [yaw, pitch, roll]'''

        desired_pos = torch.Tensor(pos)
        desired_quat = torch.Tensor(euler_to_quat(angle))

        desired_qpos, success = self._ik_solver.compute(desired_pos, desired_quat)
        feasible_pos, feasible_quat = self._robot.robot_model.forward_kinematics(desired_qpos)
        feasible_pos, feasible_angle = feasible_pos.numpy(), quat_to_euler(feasible_quat.numpy())

        if not self._robot.is_running_policy():
            self._robot.start_cartesian_impedance()
        self._robot.update_desired_joint_positions(desired_qpos)

        return feasible_pos, feasible_angle

To clarify, the issue you were running into happens as follows:

Safety limits enforced by the hardware is triggered. (libfranka: Move command aborted: motion aborted by reflex! ["cartesian_reflex"])
Polymetis restarts robot control & terminates original controller as a safety mechanism. ([info] Terminating custom controller, switching to default controller.)
The next time you try to update the controller, there is no controller running and thus there is nothing to update. ([warning] Tried to perform a controller update with no controller running.)

Theoretically you should not be updating without a controller running anymore, since you are now always checking to see if a controller is running. Can you confirm whether you are still seeing the error message in 3.?

The source of the issue, which I explained in the previous comment, cannot be removed programmatically since it has to do with the limits of the hardware itself. You can try to make your policy less aggressive to prevent the issue, but I would advise to wrap another loop around your script to reset the robot and maybe throw away the data collected in the current episode as a recovery mechanism to ensure that your experiment continues running.

So I checked the server and I did get the error message in 3. I reran a training loop while printing is_running_policy() and the issue is that sometimes it returns true when the impedance controller is not running, which causes a crash. I believe this is a bug but I'm not sure how to fix it.

For anyone curious, my workaround is that I added a try-except clause to catch this failure mode, and then restart the server. Now I can run the script for many hours without it crashing due to the error above (although it's not ideal)

try:
            self._robot.update_desired_joint_positions(desired_qpos)
except:
            print('impedance controller failed, restarting it')
            self._controller_restart += 1
            print(f'controller reset tracker: {self._controller_restart}\n')
            self._robot.start_cartesian_impedance()
            self._robot.update_desired_joint_positions(desired_qpos)

facebookresearch / fairo