franka state message doesn't tell error recovery is needed after user stop

HumbertoE commented 1 year ago

Context

Robot: FP3
libfranka version: 0.10.0
franka_ros version: 0.10.1

Problem

We are developing a system with an FP3 that works autonomously. To handle errors produced in the robot, we have a ROS node subscribed to /franka_state_controller/franka_states reading franka_msgs.

To deal with user interruptions (pressing the button or activating a safety system), we check whether the robot_mode is 5. We then call the /franka_control/error_recovery action to trigger the error recovery.
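The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the node used in this report: `UserStopWatcher` and its `recover` callback are hypothetical names, the mode values 5 (user stop) and 2 (controller in charge) are the ones quoted in this thread, and the sketch assumes recovery is requested once the mode leaves 5 rather than while the button is still pressed. In a real node, `on_robot_mode` would be fed `msg.robot_mode` from the /franka_state_controller/franka_states subscriber, and `recover` would send a goal to /franka_control/error_recovery.

```python
# Mode values as quoted in this thread.
ROBOT_MODE_MOVE = 2
ROBOT_MODE_USER_STOPPED = 5


class UserStopWatcher:
    """Calls `recover` once each time the robot leaves the user-stopped mode."""

    def __init__(self, recover):
        self._recover = recover    # hypothetical callable that requests error recovery
        self._stop_seen = False    # True while a user stop still awaits recovery

    def on_robot_mode(self, robot_mode):
        """Feed one robot_mode sample; returns True if recovery was triggered."""
        if robot_mode == ROBOT_MODE_USER_STOPPED:
            self._stop_seen = True
            return False
        if self._stop_seen:
            self._stop_seen = False
            self._recover()
            return True
        return False


if __name__ == "__main__":
    calls = []
    watcher = UserStopWatcher(lambda: calls.append("recover"))
    for mode in [2, 5, 5, 2, 2]:
        watcher.on_robot_mode(mode)
    print(calls)
```

A watcher created after the button has already been released only ever sees mode 2, so it never triggers recovery. The state message alone does not carry the missing information.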

The problem is when the node is not running and the button is pressed and unpressed or a safety system is activated and deactivated. When the node starts running, it sees a normal state and doesn't trigger any error recovery. The robot then can't move until we trigger the error recovery.

Reading the current_errors and last_motion_errors is also not useful as even when the button is pressed, all errors are set to False.

In addition to that, it sometimes takes a very long time (more than 12 s) for the robot to be operational again after triggering an error recovery (although the user stop is gone). Also, the time it takes to recover from an error varies, which makes it very difficult to program a defined behavior.

Possible solutions

These are 2 options to deal with this problem:

Add a flag in the message indicating that the error flag is on and the error recovery is needed, or indicate this in a similar way.
Recover automatically after a user interruption as proposed in this pull request

FE-EnricoSartori commented 1 year ago

Hi @HumbertoE, we are fixing this issue by merging PR #279. I expect it to be merged during next week.

As for the error recovery taking a long time, we would need more details to reproduce the issue. Some example code that reproduces it would be best.

HumbertoE commented 1 year ago

Thank you very much @FE-EnricoSartori . This is very good news. We will test it when it's ready.

In regards to the error recovery taking a long time, if it's ok for you I will measure the time it takes to confirm an error, repeat this experiment a couple of times and make an issue to report the results together with the script of the ROS node that we use to detect and confirm errors (for example cartesian reflexes). I will just take away some stuff not necessary for the test from this script.
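One way such a measurement could be structured is sketched below. It is not the script mentioned above: `time_recovery` and `summarize` are hypothetical helpers, and the two callables are placeholders for whatever triggers the recovery and whatever reports that the robot is operational again (for example, a check on the latest robot_mode). The clock and sleep functions are injectable so the helper can be exercised without a robot.

```python
import statistics
import time


def time_recovery(trigger_recovery, is_operational, clock=time.monotonic,
                  sleep=time.sleep, poll_s=0.01, timeout_s=30.0):
    """Seconds from triggering recovery until is_operational() is True, or None on timeout."""
    start = clock()
    trigger_recovery()
    while clock() - start < timeout_s:
        if is_operational():
            return clock() - start
        sleep(poll_s)
    return None


def summarize(samples):
    """Aggregate repeated measurements; None entries count as timeouts."""
    valid = [s for s in samples if s is not None]
    return {
        "runs": len(samples),
        "timeouts": len(samples) - len(valid),
        "min_s": min(valid) if valid else None,
        "max_s": max(valid) if valid else None,
        "mean_s": statistics.mean(valid) if valid else None,
    }
```

Reporting the minimum, maximum, and number of timeouts alongside the mean makes the run-to-run variation visible, which is the part that is hard to program against.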

Maverobot commented 1 year ago

We merged the mentioned solution in PR https://github.com/frankaemika/franka_ros/pull/279. I would close this for now. If this issue still exists with the fix, feel free to reopen it.

Do you think the merged PR https://github.com/frankaemika/franka_ros/pull/279 also solves the problem of https://github.com/frankaemika/franka_ros/issues/232 for you?

HumbertoE commented 1 year ago

Ok. Is the dev branch in a stable state to test on it already or should we wait for a new release?

And regarding issue https://github.com/frankaemika/franka_ros/issues/232, I don't think so as PR https://github.com/frankaemika/franka_ros/pull/279 addresses auto-recovery from user interruption, but not the controller start failure not being handled properly and reported as successful when it was not

Maverobot commented 1 year ago

A new release would be nothing but a tag on it and pushing it to branches for different ros versions. The develop branch is in a stable state and can be used already.

> And regarding issue https://github.com/frankaemika/franka_ros/issues/232, I don't think so as PR https://github.com/frankaemika/franka_ros/pull/279 addresses auto-recovery from user interruption, but not the controller start failure not being handled properly and reported as successful when it was not

You are right.

Maverobot commented 1 year ago

@HumbertoE The behavior described in https://github.com/frankaemika/franka_ros/issues/232 has existed since the first release of franka_ros, if I am not mistaken. Correcting it would require quite a lot of refactoring in the code base. If https://github.com/frankaemika/franka_ros/issues/232 is not blocking any of your work, I would give it a lower priority on our list.

Please give me more information/explanation if the issue https://github.com/frankaemika/franka_ros/issues/232 is still a blocker.

HumbertoE commented 1 year ago

We changed to the latest commit (https://github.com/frankaemika/franka_ros/commit/2d458abad7390bb73771c91787c361c2b5cfa6ad) in the devel branch, tested the system and it worked well. Now the error flag is automatically reset after the user stop is gone (the button is unpressed).

Thank you for your responses and for the fix.

HumbertoE commented 1 year ago

In regards to the issue https://github.com/frankaemika/franka_ros/issues/232, we have it classified as a fatal error still. It doesn't happen too often, but when there is a problem loading the controller (we switch often between controllers and it usually happens then) we can't know that there is a problem but the robot can't move.

In that case we have to either restart the system or trigger the controller load again. We could do the latter programmatically if we could detect that the controller was not loaded properly, but without this information, our system's autonomy is compromised.

Maverobot commented 1 year ago

Thanks for the quick feedback.

> but when there is a problem loading the controller (we switch often between controllers and it usually happens then) we can't know that there is a problem but the robot can't move.

If I understand it correctly, /franka_state_controller/franka_states/robot_mode == 2 tells if the controller is in charge of the robot. Can you please give me examples which cannot be covered by this check?

HumbertoE commented 1 year ago

Sorry for the late reply. I didn't see your reply since the issue is closed.

> /franka_state_controller/franka_states/robot_mode == 2 tells if the controller is in charge of the robot

We do subscribe to this topic and check the robot mode, but it happened multiple times to us that the robot mode is 2, no error is logged, calling controller_manager/list_controllers says the position_joint_trajectory_controller is running, but the controller doesn't respond when trying to execute a trajectory.

This usually happens after we switch back from another controller and is fixed by manually "switching" to the position_joint_trajectory_controller through the controller_manager/switch_controller service. My guess is that the controller was not properly loaded, but I don't understand why no error is raised and nothing indicates that this was the case.
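Since none of the reported signals reveals the failure, one possible workaround is to compare what the system reports with what the robot actually does. The sketch below is an assumption-laden illustration, not something franka_ros provides: `looks_stalled` and its helpers are hypothetical, `controllers` stands for (name, state) pairs taken from a controller_manager/list_controllers response, and `q_before` / `q_after` are joint positions sampled just before a trajectory is commanded and some time after. The 1e-3 rad tolerance is an arbitrary placeholder.

```python
ROBOT_MODE_MOVE = 2  # value quoted in this thread for "controller in charge"


def controller_state(controllers, name):
    """Return the reported state of `name` from (name, state) pairs, or None if absent."""
    for ctrl_name, state in controllers:
        if ctrl_name == name:
            return state
    return None


def joints_moved(q_before, q_after, tol_rad=1e-3):
    """True if any joint position changed by more than tol_rad."""
    return any(abs(a - b) > tol_rad for a, b in zip(q_before, q_after))


def looks_stalled(robot_mode, controllers, name, q_before, q_after, tol_rad=1e-3):
    """True when everything reports healthy but a commanded motion produced no movement."""
    reports_ok = (robot_mode == ROBOT_MODE_MOVE
                  and controller_state(controllers, name) == "running")
    return reports_ok and not joints_moved(q_before, q_after, tol_rad)
```

When `looks_stalled` returns True for the position_joint_trajectory_controller, a supervising node could repeat the manual fix described above by calling controller_manager/switch_controller again. Whether that is reliable enough for unattended operation is untested here.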

I am not sure why this error happens, but it is also mentioned in the second point of this 232 comment.

Maybe we can continue this conversation in an issue dedicated to this topic. Should we do it in #232? Or I could open a different issue and mention 232 in it.

Maverobot commented 1 year ago

@HumbertoE Thanks for the reply. Let's have a dedicated issue for the topic and mention 232 in it. This gives us a clear overview of the issue that we want to solve.

Besides, it would be very helpful if you can provide a minimal example to reproduce this issue reliably.

HumbertoE commented 1 year ago

@Maverobot Thank you for the quick reply.

Right now we don't have a way to replicate the issue, since we haven't seen what causes the controller to stop working. I will try to find a way to make it fail with certainty, or just switch the controllers very often, wait for it to fail, and then report what we see (no error logged, robot_mode == 2, controller_manager says it is running).

Before opening the issue, I will first try to find such an example to reproduce the error. If I'm not able to do so after some time, I would still like to open the issue, explain what we observed and when, and ask for your opinion there on what we could do.

Thank you in advance.
