No error raised when controller stop working

frankaemika / franka_ros

ROS integration for Franka research robots

https://frankaemika.github.io

Apache License 2.0

364 stars 313 forks source link

No error raised when controller stop working #326

Open HumbertoE opened 1 year ago

HumbertoE commented 1 year ago

Context

System version: 5.2.1 libfranka version: 0.10.0 franka_ros version: 0.10.2 (unreleased), on commit 2d458abad7390bb73771c91787c361c2b5cfa6ad, one commit behind the state of develop when this issue was made. controllers used: position_joint_trajectory_controller and a custom controller . We switch from one to another during operation regularly. The "controller" used to interface with the robot state is used. MoveIt related packages: moveit: at commit f94f958337d4791a13ba48de25bac9c9c60d3ce4 moveit_msgs: at commit 2b9c921f0e519704f898ebaec1663a4dd25b5203 moveit_resources: at commit 0be98e1b263e823fa41b653359da582dc585c827 geometric_shapes: at commit 0991e05d2ddc8007691b3ee907814f3f95f111f7 rviz_visual_tools: at commit f94f958337d4791a13ba48de25bac9c9c60d3ce4 moveit_visual_tools: at commit f9585abc4900b998262a4abb6b355b4d55c65dfb moveit_tutorials: at commit 71ab5ee7d4c9db9b67363b6abb385fe33cd02ddb panda_moveit_config: at commit 625be99465d1cf897297f0fde414f90349e3015f moveit_calibration: at commit ff6918e380791d82ff859b251297ac36bf1e1b2b

Problem

Sometimes the position_joint_trajectory_controller controller stops working (most of the cases after a controller switch) and, to our best knowledge, it is not reported as such. When this happens:

No error is raised
Nothing abnormal is logged to the terminal
/franka_state_controller/franka_states/robot_mode is 2
calling controller_manager/list_controllers says the position_joint_trajectory_controller is running

But whenever a new trajectory is sent to move_group (moveit node) for the exectution, the controller is unresponsive and the trajectory execution fails with the following errors:

[ WARN] [1677695839.528682019] [/move_group]: Controller 'position_joint_trajectory_controller' failed with error GOAL_TOLERANCE_VIOLATED: panda_joint2 goal error -0.160081
[ WARN] [1677695839.528764014] [/move_group]: Controller handle position_joint_trajectory_controller reports status ABORTED
[ INFO] [1677695839.528796047] [/move_group]: Completed trajectory execution with status ABORTED ...
[ INFO] [1677695839.528885006] [/move_group]: Execution completed: ABORTED
[ INFO] [1677695839.529366444] [/move_group]: Received event 'stop'

The controller starts working normally again after switching the controllers again. To do this we:

Call the controller_manager/switch_controller service to switch to our custom controller. We do this because we can't make the position_joint_trajectory_controller run since it is "still running".
Call the controller_manager/switch_controller service to switch to the position_joint_trajectory_controller

The problem that we would like to be addressed is not the failure of the controller itself, but the fact that it is not reported as not working since we can't identify why the controller failed or do something about it automatically if it is not reported.

Steps to reproduce

We don't know the exact cause of the controller to fail when it happens to us, so to reproduce the error, I just made a script that moves a little bit back with the position_joint_trajectory_controller, then moves forward a little bit with our custom controller and repeats this 200 times. I made it very rough to provoke the controller failure more easily and I tested it twice and in both cases the error happened after not too long.

@Maverobot, we would prefer to send the script and custom controller privately. Could you tell me how should I share it with you? I can make a workspace with the script in one package and the controller in another one. My mail is h.estrada@organifarms.de

The only issue is that we also use a custom planner that uses some packages with some modifications and it could be complicated to share. The robot should only plan and move to a specific and given joint configuration. I will leave an empty method in the script move_to_this_position that should do this. Would it be possible for you to plan and move it with the default MoveIt tools or however is easier for you? Thank you in advance for your time.

Maverobot commented 1 year ago

Hi, thanks for the detailed description and explanations. You can send the stuffs to my email zheng.qu[at]franka.de. I will try to get it running next week and get back to you if questions rise.

HumbertoE commented 1 year ago

Thank you. Please let me know here or by e-mail if anything is unclear. I will prepare everything and send it today or tomorrow.

Maverobot commented 1 year ago

Analysis

We tried to reproduce the issue and found out that it is very likely being caused by a race condition in ros_controllers/JointTrajectoryController.

The current trajectory of the robot can be set by two sources:

In the starting method of JointTrajectoryController, the current trajectory is set to a trajectory consisting only positions of the current robot configuration. JointTrajectoryController provides a ROS action (/position_joint_trajectory_controller/follow_joint_trajectory/goal) for receiving desired motion trajectory, which is saved as the current trajectory.

In the attached script test_controller_switch.py, it can be seen

# Switch to position_joint_trajectory_controller
if self.controller_switch_interface.switch_to_traj_controller():
  rospy.loginfo("switch position_joint_trajectory_controller successful")
else:
  rospy.logerr("switch position_joint_trajectory_controller unsuccessful")

# Move manipulator to start position
self.move_to_start_position()

What the code is doing is

First switching the controller, where the starting method of JointTrajectoryController is called and the current trajectory is set to the "hold trajectory" consisting only the current joint configuration.
Then a new trajectory to the start position is sent via ROS action.

However, a another step 1.1 can happen between the 1. and 2. step above:

1.1 The controllers are reset due to franka_control_node, where the starting method of JointTrajectoryController is called again and the current trajectory is set to the "hold trajectory" consisting only the current joint configuration.

If the "set to hold trajectory" of the step 1.1 is done before 2. step then everything is working as expected. If later than 2. step, then the trajectory, received by ROS action, will be overwritten by the "hold trajectory".

Solutions

Option 1: The most simple solution/workaround is to add a sleep (e.g. rospy.sleep(0.1)) before self.move_to_start_position(). From step 1. to step 1.1 is ideally only 1ms delay, the sleep duration can very likely to reduced to much shorter. I did a local test of 2000 controller switches and the problem never occurs.
Option 2: More investigation to check if control_manager.update(now, period, true); is necessary.

Regardless of the above options, the race condition of setting trajectory via

ROS action
starting method of JointTrajectoryController

still remains unsolved. Maybe writing a ticket at ros_controllers can be also helpful.

HumbertoE commented 1 year ago

Thank you for the analysis.

If this race condition would happen, it would prevent the first sent trajectory to be overwritten as you explain, but then i think new trajectories sent after this would have no problem, which is not the case.

Because currently when a trajectory fails, if new trajectories are sent, they all fails too until the controllers are switched again.

In the current test case, if after this failure the script is ran again, first it will try to switch to to the position_joint_trajectory_controller, it will say it is already running, then it will send the trajectory goal to the action server and it will fail again with a message similar to: Controller 'position_joint_trajectory_controller' failed with error GOAL_TOLERANCE_VIOLATED: panda_joint2 goal error -0.160081

Maverobot commented 1 year ago

Because currently when a trajectory fails, if new trajectories are sent, they all fails too until the controllers are switched again.

I cannot reproduce this with the code I have.

Once the trajectory gets stuck, I

killed the test script with killall test_controller_switch.py
then tried to send another 2 trajectories via moveit python interface with the script below.
then observed that the trajectories was executed successfully. (except for the second trajectory oscillation appeared, probably related to #318 )

Am I missing something?

The script I used in the step 2 above:

#!/usr/bin/python3
import rospy
from of_controllers_interface import ControllerSwitchInterface

from sensor_msgs.msg import JointState
from std_msgs.msg import Float64MultiArray

from actionlib import SimpleActionClient
from sensor_msgs.msg import JointState
from trajectory_msgs.msg import JointTrajectoryPoint
from control_msgs.msg import (
    FollowJointTrajectoryAction,
    FollowJointTrajectoryGoal,
    FollowJointTrajectoryResult,
)

class TestOscillation:
    """Class to reproduce oscilation"""

    def __init__(self):
        rospy.loginfo("Initializing TestOscillation Object")

        # Define approach position
        self.start_position = [
            1.5254816564668539,
            -1.2848733483893107,
            -0.005716291560688334,
            -1.8838260559849414,
            0.05710966156236622,
            2.1689345894430696,
            -0.03773784149181313,
        ]

        self.end_position = [
            0.5254816564668539,
            -1.2848733483893107,
            -0.005716291560688334,
            -1.8838260559849414,
            0.05710966156236622,
            2.1689345894430696,
            -0.03773784149181313,
        ]

        # Start controller switch interface
        self.controller_switch_interface = ControllerSwitchInterface()

        # Initialize list of current joint states
        self.current_joint_states = []

        # Subscriber for robot joint states
        rospy.Subscriber("/joint_states", JointState, self.__callback_joint_states)

        # Create client for sending goal to position controller
        action = rospy.resolve_name(
            "position_joint_trajectory_controller/follow_joint_trajectory"
        )
        self.pos_client = SimpleActionClient(action, FollowJointTrajectoryAction)
        rospy.loginfo("move_to_start: Waiting for '" + action + "' action to come up")
        self.pos_client.wait_for_server()

        self.robot_state_topic = rospy.resolve_name("joint_states")

    def __callback_joint_states(self, joint_states_msg):
        """receives updated joint states and save them in an attribute"""
        self.current_joint_states = joint_states_msg.position

    def move_to_position(self, position):
        joint_state = rospy.wait_for_message(self.robot_state_topic, JointState)
        rospy.loginfo("move_to_position: Waiting for message on topic '" + self.robot_state_topic + "'")
        max_movement = max(
            abs(self.start_position[i] - pos)
            for i, pos in enumerate(joint_state.position)
        )

        point = JointTrajectoryPoint()
        point.time_from_start = rospy.Duration.from_sec(
            # Use either the time to move the furthest joint with 'max_dq' or 500ms,
            # whatever is greater
            max(max_movement / rospy.get_param("~max_dq", 0.5), 0.5)
        )
        goal = FollowJointTrajectoryGoal()
        goal.trajectory.joint_names = joint_state.name
        point.positions = position
        point.velocities = [0] * len(position)

        goal.trajectory.points.append(point)
        goal.goal_time_tolerance = rospy.Duration.from_sec(0.5)

        # rospy.logwarn("The movement to the start position has to be implemented")
        rospy.loginfo("Sending trajectory Goal to move into initial config")
        self.pos_client.send_goal_and_wait(goal)

        result = self.pos_client.get_result()
        if result.error_code != FollowJointTrajectoryResult.SUCCESSFUL:
            rospy.logerr(
                "move_to_start: Movement was not successful: "
                + {
                    FollowJointTrajectoryResult.INVALID_GOAL: """
                The joint pose you want to move to is invalid (e.g. unreachable, singularity...).
                Is the 'joint_pose' reachable?
                """,
                    FollowJointTrajectoryResult.INVALID_JOINTS: """
                The joint pose you specified is for different joints than the joint trajectory controller
                is claiming. Does you 'joint_pose' include all 7 joints of the robot?
                """,
                    FollowJointTrajectoryResult.PATH_TOLERANCE_VIOLATED: """
                During the motion the robot deviated from the planned path too much. Is something blocking
                the robot?
                """,
                    FollowJointTrajectoryResult.GOAL_TOLERANCE_VIOLATED: """
                After the motion the robot deviated from the desired goal pose too much. Probably the robot
                didn't reach the joint_pose properly
                """,
                }[result.error_code]
            )

        else:
            rospy.loginfo("move_to_start: Successfully moved into start pose")

    def move_back_and_forth(self, n):

        for i in range(n):
            rospy.loginfo(f"Repetition {i+1} out of {n}")
            self.move_to_position(self.start_position)
            self.move_to_position(self.end_position)

if __name__ == "__main__":
    # moveit_commander.roscpp_initialize(sys.argv)
    rospy.init_node("test_oscillation", disable_signals=True)

    test = TestOscillation()
    test.move_back_and_forth(1)

HumbertoE commented 1 year ago

The way you reproduce it looks good.For the execution we currently use the Move Group Python interface as shown in the MoveIt tutorial, but it should be the same as at the end of the day they also just send a request to the action server with their execute method.

Did you restart the franka control node or switch the controller after the trajectory fails and before executing the move_back_and_forth method?

Maverobot commented 1 year ago

Did you restart the franka control node or switch the controller after the trajectory fails and before executing the move_back_and_forth method?

No. I did not do anything else except the steps listed in my previous comment.

Did you manage to find out the reason why newly sent trajectory was not executed after "stuck" in your case?

HumbertoE commented 1 year ago

No, I don't know why this doesn't happen to you. It happens to us also in normal operation and even when sending more trajectories, all fail with a message similar to: Controller 'position_joint_trajectory_controller' failed with error GOAL_TOLERANCE_VIOLATED: panda_joint2 goal error -0.160081 I will try doing the controller switch test again until it fails and then execute your move_back_and_forth and see if I get the same results as you, but now due some changes in our hardware, I cannot test until Wednesday, maybe tomorrow's evening. I will do it as soon as I can. Could you try running the controller switch test and once it fails, run it again? In this case for us:

A switch to the position_joint_trajectory_controller is attempted, it says it is already running and do nothing.
It sends the trajectory goal to the action server and it fails again with a message similar to: Controller 'position_joint_trajectory_controller' failed with error GOAL_TOLERANCE_VIOLATED: panda_joint2 goal error -0.160081

If the second time the trajectory is executed successfully, then we are getting different results meaning that we might be doing something weird.

Also the trajectory during the controller switch could fail with a "velocity discontinuity" error. This is something different from changing the controller with the robot still moving that just happens because the test is a bit rough, but the failure we would like to reproduce is when it says it is ok but when sending a trajectory, it fails.

Maverobot commented 1 year ago

Could you try running the controller switch test and once it fails, run it again?

Yes. I tried that. After getting stuck, I killed the controller switch test script and ran it again. No error was shown and the motions were executed successfully (until the next "stuck").

but the failure we would like to reproduce is when it says it is ok but when sending a trajectory, it fails.

It would be very helpful if you can come up with code, with which we can reproduce it. So far the issue here is not really related to franka_ros.

I just sent you per email the modified test_controller_switch.py script which you can try on your setup.

HumbertoE commented 1 year ago

After getting stuck, I killed the controller switch test script and ran it again. No error was shown and the motions were executed successfully (until the next "stuck").

This is weird because for us it stop working. I will do the experiment again using your script and sending the trajectories directly to the position_joint_trajectory_controller as you did.

It would be very helpful if you can come up with code, with which we can reproduce it.

If the trajectories are still failing for us after using your script, I will see if I can catch the error where I want the controller switch test to stop to make sure it stops for the reason we want and send it to you. As I said now I can't run tests with the robot, but I will as soon as possible and write you back. Maybe tomorrow evening I can already test and send you my feedback. The latest, I will send it on Wednesday.

HumbertoE commented 1 year ago

Sorry for the time, I wanted to test thoroughly the script you sent me in comparison with the one I used, but I have good news.

With your script you sent me I was able to get the same result as you did, it failing some times (your theory of the race condition would explain this) but only for one sent trajectory, the ones sent afterwards was executed correctly until the error happened again.

It is different than the behavior I had before, when all sent trajectories failed after the first one failed and until the controller was switched. I could determine the difference.

In the script you sent me, the action used to request the robot to move is: position_joint_trajectory_controller/follow_joint_trajectory

For the execution we used the Move Group Python interface as shown in the MoveIt tutorial that uses their execute method to communicate with their action: execute_trajectory

The position_joint_trajectory_controller/follow_joint_trajectory action result is successful even when the robot doesn't move, which is why I added a check in the code after a movement to check if the achieved joint positions were the goal joint positions.

I also changed the code to use the execute_trajectory action. This one did give an error when the movement failed, but afterwards and for some reason that I don't know, failed the next sent movements until the controllers are switched again.

In case you want to reproduce the error, I will send you the modified script. The steps to reproduce the error are:

Run test_controller_switch.launch
Press enter after getting to the start configuration the first time and getting the message Robot and start position. Press enter to start moving back and forth.
Wait until the test ends with the error: move_to_start: Movement was not successful. Error code: -4 (or in case you modify the script to use the position_joint_trajectory_controller/follow_joint_trajectory action, it would end with move_to_start: Did not arrive to desired configuration)
Run test_controller_switch.launch
Observe how it fails again and again (unless you modify the script to use the position_joint_trajectory_controller/follow_joint_trajectory action)
Run switch_to_trajectory_controller.py
Run test_controller_switch.launch
Observe that the movement is now successful

The easiest way for us to solve the problem would be then:

Use instead the position_joint_trajectory_controller/follow_joint_trajectory action and develop some error handling
Write an issue in moveit_ros about the problems we are having with the execute_trajectory action

In this way the execution of trajectories won't stop working and the only problem would be the race condition you mentioned, which is a smaller problem and for which we could add a sleep, but also would be nice to ensure it doesn't happen.

In this regard, why does this happens? This reset sounds unnecessary, but I don't know much about the Franka control node implementation.

1.1 The controllers are reset due to franka_control_node, where the starting method of JointTrajectoryController is called again and the current trajectory is set to the "hold trajectory" consisting only the current joint configuration.

Thank you for modifying the script, reproducing the situation and for the quick responses you gave.

I will send you the python file in case you want to see it

HumbertoE commented 1 year ago

This issue is no longer a blocker for us. We will use the follow_joint_trajectory action and a sleep after changing the controller as recommended to solve it.

The remaining thing to do would be to evaluate if the reset done by franka_control_node is necessary. If not, taking it out would avoid race conditions and we could take the sleep away, but this is not a blocker and would not result in a fatal error for us anymore.

HumbertoE commented 1 year ago

I noticed that the position_joint_trajectory_controller/follow_joint_trajectory action has the same issue as the execute_trajectory action if goal_tolerances are specified in the sent goal.

Before I said that:

The position_joint_trajectory_controller/follow_joint_trajectory action result is successful even when the robot doesn't move

This is true only if the sent goal has no goal tolerances. I specified them (this is also done in the execute_trajectory action) and tighten the goal_time_tolerance and then the action returns as unsuccessful. When this happens, then any trajectory sent afterwards is also unsuccessful.

The most relevant code added was the following. I will also sent you @Maverobot the script by mail.

goal_tolerances = []
for joint_name in self.joint_names:
    goal_tolerance = JointTolerance()
    goal_tolerance.name = joint_name
    goal_tolerance.position = 0.01
    goal_tolerance.velocity = 0.1
    goal_tolerance.acceleration = 0.1
    goal_tolerances.append(goal_tolerance)

goal = FollowJointTrajectoryGoal()

goal.goal_tolerance = goal_tolerances
goal.goal_time_tolerance = rospy.Duration.from_sec(0.1)

This is not critical for us as we will use the position_joint_trajectory_controller/follow_joint_trajectory action WITHOUT goal tolerances and add our own check to see if it arrives to the desired position, but it might affect if the controller is unsuccessful for other reason.

I am not sure if this is a problem from ros_controllers or the hardware interface between the controller and the robot. I can make an issue in ros_controllers if you think it is the first one.

Maverobot commented 1 year ago

I can reproduce the issue with the script you sent me. But it does not seem to be related to goal.goal_tolerance as no error occurs when I add the code snippet of your comment into the script. I assume it was caused by some other changes of you. For convenience, I will send you the complete script per email.

HumbertoE commented 1 year ago

Thanks for the quick reply

Doesn't the action return a FollowJointTrajectoryResult.INVALID_GOAL result in your code when the robot doesn't move (due to the race condition)? This is from when trajectories start failing.

The main difference I can see is that you send a 1 point trajectory and I include also the initial point.

I don't get why the different behavior and I can't test much now, but I will as soon as I can.

HumbertoE commented 1 year ago

Hello @Maverobot ,

as we agreed before, we were using:

the position_joint_trajectory_controller/follow_joint_trajectory action WITHOUT goal tolerances and add our own check to see if it arrives to the desired position

But the error of the trajectory execution failing once and then getting stuck and failing for any other sent trajectory happened again.

Because of this, in the last test we added goal tolerances, but very high ones:

self.goal_tolerances = []
for joint_name in self.joint_names:
    goal_tolerance = JointTolerance()
    goal_tolerance.name = joint_name
    goal_tolerance.position = 6
    goal_tolerance.velocity = 6
    goal_tolerance.acceleration = 6
    self.goal_tolerances.append(goal_tolerance)

goal = FollowJointTrajectoryGoal()
goal.goal_tolerance = self.goal_tolerances
goal.goal_time_tolerance = rospy.Duration.from_sec(0.5)

This works most of the time, but in the lab we already saw the problem occurring again.

As I said before, I think the problem happens when the controller itself fails, so when the controller itself says the trajectory fails because the tolerances were not met. When this happens, for some reason all further sent trajectories fail too.

What we were doing so far was letting the tolerances very loose or not specified so the controller itself won't think it fails, but this is not a good enough solution since the error can already happen as it happened already to us.

Could you please help us seeing how to make the controller able to don't get stuck after failing once?

Maverobot commented 1 year ago

Hi Humberto,

it turns out to be indeed a bug in the franka_ros. Thanks again for reporting it!

The issue was caused by writing/reading to same memory from different threads without using any synchronisation mechanism (mutex, atomic etc). It should be now fixed by the commit https://github.com/frankaemika/franka_ros/commit/b27e85734457ffd45102cc55def74535215ac96c which has been merged into develop branch.

HumbertoE commented 1 year ago

Hello @Maverobot,

It's very nice that you found that error. This week I can't test in our physical system, but I will prepare everything to test when I am able to.

One question in regards of exactly what this solves. There were 2 problems:

The controller fails a trajectory. This was caused by a race condition or that was the theory we stated in this conversation and can be avoided by waiting a bit. This wait was ignored on purpose on the minimal example to trigger the error.
Once the controller fails once, all the following sent trajectories fail too. This was the most critical part.

Which part does this fix would solve?

Again thanks for finding this problem

Maverobot commented 1 year ago

Hi Humberto, the bug which I fixed was (2). To be exact, the old controller sometimes never finished after controller switching which resulted in the observation that the controller fails to reach a certain goal because the command signals for the new controller were not processed by the old controller.

HumbertoE commented 1 year ago

Hello Zheng,

That is very nice. We will test it when we are back from field testing. Thank you.

About the problem 1, the race condition, it would be ideal if it could be fixed (not sure if the last update is necessary for the ROS controller) so investigation would be required as you stated in this comment:

Option 2: More investigation to check if control_manager.update(now, period, true); is necessary.

But the most critical part was the part 2.