Interbotix / interbotix_ros_toolboxes

Support-level ROS Packages for Interbotix Robots
BSD 3-Clause "New" or "Revised" License
29 stars 68 forks source link

[Bug]: Race condition in xs_robot/arm.py and potentially other places using spin_once_until_future_complete #55

Closed tshumay closed 11 months ago

tshumay commented 11 months ago

What happened?

I was noticing some odd behaviour on a robot I was working on that is derived from the Interbotix arms. Specifically, when using the joystick control / launch file with a PS4 controller (xsarm_joy.launch.py), it appeared that the velocity and acceleration overrides for waking and sleeping the robot only worked sometimes. The intended scenario seems to be that the joystick configuration sets default velocity and acceleration to 200/100, but the wake and sleep calls override those values to 1500/750 to avoid moving too fast.

When the service call to set the vel/accel fails, the robot arm attempts to perform the wake/sleep at the 200/100 rate which is really fast, and sometimes causes motor faults.

Robot Model

Custom Dynamixel robot - similar to WindowX

Operating System

Ubuntu 22.04

ROS Distro

ROS 2 Humble

Steps To Reproduce

On my ROS2 Humble install, I found that just waking and sleeping the robot a number of times with the game controller after launching the xsarm_joy.launch.py script would ultimately surface the bug. Sometimes, it is subtle because it may be the case that the velocity registers are written, and then the position, and then acceleration, so you don't notice it as badly. It's the worst case if both velocity and and acceleration register writes are deferred until after the position command though because the robot will move very quickly to/from home/sleep.

Relevant log output

I enabled debug log level on the launch file and noticed that the log messages were not always happening in the right order to indicate that the vel/accel had been updated properly before the move commands. When it works, it looks like this:

[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 1, writing position command 2048.
[xs_sdk-1] [DEBUG] ID: 2, writing position command 783.
[xs_sdk-1] [DEBUG] ID: 4, writing position command 3031.
[xs_sdk-1] [DEBUG] ID: 6, writing position command 2048.
[xs_sdk-1] [DEBUG] ID: 7, writing position command 2536.
[xs_sdk-1] [DEBUG] ID: 8, writing position command 2048.
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Acceleration, value: 100.

But, sometimes I see things out of sequence, such as this, where the position command gets written in between the two register write commands, which really isn't supposed to happen: 

[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Velocity, value: 1500.
<< This block is out of sequence >>
[xs_sdk-1] [DEBUG] ID: 1, writing position command 2048.
[xs_sdk-1] [DEBUG] ID: 2, writing position command 783.
[xs_sdk-1] [DEBUG] ID: 4, writing position command 3031.
[xs_sdk-1] [DEBUG] ID: 6, writing position command 2048.
[xs_sdk-1] [DEBUG] ID: 7, writing position command 2536.
[xs_sdk-1] [DEBUG] ID: 8, writing position command 2048.
<<>>
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Acceleration, value: 100.

Worst case scenario is this, where both the velocity and acceleration commands happen after the move:

<< This is extra bad >>
[xs_sdk-1] [DEBUG] ID: 1, writing position command 2048.
[xs_sdk-1] [DEBUG] ID: 2, writing position command 783.
[xs_sdk-1] [DEBUG] ID: 4, writing position command 3031.
[xs_sdk-1] [DEBUG] ID: 6, writing position command 2048.
[xs_sdk-1] [DEBUG] ID: 7, writing position command 2536.
[xs_sdk-1] [DEBUG] ID: 8, writing position command 2048.
<<>>
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Velocity, value: 1500.
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Acceleration, value: 750.
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Velocity, value: 200.
[xs_sdk-1] [DEBUG] ID: 1, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 2, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 4, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 6, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 7, writing reg: Profile_Acceleration, value: 100.
[xs_sdk-1] [DEBUG] ID: 8, writing reg: Profile_Acceleration, value: 100.

Additional Info

I believe the culprit is the usage of self.core.executor.spin_once_until_future_complete() in the code below. I believe this function actually literally only does spin once, and thus, does not guarantee completion of the future. It's very poorly named in ROS, but I believe it is normally used internally by spin_until_future_complete(), which does ensure completion. I did a few tests, by placing an assert in the code afterward to ensure the future was done, and the code tripped the assert (assert future_moving_time.done() after the spin_once_until_future_complete() call).

def set_trajectory_time(
        self,
        moving_time: float = None,
        accel_time: float = None
    ) -> None:
        """
        Command the 'Profile_Velocity' and 'Profile_Acceleration' motor registers.

        :param moving_time: (optional) duration in seconds that the robot should move
        :param accel_time: (optional) duration in seconds that that robot should spend
            accelerating/decelerating (must be less than or equal to half the moving_time)
        """
        self.core.get_logger().debug(
            f'Updating timing params: {moving_time=}, {accel_time=}'
        )
        if moving_time is not None and moving_time != self.moving_time:
            self.moving_time = moving_time
            future_moving_time = self.core.srv_set_reg.call_async(
                RegisterValues.Request(
                    cmd_type='group',
                    name=self.group_name,
                    reg='Profile_Velocity',
                    value=int(moving_time * 1000),
                )
            )
            **self.core.executor.spin_once_until_future_complete(**
                future=future_moving_time,
                timeout_sec=0.1
            )
...

It's easy enough to fix this up, but there are a few other concerns here as well - the timeouts are another one. It's also theoretically possible for the calls to the register updates to timeout and fail, but the calling function won't know that they failed and will try to execute subsequent commands which may be undesirable.

Also, there seems to be a fair amount of usage of spin_once_until_future_complete in core.py, which might have similar issues. So, probably worth reviewing the other potential usage of it as well to make sure none of those cases are critical.

lukeschmitt-tr commented 11 months ago

This is a great find. Thanks for the comprehensive report. We'll look into this on our end and implement a fix.

lukeschmitt-tr commented 11 months ago

Please see if the changes in https://github.com/Interbotix/interbotix_ros_toolboxes/commit/6280950f8f6518e2c76e2f6134a3700423f70c7b fix the timing issue you describe.

For this change, we've also removed the timeout so the spinning will block until the futures are complete. This was done so users could be sure that the service to change the timing registers has returned before the arm begins moving.

tshumay commented 11 months ago

Great! Thanks for the quick action on that. I will integrate these changes to day and test things out in the next day or so and let you know, but those fixes seem totally reasonable at first glance.

tshumay commented 11 months ago

Just merged the code - the code in the arm.py seems correct, but in core.py, there are still several calls to self.executor.spin_once_until_future_complete(). Is that intentional?

lukeschmitt-tr commented 11 months ago

That will be a separate change once we finish testing those other functions locally.

tshumay commented 11 months ago

Great. In that case, I think this takes care of the bugs in the arm module I spotted. Thanks!