facebookresearch / fairo

A modular embodied agent architecture and platform for building embodied agents
MIT License

Polymetis drops a significant number of commanded actions due to communication failure #729

Open mihdalal opened 2 years ago

mihdalal commented 2 years ago

Description

Unsure if this is actually a bug or just a hardware issue, but I find that Polymetis ends up dropping a significant number of commanded actions due to packet loss. This is the error I get on my NUC (when running random end-effector control actions through the Polymetis API):

Loaded new controller.
Setting Torch policy to terminated.
Setting Torch policy to terminated.
Terminating custom controller, switching to default controller.
Loaded new controller.
libfranka: Move command aborted: motion aborted by reflex! ["communication_constraints_violation"]
control_command_success_rate: 0.7154 packets lost in a row in the last sample: 27
.
Performing automatic error recovery. This calls franka::Robot::automaticErrorRecovery, which is equivalent to pressing and releasing the external activation device.
Automatic error recovery attempt 1/3 ...
Robot operation recovered.
.
Warning: Interrupted control update greater than threshold of 1000000000 ns. Reverting to default controller...
Terminating custom controller, switching to default controller.
Loaded new controller.
libfranka: Move command aborted: motion aborted by reflex! ["communication_constraints_violation"]
control_command_success_rate: 0.7326 packets lost in a row in the last sample: 26
.
Performing automatic error recovery. This calls franka::Robot::automaticErrorRecovery, which is equivalent to pressing and releasing the external activation device.
Automatic error recovery attempt 1/3 ...
Robot operation recovered.

Is this to be expected? I am finding that a significant number of actions are dropped, and I've had to add retrying on failure in my code to work around it, but that seems like a hack at best. Here is the code that produces this behavior:

import torch

from polymetis import RobotInterface, GripperInterface
import numpy as np
import time

if __name__ == "__main__":
    # Initialize robot interface
    robot = RobotInterface(
        ip_address="172.26.122.200",
    )
    # Reset
    robot.go_home()
    time.sleep(0.5)

    # Get ee pose
    for i in range(10):
        ee_pos, ee_quat = robot.pose_ee()
        print(f"Current ee position: {ee_pos}")
        print(f"Current ee orientation: {ee_quat}  (xyzw)")

        # Command robot to a new ee pose (apply a random positive offset on each axis)
        # note: can also be done with robot.move_ee_xyz
        delta_ee_pos_desired = torch.Tensor(np.random.uniform(0, .1, 3))
        ee_pos_desired = ee_pos + delta_ee_pos_desired
        print(f"\nMoving ee pos to: {ee_pos_desired} ...\n")
        state_log = robot.set_ee_pose(
            position=ee_pos_desired, orientation=None, time_to_go=2.0
        )

        # Get updated ee pose
        ee_pos, ee_quat = robot.pose_ee()
        print(f"New ee position: {ee_pos}")
        print(f"New ee orientation: {ee_quat}  (xyzw)")

My NUC has the following specs (in case it's an issue with the CPU being too weak):

H/W path       Device     Class          Description
====================================================
                          system         NUC7i5BNH
/0                        bus            NUC7i5BNB
/0/0                      memory         64KiB BIOS
/0/35                     memory         8GiB System Memory
/0/35/0                   memory         8GiB SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0.4 ns)
/0/35/1                   memory         [empty]
/0/3a                     memory         128KiB L1 cache
/0/3b                     memory         512KiB L2 cache
/0/3c                     memory         4MiB L3 cache
/0/3d                     processor      Intel(R) Core(TM) i5-7260U CPU @ 2.20GHz
/0/100                    bridge         Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
/0/100/2                  display        Intel Corporation
/0/100/8                  generic        Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model
/0/100/14                 bus            Sunrise Point-LP USB 3.0 xHCI Controller
/0/100/14/0    usb1       bus            xHCI Host Controller
/0/100/14/0/8             communication  Bluetooth wireless interface
/0/100/14/1    usb2       bus            xHCI Host Controller
/0/100/14.2               generic        Sunrise Point-LP Thermal subsystem
/0/100/16                 communication  Sunrise Point-LP CSME HECI #1
/0/100/17                 storage        Sunrise Point-LP SATA Controller [AHCI mode]
/0/100/1c                 bridge         Sunrise Point-LP PCI Express Root Port #1
/0/100/1c.5               bridge         Sunrise Point-LP PCI Express Root Port #6
/0/100/1c.5/0  wlp58s0    network        Wireless 8265 / 8275
/0/100/1c.7               bridge         Sunrise Point-LP PCI Express Root Port #8
/0/100/1c.7/0             generic        RTS5229 PCI Express Card Reader
/0/100/1f                 bridge         Intel(R) 100 Series Chipset Family LPC Controller/eSPI Controller - 9D4E
/0/100/1f.2               memory         Memory controller
/0/100/1f.3               multimedia     Sunrise Point-LP HD Audio
/0/100/1f.4               bus            Sunrise Point-LP SMBus
/0/100/1f.6    eno1       network        Ethernet Connection (4) I219-V
/0/1           scsi0      storage        
/0/1/0.0.0     /dev/sda   disk           250GB Seagate BarraCud
/0/1/0.0.0/1   /dev/sda1  volume         511MiB Windows FAT volume
/0/1/0.0.0/2   /dev/sda2  volume         232GiB EXT4 volume
/1                        power          To Be Filled By O.E.M.

My question is primarily: given my hardware, is this behavior expected and should I buy a better NUC (if so, what specs are recommended? This would be great to add to the documentation!), and if not, is there possibly a bug?

1heart commented 2 years ago

Hi Murtaza, sorry for the delay! -- I was on leave for several weeks, and just noticed this.

I just want to check that you're running this on the real-time patch. Also, there are cases where communication constraints are violated depending on the controller logic you're using. cc @exhaustin for any insight on move_ee performance.
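
(A quick way to verify that, as a sketch: a PREEMPT_RT-patched kernel normally shows an "-rt" suffix in its release string.)

import platform

release = platform.uname().release  # e.g. "5.9.1-rt20" on a patched kernel
print(f"Kernel release: {release}")
print("RT kernel detected" if "-rt" in release else "No obvious RT suffix -- check your kernel")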

If you want to hop on a call to debug together, also happy to do that!

exhaustin commented 2 years ago

Sorry for overlooking this github issue.

I think it is highly likely that you are being limited by your hardware. In my experience, an Intel i5 @ 2.2GHz struggles to maintain the real-time communication loop. That said, this was tested two years ago on a different setup, so it might not hold for your particular machine.

A good way to check whether your particular hardware is capable is to install libfranka natively by following the official instructions, then run the provided communication test example.
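
(As a sketch of that check: the binary path below is an assumption and depends on where you built libfranka; the robot IP is the one from the script above.)

import subprocess

TEST_BIN = "./libfranka/build/examples/communication_test"  # adjust to your build location
ROBOT_IP = "172.26.122.200"

# Run libfranka's communication test against the robot and print its report,
# which includes the success-rate / lost-packet statistics to compare against Polymetis.
result = subprocess.run([TEST_BIN, ROBOT_IP], capture_output=True, text=True)
print(result.stdout)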

We've managed to successfully control the Franka using Polymetis on NUCs equipped with an i7 CPU, so if you are getting packet drops, I would advise switching to one of those.

1heart commented 2 years ago

We recently fixed several performance-related issues which could have been affecting you, in addition to upgrading to PyTorch 1.10 -- feel free to try it out!

1heart commented 2 years ago

Please raise a new issue if you encounter performance issues after this upgrade. Thanks!

stuart-fb commented 2 years ago

Hey - I'm running a NUC7 with an i5 @ 2.2GHz and seeing intermittent (every few seconds) communication constraint violation messages. Is this a known bad configuration? From the thread above it wasn't clear whether the issue was solved by the perf updates in December or whether older NUCs with i5s are simply too slow. I'm running 5.9.1-rt20 on Ubuntu 20.04.3. This NUC was stable running the ROS/libfranka stack on an older (5.4 series) rt kernel.

exhaustin commented 2 years ago

I have seen an i5 @ 2.2GHz be underperformant with libfranka independent of Polymetis, but my experience is limited to a few machines, so I wouldn't say with certainty that the configuration is bad, although it likely is.

Something you should do (on older i5 NUCs, before deeming them too slow) is to disable CPU frequency scaling and run the communication test to see if results improve. (Note that the libfranka communication test example is now included with the conda installation of Polymetis -- you no longer need to build libfranka to access it.)
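
(A small sketch for the CPU-scaling check, assuming a standard Linux sysfs layout: print each core's scaling governor; for real-time control you generally want "performance" rather than "powersave"/"ondemand".)

import glob

for path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")):
    with open(path) as f:
        print(path, "->", f.read().strip())

On most distributions the governor can then be switched with, e.g., sudo cpupower frequency-set -g performance.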

exhaustin commented 2 years ago

Update relevant to this issue: a new version of Polymetis is now available which fixes the annoying issue of intermittently triggering "communication_constraints_violation" when sending a trajectory-based controller. For local builds, simply pull the latest main and rebuild. For conda installations: conda update -c pytorch -c fair-robotics -c aihabitat -c conda-forge polymetis

AlexanderKhazatsky commented 2 years ago

I'm having this same issue, but I can't pull the most recent version because it requires the most up-to-date libfranka software, which would be a pain for me to install. Is there another way I can address my issue?

1heart commented 2 years ago

@AlexanderKhazatsky Consider building from source & checking out your version of libfranka, as described in the docs

AlexanderKhazatsky commented 2 years ago

When I tried this last week, it didn't work because the build procedure assumed a certain libfranka version - has this been addressed?

Also, I'm encountering these issues with the recommended NUC model.

1heart commented 2 years ago

On the main branch, please checkout the version of libfranka you require, as described in the docs linked above.

exhaustin commented 2 years ago

Discussions regarding building with alternative libfranka versions are continued at #1191