A complementary video for our paper can be found here. A video showcasing our grasping experiments can be found here:
Recently developed deep neural networks achieve state-of-the-art results in 6D object pose estimation for robot manipulation. However, these supervised deep learning methods require expensive annotated training data. Current approaches for reducing those costs frequently use synthetic data from simulations, but rely on expert knowledge and suffer from the domain gap when shifting to the real world. Here, we present a proof of concept for a novel approach of autonomously generating annotated training data for 6D object pose estimation. The approach is designed for learning new objects in operational environments while requiring little interaction and no expertise on the part of the user. We evaluate our autonomous data generation approach in two grasping experiments, where we achieve a grasping success rate similar to that of related work on a non-autonomously generated data set.
We created a Terminal User Interface to quickly test our implementations. After setting up your hardware and completing the installation, you can run it with:
$ python main.py
The Terminal User Interface gives you the following options:
Selections 1, 5, 6, and 7 of the Terminal User Interface are hardware dependent (please see the Hardware section for setup instructions). The remaining selections can be used either with data acquired by your own setup or with our data (please see the Data section for download instructions).
Dependencies:
Python 3.6 packages (the package versions we used are given in brackets):
Run Terminal User Interface:
In order to conduct your own grasping experiments or acquire new data, you need an RGB-D camera, an industrial robot arm, and a gripper. We use a "Realsense-435" depth camera, the "UR-5 CB3" robot arm, and the "Robotiq 2F-85" gripper. Furthermore, you need to find the hand-eye calibration for your setup and adapt the robot view points used for data acquisition and grasping.
Robot and Gripper: We do not provide any drivers for the robot and gripper. If you want to use your own setup, you will need to write your own drivers and communication. We provide a "robot and gripper" controller in "robotcontroller/TestController.py". It uses a "robot and gripper" client to interact with the hardware. You can take this as a starting point to connect your hardware. Please make sure that all functions in the "RobotController" are callable.
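As a rough, hypothetical sketch of what such a driver could look like (the method names, the URScript command, and the connection details below are assumptions, not the actual "RobotController" interface; align them with the functions defined in "robotcontroller/TestController.py"):

```python
# Hypothetical sketch of a custom robot/gripper controller.
# Method names and signatures are assumptions; match them to the
# functions actually required by robotcontroller/TestController.py.

import socket


class MyRobotController:
    """Minimal controller that forwards commands to a robot/gripper client."""

    def __init__(self, host="192.168.0.2", port=30002):
        # Connection details depend on your robot; a UR CB3 accepts URScript
        # over TCP, other robots need their own protocol and driver.
        self.sock = socket.create_connection((host, port), timeout=5.0)

    def move_to_pose(self, pose):
        """Move the TCP to a 6D pose [x, y, z, rx, ry, rz] (meters, axis-angle)."""
        cmd = "movel(p[{}, {}, {}, {}, {}, {}], a=0.5, v=0.1)\n".format(*pose)
        self.sock.sendall(cmd.encode("utf-8"))

    def get_tcp_pose(self):
        """Return the current TCP pose; query your robot's state interface here."""
        raise NotImplementedError

    def open_gripper(self):
        raise NotImplementedError  # send an open command to your gripper driver

    def close_gripper(self):
        raise NotImplementedError  # send a close command to your gripper driver
```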
Camera: If you want to use your own RGB-D camera, you can replace our "DepthCam" in "depth_camera/DepthCam.py". Please make sure that the functions of your "DepthCam" work similarly to our implementation. If you have a Realsense-435, you can use our DepthCam implementation. Please make sure you have installed the RealSense SDK and pyrealsense2 (see dependencies).
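For reference, a minimal RGB-D capture with pyrealsense2 looks roughly like the following; the class and method names are placeholders and should mirror the interface of "depth_camera/DepthCam.py":

```python
# Minimal pyrealsense2 capture sketch; the class and method names are
# placeholders, not the interface of depth_camera/DepthCam.py.

import numpy as np
import pyrealsense2 as rs


class SimpleDepthCam:
    def __init__(self, width=640, height=480, fps=30):
        self.pipeline = rs.pipeline()
        config = rs.config()
        config.enable_stream(rs.stream.depth, width, height, rs.format.z16, fps)
        config.enable_stream(rs.stream.color, width, height, rs.format.bgr8, fps)
        self.pipeline.start(config)
        # Align depth to the color frame so pixels correspond 1:1.
        self.align = rs.align(rs.stream.color)

    def get_frames(self):
        """Return an aligned (color, depth) pair as numpy arrays."""
        frames = self.align.process(self.pipeline.wait_for_frames())
        color = np.asanyarray(frames.get_color_frame().get_data())
        depth = np.asanyarray(frames.get_depth_frame().get_data())
        return color, depth

    def stop(self):
        self.pipeline.stop()
```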
Hand-Eye-Calibration: We use an ArUco board for hand-eye calibration. You can use our hand-eye calibration implementations to get the camera poses. However, in order to get the robot poses, you need to implement your own robot controller first (see the Hardware section). We do not provide the implementation of our hand-eye calibration method, since we reused it from another project and it is written in C++ with further requirements (our implementation is based on "CamOdoCal").
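If you prefer a Python alternative to our CamOdoCal-based C++ pipeline, the hand-eye problem can also be solved with OpenCV's `cv2.calibrateHandEye` (available since OpenCV 4.1). This is only a sketch of that alternative, not our implementation; the variable and function names are illustrative:

```python
# Illustrative eye-in-hand calibration with OpenCV (an alternative to our
# CamOdoCal-based C++ implementation). Inputs are paired robot and camera
# poses recorded at the same view points.

import cv2
import numpy as np


def hand_eye_from_pose_pairs(R_gripper2base, t_gripper2base,
                             R_target2cam, t_target2cam):
    """Return the 4x4 camera-to-gripper transform from paired poses.

    R_gripper2base / t_gripper2base: robot (gripper -> base) rotations and
    translations from your RobotController, one per view point.
    R_target2cam / t_target2cam: ArUco board (target -> camera) poses
    estimated from the camera images at the same view points.
    """
    R, t = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base,
        R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI,
    )
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t.ravel()
    return T
```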
View-Points: Data acquisition requires a set of view points, which are unique to your setup, so remember to create your own set of view points. Grasping also requires a set of view points, which needs to be updated according to your setup. Please find the view points here. You can use our path creation or implement your own method.
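The exact view-point format depends on your setup; as a purely hypothetical sketch (the poses, values, and function names below are placeholders, not the repository's format), a set of view points could be stored as TCP poses and visited one by one:

```python
# Hypothetical view-point definition: each entry is a TCP pose
# [x, y, z, rx, ry, rz] in the robot base frame (meters, axis-angle).
# The values are placeholders; adapt them to your own workspace.
view_points = [
    [0.40, -0.20, 0.45, 2.2, -2.2, 0.0],
    [0.40,  0.00, 0.50, 2.2, -2.2, 0.0],
    [0.40,  0.20, 0.45, 2.2, -2.2, 0.0],
]


def capture_at_view_points(robot, camera, view_points):
    """Drive the robot through all view points and collect RGB-D samples."""
    samples = []
    for pose in view_points:
        robot.move_to_pose(pose)            # assumed controller method, see above
        color, depth = camera.get_frames()  # assumed DepthCam method, see above
        samples.append((pose, color, depth))
    return samples
```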
Reference Point: In our setup we have defined a reference point
All data used in the components of this project can be downloaded. A download link and instructions can be found in the README of each component. The components with data are "background_subtraction", "data_generation", "DenseFusion", "label_generator", "pc_reconstruction", "experiments", and "segmentation".
Acquired Data: If you do not have a hardware setup, you can download the RGB-D data acquired for this project by following the instructions here.
Generate Labels: You can generate your own segmentation labels, point clouds, and target poses with selections 3 and 4 of the Terminal User Interface. Otherwise, you can download our generated labels by following the instructions here.
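The actual label generation lives in the "label_generator" and "background_subtraction" components; as a loose illustration of the background-subtraction idea only (not our implementation, and not the trained model mentioned below), a naive segmentation mask could be computed like this:

```python
# Naive background-subtraction sketch for illustration only; the repository's
# label_generator / background_subtraction components work differently.

import cv2
import numpy as np


def naive_object_mask(image_with_object, background_image, threshold=30):
    """Return a binary mask of pixels that differ from the empty background."""
    diff = cv2.absdiff(image_with_object, background_image)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    # Remove small speckles so the mask roughly covers only the object.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return mask
```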
Extra Data: During data acquisition, we took an extra data sample every 50 mm traveled while the robot was moving. That data has no background image, and a segmentation label cannot be generated via background subtraction. However, we can use a segmentation model, once trained on the other data, to annotate the extra data. The extra data, in turn, can be used together with the other data for pose estimation training. However, the experiments we conducted did not show any improvement when training with extra data. This might be due to poor labels caused by motion blur and a time offset between the captured image and the robot pose. More details can be found in the thesis report connected to this paper (see the Thesis Report section below).
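As an illustration of the 50 mm distance trigger described above (the helper names here are assumptions, not the repository's code), extra samples could be collected roughly like this while the robot moves:

```python
# Illustrative distance trigger: grab an extra RGB-D sample every 50 mm of
# TCP travel while the robot is moving. Helper names are assumptions.

import numpy as np


def collect_extra_samples(robot, camera, step_mm=50.0, is_moving=lambda: True):
    samples = []
    last_xyz = np.asarray(robot.get_tcp_pose()[:3])
    travelled_mm = 0.0
    while is_moving():
        xyz = np.asarray(robot.get_tcp_pose()[:3])
        travelled_mm += np.linalg.norm(xyz - last_xyz) * 1000.0  # meters -> mm
        last_xyz = xyz
        if travelled_mm >= step_mm:
            color, depth = camera.get_frames()
            samples.append((robot.get_tcp_pose(), color, depth))
            travelled_mm = 0.0
    return samples
```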
You can reproduce our experiments with the Terminal User Interface.
If you want to investigate the training of our background subtraction model, you can do that here. You can also download the data used for the training here.
This work is based on the master's thesis by Paul Koch. If you are interested in more details, you can download the full thesis report here.