This code implements the following paper:
Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos
Annie S. Chen, Suraj Nair, Chelsea Finn, 2021.
We are motivated by the goal of generalist robots that can complete a wide range of tasks across many environments. Critical to this is the robot's ability to acquire some metric of task success or reward, which is necessary for reinforcement learning, planning, or knowing when to ask for help. For a general-purpose robot operating in the real world, this reward function must also be able to generalize broadly across environments, tasks, and objects, while depending only on on-board sensor observations (e.g. RGB images). While deep learning on large and diverse datasets has shown promise as a path towards such generalization in computer vision and natural language, collecting high quality datasets of robotic interaction at scale remains an open challenge. In contrast, in-the-wild videos of humans (e.g. YouTube) contain an extensive collection of people doing interesting tasks across a diverse range of settings. In this work, we propose a simple approach, Domain-agnostic Video Discriminator (DVD), that learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task, and can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos. We find that by leveraging diverse human datasets, this reward function (a) can generalize zero shot to unseen environments, (b) generalize zero shot to unseen tasks, and (c) can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
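To make the approach concrete, the following is a minimal sketch of a DVD-style similarity discriminator in PyTorch. It is an illustration only, not the repository's implementation: the encoder, feature sizes, and layer widths are placeholders.
import torch
import torch.nn as nn

class SimilarityDiscriminator(nn.Module):
    # Illustrative DVD-style discriminator: given two encoded video clips,
    # predict whether they show the same task (encoder and sizes are placeholders).
    def __init__(self, video_encoder, feat_dim=512):
        super().__init__()
        self.encoder = video_encoder  # e.g. a pretrained Sth-Sth video encoder
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2),  # logits for: different task / same task
        )

    def forward(self, clip_a, clip_b):
        za, zb = self.encoder(clip_a), self.encoder(clip_b)
        return self.head(torch.cat([za, zb], dim=-1))

def train_step(model, optimizer, clip_a, clip_b, same_task):
    # same_task is a LongTensor of 0/1 labels: 1 if both clips perform the same task.
    logits = model(clip_a, clip_b)
    loss = nn.functional.cross_entropy(logits, same_task)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
At test time, the predicted probability that a robot video and a human demo video show the same task serves as the reward signal.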
Download the Something-Something-V2 dataset using the instructions in the original repo here.
Clone this repository by running:
git clone https://github.com/anniesch/dvd.git
cd dvd
Install Mujoco 2.0 and mujoco-py. Instructions for this are here.
Create and activate conda environment with the required prerequisites:
conda env create -f conda_env.yml
conda activate dvd
Our simulation environment depends on Meta-World. Install it here.
Install the simulation env by running:
cd sim_env
pip install -e .
All default args are listed here.
A pretrained DVD classifier trained on 3 tasks worth of robot demos and 6 tasks worth of human demos is here:
trained_models/dvd_human_tasks_6_robot_tasks_3.pth.tar
The pretrained Sth-Sth video encoder is here:
trained_models/video_encoder/model_best.pth.tar
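Assuming the checkpoints are standard PyTorch .pth.tar files (the key names may differ in the actual training code), a quick way to inspect one is:
import torch

# Assumption: the checkpoint is a standard PyTorch save; the 'state_dict' key
# may be named differently in practice.
ckpt = torch.load("trained_models/dvd_human_tasks_6_robot_tasks_3.pth.tar", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
print(list(state_dict.keys())[:5])  # peek at the first few parameter names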
Sample command for training:
python train.py --num_tasks 6 --traj_length 0 --log_dir "test/" --similarity --batch_size 24 --im_size 120 --seed 1 --lr 0.01 --pretrained --human_data_dir [HUMAN_DATA_DIR] --sim_dir demos/ --human_tasks 5 41 44 46 93 94 --robot_tasks 5 41 93 --add_demos 60
All arg descriptions are located in utils.py. The traj_length arg denotes the length of video clips to train on, with 0 indicating random lengths between 20-40. The pretrained arg indicates using the pretrained Sth-Sth video encoder. The human and robot tasks to train on are given as lists through the human_tasks and robot_tasks args, where the numbers refer to the corresponding tasks in something-something-v2-labels.json. For example, task number 5 corresponds to "Closing something", 41 to "Moving something away from the camera", and 93 to "Moving something from left to right." The add_demos arg indicates training on 60 videos for each of the robot tasks.
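To make the same-task/different-task labels concrete, here is a hypothetical sketch of how training pairs could be drawn from the human and robot clips; the actual sampling logic lives in the training code and may differ.
import random

# Hypothetical sampler: videos_by_task maps a Sth-Sth task id (e.g. 5, 41, 93)
# to a list of clips, drawn from either the human data or the robot demos.
def sample_pair(videos_by_task):
    tasks = list(videos_by_task.keys())
    if random.random() < 0.5:
        # Positive pair: two different clips of the same task.
        t = random.choice(tasks)
        clip_a, clip_b = random.sample(videos_by_task[t], 2)
        label = 1
    else:
        # Negative pair: clips from two different tasks.
        t1, t2 = random.sample(tasks, 2)
        clip_a = random.choice(videos_by_task[t1])
        clip_b = random.choice(videos_by_task[t2])
        label = 0
    return clip_a, clip_b, label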
We test on four different environments, each with a drawer, faucet, mug, and coffee machine. env1 is the original environment, env2 has changed colors, env3 has a changed viewpoint, and env4 has an altered object arrangement.
For planning, we first train a visual dynamics model with Stochastic Variational Video Prediction (SV2P) on 10k episodes of randomly collected data. The code base can be found here.
To run planning with a trained SV2P model, the following is an example command, with the above DVD model:
python sv2p_plan.py --num_epochs 100 --num_tasks 6 --task_num 5 --seed 0 --sample_sz 100 --similarity 1 --num_demos 3 --model_dir pretrained/dvd_human_tasks_6_robot_tasks_3.pth.tar --xml env1 --cem_iters 0 --root ./ --sv2p_root [PATH TO SV2P MODEL]
Description of args: num_epochs is the number of planning trials, num_tasks is the number of total tasks that the DVD model was trained with, task_num denotes the desired task (setting it to 5 designates closing the drawer), num_demos denotes the number of demos to randomly choose from, model_dir is the path to the trained DVD model, xml is the environment (env1 is the original training environment), and cem_iters is the number of CEM iterations to use.
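Conceptually, each planning step samples candidate action sequences, rolls them out with the SV2P video predictor, and ranks the predicted clips by their DVD similarity to the human demo. The sketch below illustrates this loop; sv2p_predict and dvd_similarity are placeholder names, not the repository's API.
import numpy as np

def plan_one_step(sv2p_predict, dvd_similarity, current_frames, demo_clip,
                  sample_sz=100, horizon=20, action_dim=4):
    # Placeholder visual-MPC step: random-shooting over action sequences.
    candidates = np.random.uniform(-1, 1, size=(sample_sz, horizon, action_dim))
    predicted_clips = sv2p_predict(current_frames, candidates)  # predicted future frames per candidate
    scores = np.array([dvd_similarity(clip, demo_clip) for clip in predicted_clips])
    return candidates[int(np.argmax(scores))]
With --cem_iters 0 this reduces to a single round of random shooting; additional CEM iterations would refit the sampling distribution around the top-scoring sequences.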
To evaluate the success rate on the tasks, run python analysis.py --pwd [PATH_TO_PLANNING].
To collect random data in an environment, use python collect_data.py --random --xml env1, where xml denotes the desired environment (env1, env2, env3, or env4).
Demos can also be collected via hard-coding by calling python collect_data.py --xml env1 and altering the desired trajectory of the robot arm through goals in line 97 of the collect_data.py file, as sketched below.
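As a purely illustrative example of a hard-coded trajectory (the real goals variable and controller interface are defined in collect_data.py and may take a different form):
# Hypothetical waypoint list: each entry is a target end-effector position
# the arm is driven toward in sequence (coordinates are made up).
goals = [
    (0.0, 0.6, 0.20),  # move above the drawer handle
    (0.0, 0.5, 0.05),  # lower onto the handle
    (0.0, 0.7, 0.05),  # push the drawer closed
]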
We provide demos in the original simulation environment used for training DVD in demos/. These were collected using MPC with a ground-truth reward and are quite imperfect trajectories, so hard-coding demonstrations is recommended if higher quality is needed.