ODRL is the first benchmark for off-dynamics RL problems, in which there is a limited budget of target domain data while comparatively sufficient source domain data can be accessed, yet dynamics discrepancies exist between the source domain and the target domain. The goal is to achieve better performance in the target domain by leveraging data from both domains.
ODRL provides rich implementations of recent off-dynamics RL methods and also introduces some extra baselines that treat the two domains as one mixed domain. Each algorithm is implemented in a single-file, research-friendly manner, heavily inspired by the cleanrl and CORL libraries. All implemented algorithms share a similar clean, easy-to-follow code style.
ODRL considers four experimental settings for off-dynamics RL, where the source domain and the target domain can each be either online or offline. For example, the Online-Online setting indicates that both the source domain and the target domain are online, while the Online-Offline setting means that the source domain is online and the target domain is offline.
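To make the settings concrete, below is a minimal conceptual sketch of the Online-Online case using plain Gym environments. It is **not** ODRL's actual API; the environment names, interaction budgets, and buffer layout are illustrative assumptions only.

```python
# Conceptual sketch of the Online-Online setting (NOT ODRL's actual API):
# plentiful interaction with a source domain, a small interaction budget in
# the target domain, and both buffers available for training.
import gym

source_env = gym.make("Hopper-v2")  # stand-in for an ODRL source domain
target_env = gym.make("Hopper-v2")  # stand-in for a target domain with shifted dynamics

SOURCE_STEPS, TARGET_STEPS = 10_000, 1_000  # illustrative interaction budgets
source_buffer, target_buffer = [], []

def collect(env, buffer, num_steps):
    """Fill a buffer with random-policy transitions from one domain."""
    obs = env.reset()
    for _ in range(num_steps):
        action = env.action_space.sample()
        next_obs, reward, done, _ = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

collect(source_env, source_buffer, SOURCE_STEPS)
collect(target_env, target_buffer, TARGET_STEPS)
# An off-dynamics RL algorithm would now train on both buffers while accounting
# for the dynamics gap, and would be evaluated in the target domain.
```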
ODRL is related to numerous transfer RL/multi-task RL benchmarks. We include a comparison of ODRL against some commonly used benchmarks below, including D4RL, the DMC suite, Meta-World, RLBench, CARL, Gym-extensions, and Continual World.
Benchmark | Offline datasets | Diverse Domains | Multi-task | Single-task Dynamics Shift |
---|---|---|---|---|
D4RL | ✅ | ✅ | ❎ | ❎ |
DMC suite | ❎ | ✅ | ❎ | ❎ |
Meta-World | ❎ | ❎ | ✅ | ❎ |
RLBench | ✅ | ❎ | ✅ | ❎ |
CARL | ❎ | ✅ | ❎ | ✅ |
Gym-extensions | ❎ | ❎ | ✅ | ✅ |
Continual World | ❎ | ❎ | ✅ | ❎ |
ODRL | ✅ | ✅ | ❎ | ✅ |
Among these benchmarks, D4RL only contains single-domain offline datasets and does not focus on the off-dynamics RL issue. The DMC suite contains a wide range of tasks, but it offers neither offline datasets nor off-dynamics tasks. Meta-World is designed for the multi-task RL setting. RLBench provides demonstrations for numerous tasks, but it does not involve dynamics shift within a single task. CARL focuses on the setting where the context of the environment (e.g., reward, dynamics) can change between episodes; it does not have separate source and target domains, but only one domain whose dynamics or rewards change depending on the context, and it does not provide offline datasets. Continual World is a benchmark for continual learning in RL that also supports multi-task learning and can be used for transferring RL policies. ODRL, instead, focuses on the setting where the agent can leverage source domain data to facilitate policy training in the target domain, while the task in the source domain and the target domain remains identical.
Our benchmark is installation-free, i.e., one does not need to run `pip install -e .`. This design choice is motivated by the fact that users may have multiple local environments that already share numerous packages like `torch`, making it a waste of space to create yet another conda environment for running ODRL. Moreover, the required packages may conflict with existing ones, posing a risk of corrupting the current environment. As a result, we do not offer a `setup.py` file. ODRL relies on some of the most commonly adopted packages, which should be easy to satisfy: `python==3.8.13`, `torch==1.11.0`, `gym==0.18.3`, `dm-control==1.0.8`, `numpy==1.23.5`, `d4rl==1.1`, `mujoco-py==2.1.2.14`.
Nevertheless, we understand that some users may still need a detailed list of dependencies, and hence we also include `requirement.txt` in ODRL. To use it, run the following commands:
```bash
conda create -n offdynamics python=3.8.13 && conda activate offdynamics
pip install setuptools==63.2.0
pip install wheel==0.38.4
pip install -r requirement.txt
```
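As an optional sanity check (an illustrative snippet, not part of ODRL), you can confirm that the core dependencies import with the expected versions:

```python
# Optional sanity check that the core dependencies are importable and report
# the versions listed above (other nearby versions may also work).
import gym
import numpy
import torch

print("torch:", torch.__version__)  # expected 1.11.0
print("gym:", gym.__version__)      # expected 0.18.3
print("numpy:", numpy.__version__)  # expected 1.23.5
```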
We summarize the benchmark overview below. We provide two metrics for evaluating the performance of the agent, the return and the normalized score, where the normalized score is computed as
$NS = \dfrac{J_\pi - J_{\rm random}}{J_{\rm expert} - J_{\rm random}} \times 100,$
where $J_\pi$ is the return of the agent in the target domain, $J_{\rm expert}$ is the return of an expert policy, and $J_{\rm random}$ is the reference score of the random policy. Please check out the corresponding reference scores for the expert policy and the random policy of all tasks in `envs/infos.py`.
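As a quick illustration, the snippet below computes the normalized score from the formula above; the reference returns used here are made-up placeholders, and the real ones are defined in `envs/infos.py`.

```python
# A minimal sketch of the normalized score above; the reference returns below
# are hypothetical placeholders -- the real values live in envs/infos.py.
def normalized_score(policy_return, random_return, expert_return):
    """Map a raw target-domain return onto a roughly 0-100 scale."""
    return (policy_return - random_return) / (expert_return - random_return) * 100.0

# Example with made-up reference scores:
print(normalized_score(policy_return=2500.0, random_return=20.0, expert_return=3200.0))
# -> approximately 78.0
```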
Task Domain | Friction | Gravity | Kinematic | Morphology | Map Layout | Offline Datasets |
---|---|---|---|---|---|---|
Locomotion | ✅ | ✅ | ✅ | ✅ | ❎ | ✅ |
Navigation | ❎ | ❎ | ❎ | ❎ | ✅ | ✅ |
Dexterous Manipulation | ❎ | ❎ | ✅ | ✅ | ❎ | ✅ |
The repository is organized as follows:

- `algo` contains the implemented off-dynamics RL algorithms as well as our introduced baseline methods. These algorithms are categorized by the experimental settings.
- `config` contains the `yaml` configuration files for each algorithm across different domains.
- `envs` contains the various domains and the revised `xml` files of the environments with dynamics shift.
- `dataset` is the folder where the offline target domain datasets are stored (one needs to manually download them from here).
- `imgs` contains the illustration figure of this project.

ODRL contains the following experimental settings: Online-Online, Offline-Online, Online-Offline, and Offline-Offline.
We implement various baseline algorithms for each setting.
Algorithm | Variants Implemented |
---|---|
Online-Online Setting | |
✅ DARC | online_online/darc.py |
✅ VGDF | online_online/vgdf.py |
✅ PAR | online_online/par.py |
✅ SAC | online_online/sac.py |
✅ SAC_IW | online_online/sac_iw.py |
✅ SAC_tune | finetune/sac_tune.py |
Offline-Online Setting | |
✅ H2O | offline_online/h2o.py |
✅ BC_VGDF | offline_online/bc_vgdf.py |
✅ BC_PAR | offline_online/bc_par.py |
✅ BC_SAC | offline_online/bc_sac.py |
✅ CQL_SAC | offline_online/cql_sac.py |
✅ MCQ_SAC | offline_online/mcq_sac.py |
✅ RLPD | offline_online/rlpd.py |
Online-Offline Setting | |
✅ H2O | online_offline/h2o.py |
✅ PAR_BC | online_offline/bc_par.py |
✅ SAC_BC | online_offline/sac_bc.py |
✅ SAC_CQL | online_offline/sac_cql.py |
✅ SAC_MCQ | online_offline/sac_mcq.py |
Offline-Offline Setting | |
✅ IQL | offline_offline/iql.py |
✅ TD3_BC | offline_offline/td3_bc.py |
✅ DARA | offline_offline/dara.py |
✅ BOSA | offline_offline/bosa.py |
It is worth noting that when running SAC_tune, one needs to use `train_tune.py` instead of `train.py`.
We run all four experimental settings with the `train.py` file, where `mode 0` denotes the Online-Online setting, `mode 1` denotes the Offline-Online setting, `mode 2` specifies the Online-Offline setting, and `mode 3` means the Offline-Offline setting. One can switch between settings by specifying the `--mode` flag. The default value is 0, i.e., the Online-Online setting. We give examples of how to use our benchmark below:
```bash
# online-online
CUDA_VISIBLE_DEVICES=0 python train.py --policy DARC --env hopper-kinematic-legjnt --shift_level easy --seed 1 --mode 0 --dir runs

# offline-online
CUDA_VISIBLE_DEVICES=0 python train.py --policy CQL_SAC --env ant-friction --shift_level 0.5 --srctype medium-replay --seed 1 --mode 1 --dir runs

# online-offline
CUDA_VISIBLE_DEVICES=0 python train.py --policy PAR_BC --env ant-morph-alllegs --shift_level hard --tartype expert --seed 1 --mode 2 --dir runs

# offline-offline
CUDA_VISIBLE_DEVICES=0 python train.py --policy BOSA --env walker2d-kinematic-footjnt --shift_level medium --srctype medium --tartype medium --seed 1 --mode 3 --dir runs
```
We explain some key flags below:

- `--env` specifies the name of the target domain; the source domain will be automatically prepared.
- `--shift_level` specifies the shift level for the task.
- `--srctype` specifies the dataset quality of the source domain dataset.
- `--tartype` specifies the dataset quality of the target domain dataset.
- `--params` specifies hyperparameters for the underlying algorithm if one wants to change the default hyperparameters, e.g., `--params '{"actor_lr": 0.003}'` (see the sketch below).
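For intuition, here is an illustrative sketch (not ODRL's exact parsing code) of how a JSON string passed via `--params` can override default hyperparameters; the default values shown are hypothetical.

```python
# Illustrative sketch of how a --params JSON string can override defaults
# (hypothetical default values; not ODRL's exact parsing code).
import json

defaults = {"actor_lr": 3e-4, "critic_lr": 3e-4, "batch_size": 256}  # hypothetical defaults
overrides = json.loads('{"actor_lr": 0.003}')  # the string given to --params
defaults.update(overrides)                     # command-line values take precedence
print(defaults)  # {'actor_lr': 0.003, 'critic_lr': 0.0003, 'batch_size': 256}
```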
We directly adopt offline source domain datasets from the popular D4RL library. Please note that different dynamics shift tasks have varied shift levels. We summarize the shift levels for different tasks below.
Task | Supported Shift Levels |
---|---|
Locomotion friction/gravity | 0.1, 0.5, 2.0, 5.0 |
Locomotion kinematic/morphology | easy, medium, hard |
Antmaze small maze | centerblock, empty, lshape, zshape, reverseu, reversel |
Antmaze medium/large maze | 1, 2, 3, 4, 5, 6 |
Dexterous Manipulation | easy, medium, hard |
Our repository is licensed under the MIT License. The adopted Gym environments and mujoco-py are also licensed under the MIT License. For the D4RL library (including the Antmaze and Adroit domains and the offline datasets), all datasets are licensed under the Creative Commons Attribution 4.0 License (CC BY), and the code is licensed under the Apache 2.0 License.
We plan to support more real-world robotic environments and include implementations of recent off-dynamics RL algorithms.
If you use ODRL in your research, please consider citing our work:

```bibtex
@inproceedings{lyu2024odrlabenchmark,
  title={ODRL: A Benchmark for Off-Dynamics Reinforcement Learning},
  author={Lyu, Jiafei and Xu, Kang and Xu, Jiacheng and Yan, Mengbei and Yang, Jingwen and Zhang, Zongzhang and Bai, Chenjia and Lu, Zongqing and Li, Xiu},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}
```