ODRL is the first benchmark for off-dynamics RL problems, in which there is a limited budget of target domain data while comparatively sufficient source domain data can be accessed, yet dynamics discrepancies exist between the source domain and the target domain. The goal is to achieve better performance in the target domain by leveraging data from both domains.
ODRL provides rich implementations of recent off-dynamics RL methods and also introduces some extra baselines that treat the two domains as one mixed domain. Each algorithm is implemented in a single-file, research-friendly manner, heavily inspired by the cleanrl and CORL libraries. All implemented algorithms share a similar clean, easy-to-follow code style.
ODRL considers four experimental settings for off-dynamics RL, where the source domain and the target domain can each be either online or offline. For example, the Online-Online setting indicates that both the source domain and the target domain are online, while the Online-Offline setting means that the source domain is online and the target domain is offline.
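To make the settings concrete, below is a minimal conceptual sketch of the Online-Online case using plain Gym environments. It is **not** ODRL's actual API; the environment names, interaction budgets, and buffer layout are illustrative assumptions only.

```python
# Conceptual sketch of the Online-Online setting (NOT ODRL's actual API):
# plentiful interaction with a source domain, a small interaction budget in
# the target domain, and both buffers available for training.
import gym

source_env = gym.make("Hopper-v2")  # stand-in for an ODRL source domain
target_env = gym.make("Hopper-v2")  # stand-in for a target domain with shifted dynamics

SOURCE_STEPS, TARGET_STEPS = 10_000, 1_000  # illustrative interaction budgets
source_buffer, target_buffer = [], []

def collect(env, buffer, num_steps):
    """Fill a buffer with random-policy transitions from one domain."""
    obs = env.reset()
    for _ in range(num_steps):
        action = env.action_space.sample()
        next_obs, reward, done, _ = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

collect(source_env, source_buffer, SOURCE_STEPS)
collect(target_env, target_buffer, TARGET_STEPS)
# An off-dynamics RL algorithm would now train on both buffers while accounting
# for the dynamics gap, and would be evaluated in the target domain.
```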
ODRL is related to numerous transfer RL/multi-task RL benchmarks. We include a comparison of ODRL against some commonly used benchmarks below, including D4RL, the DMC suite, Meta-World, RLBench, CARL, Gym-extensions, and Continual World.
Benchmark | Offline datasets | Diverse Domains | Multi-task | Single-task Dynamics Shift |
---|---|---|---|---|
D4RL | ✅ | ✅ | ❎ | ❎ |
DMC suite | ❎ | ✅ | ❎ | ❎ |
Meta-World | ❎ | ❎ | ✅ | ❎ |
RLBench | ✅ | ❎ | ✅ | ❎ |
CARL | ❎ | ✅ | ❎ | ✅ |
Gym-extensions | ❎ | ❎ | ✅ | ✅ |
Continual World | ❎ | ❎ | ✅ | ❎ |
ODRL | ✅ | ✅ | ❎ | ✅ |
Among these benchmarks, D4RL only contains single-domain offline datasets and does not focus on the off-dynamics RL issue. The DMC suite contains a wide range of tasks, but it offers neither offline datasets nor off-dynamics tasks. Meta-World is designed for the multi-task RL setting. RLBench provides demonstrations for numerous tasks, but it does not involve dynamics shift within a single task. CARL focuses on the setting where the context of the environment (e.g., reward, dynamics) can change between episodes; it does not have separate source and target domains, but only one domain whose dynamics or rewards change depending on the context, and it does not provide offline datasets. Continual World is a benchmark for continual learning in RL that also supports multi-task learning and can be used for transferring RL policies. ODRL, instead, focuses on the setting where the agent can leverage source domain data to facilitate policy training in the target domain, while the task in the source domain and the target domain remains identical.
Our benchmark is installation-free, i.e., one does not need to run `pip install -e .`. This design choice is motivated by the fact that users may have multiple local environments that already share numerous packages like `torch`, making it a waste of space to create yet another conda environment for running ODRL. Moreover, the required packages may conflict with existing ones, posing a risk of corrupting the current environment. As a result, we do not offer a `setup.py` file. ODRL relies on some of the most commonly adopted packages, which should be easy to satisfy: `python==3.8.13`, `torch==1.11.0`, `gym==0.18.3`, `dm-control==1.0.8`, `numpy==1.23.5`, `d4rl==1.1`, `mujoco-py==2.1.2.14`.
Nevertheless, we understand that some users may still need a detailed list of dependencies, and hence we also include `requirement.txt` in ODRL. To use it, run the following commands:
```bash
conda create -n offdynamics python=3.8.13 && conda activate offdynamics
pip install setuptools==63.2.0
pip install wheel==0.38.4
pip install -r requirement.txt
```
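As an optional sanity check (an illustrative snippet, not part of ODRL), you can confirm that the core dependencies import with the expected versions:

```python
# Optional sanity check that the core dependencies are importable and report
# the versions listed above (other nearby versions may also work).
import gym
import numpy
import torch

print("torch:", torch.__version__)  # expected 1.11.0
print("gym:", gym.__version__)      # expected 0.18.3
print("numpy:", numpy.__version__)  # expected 1.23.5
```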
We summarize the benchmark overview below. We provide two metrics for evaluating the performance of the agent, the return and the normalized score, where the normalized score is computed as
$NS = \dfrac{J_\pi - J_{\rm random}}{J_{\rm expert} - J_{\rm random}} \times 100,$
where $J_\pi$ is the return of the agent in the target domain, $J_{\rm expert}$ is the return of an expert policy, and $J_{\rm random}$ is the reference score of the random policy. Please check out the corresponding reference scores for the expert policy and the random policy of all tasks in `envs/infos.py`.
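As a quick illustration, the snippet below computes the normalized score from the formula above; the reference returns used here are made-up placeholders, and the real ones are defined in `envs/infos.py`.

```python
# A minimal sketch of the normalized score above; the reference returns below
# are hypothetical placeholders -- the real values live in envs/infos.py.
def normalized_score(policy_return, random_return, expert_return):
    """Map a raw target-domain return onto a roughly 0-100 scale."""
    return (policy_return - random_return) / (expert_return - random_return) * 100.0

# Example with made-up reference scores:
print(normalized_score(policy_return=2500.0, random_return=20.0, expert_return=3200.0))
# -> approximately 78.0
```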
Task Domain | Friction | Gravity | Kinematic | Morphology | Map Layout | Offline Datasets |
---|---|---|---|---|---|---|
Locomotion | ✅ | ✅ | ✅ | ✅ | ❎ | ✅ |
Navigation | ❎ | ❎ | ❎ | ❎ | ✅ | ✅ |
Dexterous Manipulation | ❎ | ❎ | ✅ | ✅ | ❎ | ✅ |
The repository is organized as follows:

- `algo` contains the implemented off-dynamics RL algorithms as well as our introduced baseline methods. These algorithms are categorized by the experimental settings.
- `config` contains the `yaml` configuration files for each algorithm across different domains.
- `envs` contains the various domains and the revised `xml` files of the environments with dynamics shift.
- `dataset` is the folder where the offline target domain datasets are stored (one needs to manually download them from here).
- `imgs` contains the illustration figure of this project.

ODRL contains the following experimental settings: Online-Online, Offline-Online, Online-Offline, and Offline-Offline.
We implement various baseline algorithms for each setting.
Algorithm | Variants Implemented |
---|---|
Online-Online Setting | |
✅ DARC | online_online/darc.py |
✅ VGDF | online_online/vgdf.py |
✅ PAR | online_online/par.py |
✅ SAC | online_online/sac.py |
✅ SAC_IW | online_online/sac_iw.py |
✅ SAC_tune | finetune/sac_tune.py |
Offline-Online Setting | |
✅ H2O | offline_online/h2o.py |
✅ BC_VGDF | offline_online/bc_vgdf.py |
✅ BC_PAR | offline_online/bc_par.py |
✅ BC_SAC | offline_online/bc_sac.py |
✅ CQL_SAC | offline_online/cql_sac.py |
✅ MCQ_SAC | offline_online/mcq_sac.py |
✅ RLPD | offline_online/rlpd.py |
Online-Offline Setting | |
✅ H2O | online_offline/h2o.py |
✅ PAR_BC | online_offline/bc_par.py |
✅ SAC_BC | online_offline/sac_bc.py |
✅ SAC_CQL | online_offline/sac_cql.py |
✅ SAC_MCQ | online_offline/sac_mcq.py |
Offline-Offline Setting | |
✅ IQL | offline_offline/iql.py |
✅ TD3_BC | offline_offline/td3_bc.py |
✅ DARA | offline_offline/dara.py |
✅ BOSA | offline_offline/bosa.py |
It is worth noting that when running SAC_tune, one needs to use `train_tune.py` instead of `train.py`.
We run all four experimental settings with the `train.py` file, where `mode 0` denotes the Online-Online setting, `mode 1` denotes the Offline-Online setting, `mode 2` specifies the Online-Offline setting, and `mode 3` means the Offline-Offline setting. One can switch between settings by specifying the `--mode` flag. The default value is 0, i.e., the Online-Online setting. We give examples of how to use our benchmark below:
```bash
# online-online
CUDA_VISIBLE_DEVICES=0 python train.py --policy DARC --env hopper-kinematic-legjnt --shift_level easy --seed 1 --mode 0 --dir runs

# offline-online
CUDA_VISIBLE_DEVICES=0 python train.py --policy CQL_SAC --env ant-friction --shift_level 0.5 --srctype medium-replay --seed 1 --mode 1 --dir runs

# online-offline
CUDA_VISIBLE_DEVICES=0 python train.py --policy PAR_BC --env ant-morph-alllegs --shift_level hard --tartype expert --seed 1 --mode 2 --dir runs

# offline-offline
CUDA_VISIBLE_DEVICES=0 python train.py --policy BOSA --env walker2d-kinematic-footjnt --shift_level medium --srctype medium --tartype medium --seed 1 --mode 3 --dir runs
```
We explain some key flags below:

- `--env` specifies the name of the target domain; the source domain will be automatically prepared.
- `--shift_level` specifies the shift level for the task.
- `--srctype` specifies the dataset quality of the source domain dataset.
- `--tartype` specifies the dataset quality of the target domain dataset.
- `--params` specifies hyperparameters for the underlying algorithm if one wants to change the default hyperparameters, e.g., `--params '{"actor_lr": 0.003}'` (see the sketch below).
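For intuition, here is an illustrative sketch (not ODRL's exact parsing code) of how a JSON string passed via `--params` can override default hyperparameters; the default values shown are hypothetical.

```python
# Illustrative sketch of how a --params JSON string can override defaults
# (hypothetical default values; not ODRL's exact parsing code).
import json

defaults = {"actor_lr": 3e-4, "critic_lr": 3e-4, "batch_size": 256}  # hypothetical defaults
overrides = json.loads('{"actor_lr": 0.003}')  # the string given to --params
defaults.update(overrides)                     # command-line values take precedence
print(defaults)  # {'actor_lr': 0.003, 'critic_lr': 0.0003, 'batch_size': 256}
```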
We directly adopt offline source domain datasets from the popular D4RL library. Please note that different dynamics shift tasks have varied shift levels. We summarize the shift levels for different tasks below.
Task | Supported Shift Levels |
---|---|
Locomotion friction/gravity | 0.1, 0.5, 2.0, 5.0 |
Locomotion kinematic/morphology | easy, medium, hard |
Antmaze small maze | centerblock, empty, lshape, zshape, reverseu, reversel |
Antmaze medium/large maze | 1, 2, 3, 4, 5, 6 |
Dexterous Manipulation | easy, medium, hard |
Our repository is licensed under the MIT License. The adopted Gym environments and mujoco-py are also licensed under the MIT License. For the D4RL library (including the Antmaze and Adroit domains and the offline datasets), all datasets are licensed under the Creative Commons Attribution 4.0 License (CC BY), and the code is licensed under the Apache 2.0 License.
We plan to support more real-world robotic environments and include implementations of recent off-dynamics RL algorithms.
If you use ODRL in your research, please consider citing our work:

```bibtex
@inproceedings{lyu2024odrlabenchmark,
  title={ODRL: A Benchmark for Off-Dynamics Reinforcement Learning},
  author={Lyu, Jiafei and Xu, Kang and Xu, Jiacheng and Yan, Mengbei and Yang, Jingwen and Zhang, Zongzhang and Bai, Chenjia and Lu, Zongqing and Li, Xiu},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}
```