
ODRL: An Off-dynamics Reinforcement Learning Benchmark



A brief overview of the ODRL benchmark.

ODRL is the first benchmark for off-dynamics RL problems, in which only a limited budget of target-domain data is available while comparatively sufficient source-domain data can be accessed, yet the dynamics of the source domain and the target domain differ. The goal is to achieve better performance in the target domain by leveraging data from both domains.

ODRL provides rich implementations of recent off-dynamics RL methods and also introduces extra baselines that treat the two domains as one mixed domain. Each algorithm is implemented in a single-file, research-friendly manner, heavily inspired by the cleanrl and CORL libraries. All implemented algorithms share a similar clean, easy-to-follow code style.

ODRL considers four experimental settings for off-dynamics RL, in which the source domain and the target domain can each be either online or offline. For example, the Online-Online setting indicates that both the source domain and the target domain are online, while the Online-Offline setting means that the source domain is online while the target domain is offline.

Connections and Comparison against Other Benchmarks

ODRL is related to numerous transfer RL and multi-task RL benchmarks. We compare ODRL against some commonly used benchmarks below, including D4RL, the DMC suite, Meta-World, RLBench, CARL, Gym-extensions, and Continual World.

(Table: for each benchmark, whether it provides offline datasets, diverse domains, multi-task support, single-task support, and dynamics shift; the benchmarks compared are D4RL, the DMC suite, Meta-World, RLBench, CARL, Gym-extensions, Continual World, and ODRL.)

Among these benchmarks, D4RL only contains single-domain offline datasets and does not focus on the off-dynamics RL issue. The DMC suite contains a wide range of tasks, but it offers neither offline datasets nor support for off-dynamics RL. Meta-World is designed for the multi-task RL setting. RLBench provides demonstrations for numerous tasks, but it does not involve dynamics shift within a single task. CARL focuses on the setting where the context of the environment (e.g., reward, dynamics) can change between episodes, i.e., there is no source domain or target domain, but only one domain whose dynamics or rewards can change depending on the context; CARL also does not provide offline datasets. Continual World is a benchmark for continual learning in RL that also supports multi-task learning and can be used for transferring RL policies. ODRL, instead, focuses on the setting where the agent can leverage source-domain data to facilitate policy training in the target domain, while the task in the source domain and the target domain remains identical.

🚀Getting Started

Our benchmark is installation-free, i.e., one does not need to run pip install -e . to set it up. This design choice is motivated by the fact that users often already have multiple local environments that share numerous packages such as torch, so creating yet another conda environment just for ODRL wastes space. Moreover, the provided packages may conflict with existing ones, posing a risk of corrupting the current environment. As a result, we do not offer a setup.py file. ODRL relies on some of the most commonly adopted packages, so its requirements should be easy to satisfy: python==3.8.13, torch==1.11.0, gym==0.18.3, dm-control==1.0.8, numpy==1.23.5, d4rl==1.1, mujoco-py==2.1.2.14.
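As a reference, here is a minimal sketch (our own illustration, not an official ODRL snippet) of how the repository can be used without installation, by launching scripts from the repository root or by adding the cloned repository to sys.path; the clone path below is hypothetical:

# Minimal sketch (assumption): make the cloned, non-installed repository visible to Python.
import sys
from pathlib import Path

REPO_ROOT = Path.home() / "off-dynamics-rl"   # hypothetical clone location; adjust to your path
sys.path.insert(0, str(REPO_ROOT))
# Repository modules (e.g., envs/infos.py referenced below) can then be imported,
# provided the corresponding directories are importable as Python packages.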

Nevertheless, we understand that some users may still need a detailed list of dependencies, and hence we also include a requirement.txt file in ODRL. To use it, run the following commands:

conda create -n offdynamics python=3.8.13 && conda activate offdynamics
pip install setuptools==63.2.0
pip install wheel==0.38.4
pip install -r requirement.txt
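After installing, a quick sanity check (a sketch of ours, not an official ODRL script) is to confirm that the key pinned dependencies import cleanly in the new environment:

# Sanity check for the pinned dependencies listed above; each import fails loudly if a package is missing.
import torch
import gym
import numpy
import dm_control   # installed as dm-control
import mujoco_py    # installed as mujoco-py
import d4rl

print("torch:", torch.__version__)
print("gym:", gym.__version__)
print("numpy:", numpy.__version__)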

We summarize the benchmark overview below. We provide two metrics for evaluating the performance of the agent, the return and the normalized score, where the latter is given by

$NS = \dfrac{J_\pi - J_{\rm random}}{J_{\rm expert} - J_{\rm random}} \times 100,$

where $J_\pi$ is the return of the agent in the target domain, $J_{\rm expert}$ is the return of an expert policy, and $J_{\rm random}$ is the reference score of the random policy. Please check out the corresponding reference scores for the expert policy and the random policy of all tasks in envs/infos.py.
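For illustration, the normalized score can be computed as follows (a minimal sketch; the returns below are hypothetical numbers, and the actual expert/random reference scores should be taken from envs/infos.py):

# Normalized score NS = (J_pi - J_random) / (J_expert - J_random) * 100.
def normalized_score(agent_return: float, random_return: float, expert_return: float) -> float:
    return (agent_return - random_return) / (expert_return - random_return) * 100.0

# Hypothetical numbers, for illustration only:
print(normalized_score(agent_return=2500.0, random_return=20.0, expert_return=3200.0))  # ~78.0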

(Table: benchmark overview of the Locomotion, Navigation, and Dexterous Manipulation task domains, covering the supported dynamics-shift types, namely friction, gravity, kinematic, morphology, and map layout, and the availability of offline datasets.)

🚀🚀Code Structure

🚀🚀Experimental Settings and Implemented Algorithms

ODRL contains the following four experimental settings: Online-Online, Offline-Online, Online-Offline, and Offline-Offline.

We implement various baseline algorithms for each setting.

| Algorithm | Variants Implemented |
|-----------|----------------------|
| **Online-Online Setting** | |
| DARC | online_online/darc.py |
| VGDF | online_online/vgdf.py |
| PAR | online_online/par.py |
| SAC | online_online/sac.py |
| ✅ SAC_IW | online_online/sac_iw.py |
| ✅ SAC_tune | finetune/sac_tune.py |
| **Offline-Online Setting** | |
| H2O | offline_online/h2o.py |
| BC_VGDF | offline_online/bc_vgdf.py |
| BC_PAR | offline_online/bc_par.py |
| ✅ BC_SAC | offline_online/bc_sac.py |
| CQL_SAC | offline_online/cql_sac.py |
| MCQ_SAC | offline_online/mcq_sac.py |
| RLPD | offline_online/rlpd.py |
| **Online-Offline Setting** | |
| H2O | online_offline/h2o.py |
| PAR_BC | online_offline/bc_par.py |
| ✅ SAC_BC | online_offline/sac_bc.py |
| SAC_CQL | online_offline/sac_cql.py |
| SAC_MCQ | online_offline/sac_mcq.py |
| **Offline-Offline Setting** | |
| IQL | offline_offline/iql.py |
| TD3_BC | offline_offline/td3_bc.py |
| DARA | offline_offline/dara.py |
| BOSA | offline_offline/bosa.py |

It is worth noting that when running SAC_tune, one needs to use train_tune.py instead of train.py.

🚀🚀🚀How to Run

We run all four experimental settings with the train.py file, where mode 0 denotes the Online-Online setting, mode 1 denotes the Offline-Online setting, mode 2 denotes the Online-Offline setting, and mode 3 denotes the Offline-Offline setting. One can switch between settings by specifying the --mode flag; the default value is 0, i.e., the Online-Online setting. We give examples of how to use our benchmark below:

# online-online
CUDA_VISIBLE_DEVICES=0 python train.py --policy DARC --env hopper-kinematic-legjnt --shift_level easy --seed 1 --mode 0 --dir runs
# offline-online
CUDA_VISIBLE_DEVICES=0 python train.py --policy CQL_SAC --env ant-friction --shift_level 0.5 --srctype medium-replay --seed 1 --mode 1 --dir runs
# online-offline
CUDA_VISIBLE_DEVICES=0 python train.py --policy PAR_BC --env ant-morph-alllegs --shift_level hard --tartype expert --seed 1 --mode 2 --dir runs
# offline-offline
CUDA_VISIBLE_DEVICES=0 python train.py --policy BOSA --env walker2d-kinematic-footjnt --shift_level medium --srctype medium --tartype medium --seed 1 --mode 3 --dir runs

We explain some key flags below:

- --policy: the algorithm to run (see the table of implemented algorithms above)
- --env: the task name, e.g., hopper-kinematic-legjnt or ant-friction
- --shift_level: the level of dynamics shift for the chosen task (see the table below)
- --srctype: the type of the source-domain offline dataset (e.g., medium, medium-replay), used when the source domain is offline
- --tartype: the type of the target-domain offline dataset (e.g., expert, medium), used when the target domain is offline
- --mode: the experimental setting, as described above
- --seed: the random seed
- --dir: the output directory (runs in the examples above)

We directly adopt the offline source-domain datasets from the popular D4RL library. Please note that different dynamics-shift tasks support different shift levels, which we summarize below.

| Task | Supported Shift Levels |
|------|------------------------|
| Locomotion (friction/gravity) | 0.1, 0.5, 2.0, 5.0 |
| Locomotion (kinematic/morphology) | easy, medium, hard |
| Antmaze (small maze) | centerblock, empty, lshape, zshape, reverseu, reversel |
| Antmaze (medium/large maze) | 1, 2, 3, 4, 5, 6 |
| Dexterous Manipulation | easy, medium, hard |

Licenses

Our repository is licensed under the MIT License. The adopted Gym environments and mujoco-py are also licensed under the MIT License. For the D4RL library (including the Antmaze domain, the Adroit domain, and the offline datasets), all datasets are licensed under the Creative Commons Attribution 4.0 License (CC BY), while the code is licensed under the Apache 2.0 License.

TODO

We plan to support more real-world robotic environments and include implementations of recent off-dynamics RL algorithms.

📄Citing ODRL

If you use ODRL in your research, please consider citing our work:

@inproceedings{lyu2024odrlabenchmark,
 title={ODRL: A Benchmark for Off-Dynamics Reinforcement Learning},
 author={Lyu, Jiafei and Xu, Kang and Xu, Jiacheng and Yan, Mengbei and Yang, Jingwen and Zhang, Zongzhang and Bai, Chenjia and Lu, Zongqing and Li, Xiu},
 booktitle={Advances in Neural Information Processing Systems},
 year={2024}
}