If you find Safe Policy Optimization useful, please cite it in your publications.
```bibtex
@article{ji2023safety,
  title={Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark},
  author={Ji, Jiaming and Zhang, Borong and Zhou, Jiayi and Pan, Xuehai and Huang, Weidong and Sun, Ruiyang and Geng, Yiran and Zhong, Yifan and Dai, Juntao and Yang, Yaodong},
  journal={arXiv preprint arXiv:2310.12567},
  year={2023}
}
```
Safe Policy Optimization (SafePO) is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL). It provides the RL research community with a unified platform for developing and evaluating algorithms in various safe reinforcement learning environments. To better help the community study this problem, SafePO is developed with the following key features:
**Correctness.** For a benchmark, it is critical to ensure correctness and reliability. To achieve this goal, we examine the implementation of SafePO carefully. First, each algorithm is implemented strictly according to the original paper (e.g., ensuring consistency with the gradient flow described there). Second, for algorithms with a commonly acknowledged open-source code base, we compare our implementation with it line by line to double-check correctness. Finally, we compare SafePO with existing benchmarks (e.g., Safety-Starter-Agents and RL-Safety-Algorithms) and find that it outperforms these existing implementations.
**Extensibility.** SafePO enjoys high extensibility thanks to its architecture. New algorithms can be integrated into SafePO by inheriting from the base algorithms and implementing only their unique features. For example, we integrate PPO by inheriting from Policy Gradient, adding only the clip-ratio variable and overriding the function that computes the policy loss. In the same way, further algorithms can easily be added to SafePO.
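To illustrate this inheritance pattern, the sketch below shows how a PPO-style algorithm might extend a policy-gradient base class by adding a clip ratio and overriding only the policy-loss computation. The class and method names (`PolicyGradient`, `compute_loss_pi`) are illustrative placeholders, not SafePO's actual API.

```python
# Minimal sketch of the inheritance-based extension pattern described above.
# `PolicyGradient` and `compute_loss_pi` are hypothetical names, not SafePO's real classes.
import torch


class PolicyGradient:
    def compute_loss_pi(self, logp, logp_old, adv):
        # Vanilla policy-gradient surrogate loss.
        ratio = torch.exp(logp - logp_old)
        return -(ratio * adv).mean()


class PPO(PolicyGradient):
    def __init__(self, clip_ratio: float = 0.2):
        # The only new attribute PPO needs on top of the base algorithm.
        self.clip_ratio = clip_ratio

    def compute_loss_pi(self, logp, logp_old, adv):
        # Override only the policy loss: clip the importance ratio as in PPO.
        ratio = torch.exp(logp - logp_old)
        clipped = torch.clamp(ratio, 1.0 - self.clip_ratio, 1.0 + self.clip_ratio)
        return -torch.min(ratio * adv, clipped * adv).mean()
```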
**Logging and Visualization.** Another important functionality of SafePO is logging and visualization. Supporting both TensorBoard and WandB, we provide visualizations of more than 40 parameters and intermediate computation results for inspecting the training process. Common parameters and metrics such as the KL-divergence, SPS (steps per second), and the variance of cost are visualized universally. During training, users can inspect the changes of every parameter, collect the log files, and obtain saved checkpoint models. The complete and comprehensive visualization allows easier observation, model selection, and comparison.
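The sketch below shows how metrics like those named above could be written to both TensorBoard and WandB in one training loop. The metric keys, values, and project name are illustrative placeholders, not SafePO's exact logging schema.

```python
# Minimal sketch of dual TensorBoard/WandB logging; metric keys and values are
# illustrative placeholders, not SafePO's exact logging schema.
import wandb
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./runs/example")
wandb.init(project="safepo-example", name="ppo_lag_demo")

for step in range(3):
    metrics = {
        "Train/KL": 0.01,               # KL-divergence between old and new policy
        "Train/SPS": 1200,              # environment steps per second
        "Metrics/EpCostVariance": 0.5,  # variance of episode cost
    }
    for key, value in metrics.items():
        writer.add_scalar(key, value, global_step=step)
    wandb.log(metrics, step=step)

writer.close()
wandb.finish()
```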
**Documentation.** In addition to its code implementation, SafePO comes with extensive documentation. We include detailed guidance on installation and propose solutions to common issues. Moreover, we provide instructions on simple usage and advanced customization of SafePO. Official information concerning maintenance and ethical and responsible use is stated clearly for reference.
Here we provide a table of Safe RL algorithms that the benchmark includes.
Note: four more classic RL algorithms are also included in the benchmark, namely PG, NaturalPG, TRPO, and PPO.
| Algorithm | Proceedings & Cites | Official Code Repo | Official Code Last Update | Official GitHub Stars |
|---|---|---|---|---|
| PPO-Lag | :x: | TensorFlow 1 | | |
| TRPO-Lag | :x: | TensorFlow 1 | | |
| CUP | NeurIPS 2022 (Cite: 6) | PyTorch | | |
| FOCOPS | NeurIPS 2020 (Cite: 27) | PyTorch | | |
| CPO | ICML 2017 (Cite: 663) | :x: | :x: | :x: |
| PCPO | ICLR 2020 (Cite: 67) | Theano | :x: | :x: |
| RCPO | ICLR 2019 (Cite: 238) | :x: | :x: | :x: |
| CPPO-PID | NeurIPS 2020 (Cite: 71) | PyTorch | | |
| MACPO | Preprint (Cite: 4) | PyTorch | | |
| MAPPO-Lag | Preprint (Cite: 4) | PyTorch | | |
| HAPPO (purely reward optimisation) | ICLR 2022 (Cite: 10) | PyTorch | | |
| MAPPO (purely reward optimisation) | Preprint (Cite: 98) | PyTorch | | |
For more details, please refer to Safety-Gymnasium.
| Category | Task | Agent | Example |
|---|---|---|---|
| Safe Navigation | Goal[012] | Point, Car, Doggo, Racecar, Ant | SafetyPointGoal1-v0 |
| | Button[012] | | |
| | Push[012] | | |
| | Circle[012] | | |
| Safe Velocity | Velocity | HalfCheetah, Hopper, Swimmer, Walker2d, Ant, Humanoid | SafetyAntVelocity-v1 |
| Safe Multi-Agent | MultiGoal[012] | Multi-Point, Multi-Ant | SafetyAntMultiGoal1-v0 |
| | Multi-Agent Velocity | 6x1HalfCheetah, 2x3HalfCheetah, 3x1Hopper, 2x1Swimmer, 2x3Walker2d, 2x4Ant, 4x2Ant, 9\|8Humanoid | Safety2x4AntVelocity-v0 |
| Safe Isaac Gym | FreightFrankaCloseDrawer | FreightFranka | FreightFrankaCloseDrawer |
| | FreightFrankaPickAndPlace | | |
| | ShadowHandCatchOver2Underarm_Safe_finger | ShadowHands | ShadowHandCatchOver2Underarm_Safe_finger |
| | ShadowHandCatchOver2Underarm_Safe_joint | | |
| | ShadowHandOver_Safe_finger | | |
| | ShadowHandOver_Safe_joint | | |
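As a quick illustration of how the Safety-Gymnasium tasks in the table above are used, the minimal sketch below creates one of the example environments and steps it with random actions, assuming the standard Safety-Gymnasium interface in which `step` returns a separate cost signal alongside the reward.

```python
# Minimal sketch of interacting with a Safety-Gymnasium task from the table above.
# Assumes the safety-gymnasium package is installed and that step() returns a
# separate cost signal alongside the reward (the constraint that safe RL bounds).
import safety_gymnasium

env = safety_gymnasium.make("SafetyPointGoal1-v0")
obs, info = env.reset(seed=0)

episode_reward, episode_cost = 0.0, 0.0
for _ in range(1000):
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, cost, terminated, truncated, info = env.step(action)
    episode_reward += reward
    episode_cost += cost
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```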
Note: to verify your Isaac Gym installation, you can run the examples in the `python/examples` directory, like `joint_monkey.py`.
```bash
conda create -n safepo python=3.8
conda activate safepo
wget https://github.com/PKU-Alignment/safety-gymnasium/archive/refs/heads/main.zip
unzip main.zip
cd safety-gymnasium-main
pip install -e .
```
| Base Environments | Description |
|---|---|
| ShadowHandOver | These environments involve two fixed-position hands. The hand that starts with the object must find a way to hand it over to the second hand. |
| ShadowHandCatchOver2Underarm | This environment is made up of half ShadowHandCatchUnderarm and half ShadowHandCatchOverarm; the object needs to be thrown from the vertical hand to the palm-up hand. |
We implement several different constraints on top of the base environments, including `Safe finger` and `Safe joint`. For more details, please refer to Safety-Gymnasium.
To use SafePO-Baselines, you need to install the environments. Please refer to Safety-Gymnasium for more details on installation. Details regarding the installation of Isaac Gym can be found here.
```bash
conda create -n safepo python=3.8
conda activate safepo
# Because of the CUDA version, we recommend installing PyTorch manually.
pip install -e .
```
To verify the performance of SafePO, you can run the following:
```bash
conda create -n safepo python=3.8
conda activate safepo
make benchmark
```
We also support simple benchmark commands for single-agent and multi-agent algorithms:
```bash
conda create -n safepo python=3.8
conda activate safepo
make simple-benchmark
```
The above commands will run all algorithms in sampled environments to get a quick overview of the performance of the algorithms.
Please note that these commands will reinstall Safety-Gymnasium from PyPI. To run Safe Isaac Gym and Safe MultiGoal tasks, please reinstall it manually from source:
```bash
conda activate safepo
wget https://github.com/PKU-Alignment/safety-gymnasium/archive/refs/heads/main.zip
unzip main.zip
cd safety-gymnasium-main
pip install -e .
```
Each algorithm file is the entry point: running `ALGO.py` with arguments specifying the algorithm and environment starts training. For example, to run PPO-Lag in SafetyPointGoal1-v0 with seed 0, you can use the following command:
```bash
cd safepo/single_agent
python ppo_lag.py --task SafetyPointGoal1-v0 --seed 0
```
To run a benchmark in parallel, for example `PPO-Lag` and `TRPO-Lag` in `SafetyAntVelocity-v1` and `SafetyHalfCheetahVelocity-v1`, you can use the following commands:
```bash
cd safepo/single_agent
python benchmark.py --tasks SafetyAntVelocity-v1 SafetyHalfCheetahVelocity-v1 --algo ppo_lag trpo_lag --workers 2
```
The commands above will run two processes in parallel; each process runs one algorithm in one environment. The results will be saved in `./runs/`.
We also provide a safe MARL algorithm benchmark on the challenging Safety-Gymnasium Safe Multi-Agent Velocity, Safe Isaac Gym, and Safe MultiGoal tasks. HAPPO, MACPO, MAPPO-Lag, and MAPPO have already been implemented.
To train a multi-agent algorithm:
```bash
cd safepo/multi_agent
python macpo.py --task Safety2x4AntVelocity-v0 --experiment benchmark
```
You can also train on Isaac Gym based environments if you have installed Isaac Gym:
```bash
cd safepo/multi_agent
python macpo.py --task ShadowHandOver_Safe_joint --experiment benchmark
```
After running the experiment, you can use the following command to plot the results:
```bash
cd safepo
python plot.py --logdir ./runs/benchmark
```
To evaluate the performance of the algorithm, you can use the following command:
```bash
cd safepo
python evaluate.py --benchmark-dir ./runs/benchmark
```
We test all algorithms and experiments on a machine with an AMD Ryzen Threadripper PRO 3975WX 32-core CPU and an NVIDIA GeForce RTX 3090 GPU (driver version 495.44). All of our experiments are run on Linux. If you encounter any problem on macOS or Windows, please feel free to open an issue.
SafePO aims to benefit safe RL community research, and is released under the Apache-2.0 license. Illegal usage or any violation of the license is not allowed.
SafePO-Baselines is a project contributed by PKU-Alignment at Peking University. We also thank the contributors of the following open-source repositories: Spinning Up, Bullet-Safety-Gym, and Safety-Gym.