PKU-Alignment / Safe-Policy-Optimization

NeurIPS 2023: Safe Policy Optimization: A benchmark repository for safe reinforcement learning algorithms
https://safe-policy-optimization.readthedocs.io/en/latest/index.html
Apache License 2.0
[![Organization](https://img.shields.io/badge/Organization-PKU--Alignment-blue)](https://github.com/PKU-Alignment) [![License](https://img.shields.io/github/license/PKU-Alignment/Safe-Policy-Optimization?label=license)](#license) [![codecov](https://codecov.io/gh/PKU-Alignment/Safe-Policy-Optimization/graph/badge.svg?token=KF0UM0UNXW)](https://codecov.io/gh/PKU-Alignment/Safe-Policy-Optimization) [![Documentation Status](https://readthedocs.org/projects/safe-policy-optimization/badge/?version=latest)](https://safe-policy-optimization.readthedocs.io/en/latest/?badge=latest)

Citing Safe Policy Optimization

If you find Safe Policy Optimization useful, please cite it in your publications.

@article{ji2023safety,
  title={Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark},
  author={Ji, Jiaming and Zhang, Borong and Zhou, Jiayi and Pan, Xuehai and Huang, Weidong and Sun, Ruiyang and Geng, Yiran and Zhong, Yifan and Dai, Juntao and Yang, Yaodong},
  journal={arXiv preprint arXiv:2310.12567},
  year={2023}
}

What's New:

Safe Policy Optimization (SafePO) is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL). It provides the RL research community with a unified platform for processing and evaluating algorithms in various safe reinforcement learning environments. To better help the community study this problem, SafePO is developed with the following key features:

Correctness. For a benchmark, it is critical to ensure correctness and reliability. To achieve this goal, we examine the implementation of SafePO carefully. Firstly, each algorithm is implemented strictly according to the original paper (e.g., ensuring consistency with the gradient flow of the original paper). Secondly, for algorithms with a commonly acknowledged open-source code base, we compare our implementation with that code base line by line to double-check correctness. Finally, we compare SafePO with existing benchmarks (e.g., Safety-Starter-Agents and RL-Safety-Algorithms) and find that it outperforms other existing implementations.

Extensibility. SafePO is highly extensible thanks to its architecture. New algorithms can be integrated into SafePO by inheriting from base algorithms and implementing only their unique features. For example, we integrate PPO by inheriting from policy gradient, adding only the clip ratio variable and rewriting the function that computes the policy loss. Other algorithms can be added to SafePO in a similar way.
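The inheritance pattern described above can be sketched as follows. This is a minimal illustration, not SafePO's actual code: the class names `PolicyGradient` and `PPO`, the method `policy_loss`, and the use of plain Python lists in place of tensors are all assumptions made for the example.

```python
class PolicyGradient:
    """Hypothetical base algorithm with a plain surrogate policy loss."""

    def policy_loss(self, ratios, advantages):
        # Surrogate objective: mean of ratio * advantage, negated to form a loss.
        terms = [r * a for r, a in zip(ratios, advantages)]
        return -sum(terms) / len(terms)


class PPO(PolicyGradient):
    """Hypothetical subclass: adds a clip ratio and overrides only the loss."""

    def __init__(self, clip_ratio=0.2):
        self.clip_ratio = clip_ratio

    def policy_loss(self, ratios, advantages):
        lo, hi = 1.0 - self.clip_ratio, 1.0 + self.clip_ratio
        terms = []
        for r, a in zip(ratios, advantages):
            clipped = min(max(r, lo), hi)
            # PPO-clip keeps the more pessimistic of the two surrogate terms.
            terms.append(min(r * a, clipped * a))
        return -sum(terms) / len(terms)


if __name__ == "__main__":
    ratios, advantages = [1.5, 0.5], [1.0, -1.0]
    print(PolicyGradient().policy_loss(ratios, advantages))
    print(PPO(clip_ratio=0.2).policy_loss(ratios, advantages))
```

Everything else (rollout collection, optimizer steps, logging) would stay in the base class, so the subclass carries only what is unique to the new algorithm.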

Logging and Visualization. Another important functionality of SafePO is logging and visualization. Supporting both TensorBoard and WandB, we offer code for visualizing more than 40 parameters and intermediate computation results, so that the training process can be inspected. Common parameters and metrics such as KL-divergence, SPS (steps per second), and variance of cost are visualized universally. During training, users can inspect the changes of every parameter, collect the log file, and obtain saved checkpoint models. The complete and comprehensive visualization allows easier observation, model selection, and comparison.
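As a rough illustration of two of the metrics named above, the sketch below computes SPS and the variance of cost from raw training counters. The function name `training_metrics`, the dictionary keys, and the choice of population variance are assumptions for this example; SafePO's logger may compute and name these quantities differently.

```python
import statistics
import time


def training_metrics(total_steps, start_time, episode_costs, now=None):
    """Compute SPS (steps per second) and the variance of episode costs."""
    now = time.time() if now is None else now
    elapsed = max(now - start_time, 1e-8)  # guard against division by zero
    return {
        "SPS": total_steps / elapsed,
        # Population variance is an assumption; a sample variance is also plausible.
        "cost_variance": statistics.pvariance(episode_costs),
    }


if __name__ == "__main__":
    print(training_metrics(1000, start_time=0.0, episode_costs=[1.0, 3.0], now=10.0))
```

A dictionary like this could then be passed to whichever backend is in use, one scalar per key and training step.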

Documentation. In addition to its code implementation, SafePO comes with extensive documentation. We include detailed guidance on installation and propose solutions to common issues. Moreover, we provide instructions on simple usage and advanced customization of SafePO. Official information concerning maintenance and ethical and responsible use is stated clearly for reference.

Overview of Algorithms

Here we provide a table of Safe RL algorithms that the benchmark includes.

Note: Four classic RL algorithms are also included in the benchmark, namely PG, NaturalPG, TRPO, and PPO.

| Algorithm | Proceedings & Cites | Official Code Repo | Official Code Last Update | Official GitHub Stars |
|---|---|---|---|---|
| PPO-Lag | :x: | TensorFlow 1 | GitHub last commit | GitHub stars |
| TRPO-Lag | :x: | TensorFlow 1 | GitHub last commit | GitHub stars |
| CUP | NeurIPS 2022 (Cite: 6) | PyTorch | GitHub last commit | GitHub stars |
| FOCOPS | NeurIPS 2020 (Cite: 27) | PyTorch | GitHub last commit | GitHub stars |
| CPO | ICML 2017 (Cite: 663) | :x: | :x: | :x: |
| PCPO | ICLR 2020 (Cite: 67) | Theano | :x: | :x: |
| RCPO | ICLR 2019 (Cite: 238) | :x: | :x: | :x: |
| CPPO-PID | NeurIPS 2020 (Cite: 71) | PyTorch | GitHub last commit | GitHub stars |
| MACPO | Preprint (Cite: 4) | PyTorch | GitHub last commit | GitHub stars |
| MAPPO-Lag | Preprint (Cite: 4) | PyTorch | GitHub last commit | GitHub stars |
| HAPPO (Purely reward optimisation) | ICLR 2022 (Cite: 10) | PyTorch | GitHub last commit | GitHub stars |
| MAPPO (Purely reward optimisation) | Preprint (Cite: 98) | PyTorch | GitHub last commit | GitHub stars |

Supported Environments: Safety-Gymnasium

For more details, please refer to Safety-Gymnasium.

| Category | Task | Agent | Example |
|---|---|---|---|
| Safe Navigation | Goal[012], Button[012], Push[012], Circle[012] | Point, Car, Doggo, Racecar, Ant | SafetyPointGoal1-v0 |
| Safe Velocity | Velocity | HalfCheetah, Hopper, Swimmer, Walker2d, Ant, Humanoid | SafetyAntVelocity-v1 |
| Safe Multi-Agent | MultiGoal[012] | Multi-Point, Multi-Ant | SafetyAntMultiGoal1-v0 |
| Safe Multi-Agent | Multi-Agent Velocity | 6x1HalfCheetah, 2x3HalfCheetah, 3x1Hopper, 2x1Swimmer, 2x3Walker2d, 2x4Ant, 4x2Ant, 9\|8Humanoid | Safety2x4AntVelocity-v0 |
| Safe Isaac Gym | FreightFrankaCloseDrawer, FreightFrankaPickAndPlace | FreightFranka | FreightFrankaCloseDrawer |
| Safe Isaac Gym | ShadowHandCatchOver2Underarm_Safe_finger, ShadowHandCatchOver2Underarm_Safe_joint, ShadowHandOver_Safe_finger, ShadowHandOver_Safe_joint | ShadowHands | ShadowHandCatchOver2Underarm_Safe_finger |
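
The Safe Navigation example ID, SafetyPointGoal1-v0, suggests a naming pattern of `Safety` + agent + task + level + `-v0`. The sketch below enumerates IDs under that pattern. It is an assumption that every agent, task, and level combination is registered this way, and the helper `navigation_env_ids` is hypothetical, so check the Safety-Gymnasium documentation before relying on a specific ID.

```python
from itertools import product

# Agents, tasks, and levels as listed for Safe Navigation in the table above.
NAV_AGENTS = ["Point", "Car", "Doggo", "Racecar", "Ant"]
NAV_TASKS = ["Goal", "Button", "Push", "Circle"]
NAV_LEVELS = [0, 1, 2]


def navigation_env_ids(agents=NAV_AGENTS, tasks=NAV_TASKS, levels=NAV_LEVELS):
    """Build candidate IDs that follow the pattern of SafetyPointGoal1-v0.

    Assumption: all combinations share this pattern and the -v0 suffix.
    """
    return [
        f"Safety{agent}{task}{level}-v0"
        for agent, task, level in product(agents, tasks, levels)
    ]


if __name__ == "__main__":
    ids = navigation_env_ids()
    print(len(ids), ids[:3])
```

A list like this is convenient for sweeping an algorithm over many tasks, provided each ID is verified against the installed environment registry first.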

Note: the following commands install Safety-Gymnasium from source:

conda create -n safepo python=3.8
conda activate safepo
wget https://github.com/PKU-Alignment/safety-gymnasium/archive/refs/heads/main.zip
unzip main.zip
cd safety-gymnasium-main
pip install -e .

Selected Tasks

| Base Environments | Description |
|---|---|
| ShadowHandOver | These environments involve two fixed-position hands. The hand that starts with the object must find a way to hand it over to the second hand. |
| ShadowHandCatchOver2Underarm | This environment is made up of half ShadowHandCatchUnderarm and half ShadowHandCatchOverarm. The object needs to be thrown from the vertical hand to the palm-up hand. |

We add different constraints to the base environments, including Safe finger and Safe joint. For more details, please refer to Safety-Gymnasium.

Pre-requisites

To use SafePO-Baselines, you first need to install the environments. Please refer to Safety-Gymnasium for more details on installation. Details regarding the installation of IsaacGym can be found here.

Conda-Environment

conda create -n safepo python=3.8
conda activate safepo
# Because CUDA versions differ, we recommend installing PyTorch manually.
pip install -e .

Getting Started

Efficient Commands

To verify the performance of SafePO, you can run the following:

conda create -n safepo python=3.8
conda activate safepo
make benchmark

We also support simple benchmark commands for single-agent and multi-agent algorithms:

conda create -n safepo python=3.8
conda activate safepo
make simple-benchmark

The above commands will run all algorithms in sampled environments to get a quick overview of the performance of the algorithms.

Please note that these commands reinstall Safety-Gymnasium from PyPI. To run Safe Isaac Gym and Safe MultiGoal, please reinstall it manually from source:

conda activate safepo
wget https://github.com/PKU-Alignment/safety-gymnasium/archive/refs/heads/main.zip
unzip main.zip
cd safety-gymnasium-main
pip install -e .

Single-Agent

Each algorithm file is an entry point. Running ALGO.py with arguments that specify the algorithm settings and the environment starts training. For example, to run PPO-Lag in SafetyPointGoal1-v0 with seed 0, use the following command:

cd safepo/single_agent
python ppo_lag.py --task SafetyPointGoal1-v0 --seed 0
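
To repeat the run above over several seeds, a small launcher can be sketched as follows. It uses only the `--task` and `--seed` flags shown in the command. The helpers `build_command` and `run_seeds` are hypothetical and not part of SafePO, and the default `cwd` assumes the script is started from the repository root.

```python
import subprocess
import sys


def build_command(algo, task, seed):
    """Return the argv list for one training run, e.g. of ppo_lag.py."""
    return [sys.executable, f"{algo}.py", "--task", task, "--seed", str(seed)]


def run_seeds(algo, task, seeds, cwd="safepo/single_agent", dry_run=True):
    """Run one algorithm on one task for several seeds, one after another."""
    commands = [build_command(algo, task, seed) for seed in seeds]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a run exits with an error.
            subprocess.run(cmd, cwd=cwd, check=True)
    return commands


if __name__ == "__main__":
    run_seeds("ppo_lag", "SafetyPointGoal1-v0", seeds=[0, 1, 2])
```

With `dry_run=True` the commands are only printed, which makes it easy to check them before starting long training runs.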

To run a benchmark in parallel, for example PPO-Lag and TRPO-Lag in SafetyAntVelocity-v1 and SafetyHalfCheetahVelocity-v1, use the following commands:

cd safepo/single_agent
python benchmark.py --tasks SafetyAntVelocity-v1 SafetyHalfCheetahVelocity-v1 --algo ppo_lag trpo_lag --workers 2

The commands above run two processes in parallel, and each process runs one algorithm in one environment. The results are saved in ./runs/.
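
One way to picture the parallel benchmark is as a grid of (algorithm, task) jobs drained by a fixed number of workers. The sketch below shows that idea only and is not the actual benchmark.py: `expand_jobs`, `run_jobs`, and the `run_one` callable are hypothetical, and the real script may schedule work differently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_jobs(tasks, algos, seeds=(0,)):
    """Enumerate every (algo, task, seed) combination as one job."""
    return [(algo, task, seed) for task, algo, seed in product(tasks, algos, seeds)]


def run_jobs(jobs, workers, run_one):
    """Run jobs with at most `workers` in flight; results keep job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, jobs))


if __name__ == "__main__":
    jobs = expand_jobs(
        tasks=["SafetyAntVelocity-v1", "SafetyHalfCheetahVelocity-v1"],
        algos=["ppo_lag", "trpo_lag"],
    )
    # run_one is a stand-in; a real launcher would start a training subprocess.
    print(run_jobs(jobs, workers=2, run_one=lambda job: f"{job[0]} on {job[1]}"))
```

With two tasks and two algorithms this yields four jobs, and `workers=2` means at most two of them run at the same time.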

Multi-Agent

We also provide a safe MARL algorithm benchmark on challenging Safety-Gymnasium tasks: Safe Multi-Agent Velocity, Safe Isaac Gym, and Safe MultiGoal. HAPPO, MACPO, MAPPO-Lag, and MAPPO have already been implemented.

To train a multi-agent algorithm:

cd safepo/multi_agent
python macpo.py --task Safety2x4AntVelocity-v0 --experiment benchmark

You can also train on Isaac Gym based environments if you have installed Isaac Gym.

cd safepo/multi_agent
python macpo.py --task ShadowHandOver_Safe_joint --experiment benchmark

Experiment Evaluation

After running the experiment, you can use the following command to plot the results:

cd safepo
python plot.py --logdir ./runs/benchmark

To evaluate the performance of the algorithm, you can use the following command:

cd safepo
python evaluate.py --benchmark-dir ./runs/benchmark

Machine Configuration

We test all algorithms and experiments on an AMD Ryzen Threadripper PRO 3975WX 32-Cores CPU and an NVIDIA GeForce RTX 3090 GPU (Driver Version: 495.44). All of our experiments are run on Linux. If you encounter any problems on macOS or Windows, please feel free to open an issue.

Ethical and Responsible Use

SafePO aims to benefit research in the safe RL community, and is released under the Apache-2.0 license. Illegal usage or any violation of the license is not allowed.

PKU-Alignment Team

The Baseline is a project contributed by PKU-Alignment at Peking University. We also thank the contributors of the following open-source repositories: Spinning Up, Bullet-Safety-Gym, Safety-Gym.