In the method description on page 3747 (lower-left corner) of the Penalized Proximal Policy Optimization for Safe Reinforcement Learning (P3O) paper, the authors describe two implementation methods for P3O and consider both to be effective in practice.
Method 1 (from P3O paper)
As shown in Algorithm 2, we increase κ at every time step, and the early stopping condition is fulfilled when the distance between solutions of two adjacent steps is small enough or the current policy is out of the trust region.
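For intuition, here is a minimal sketch of what Method 1 could look like in PyTorch. The names `compute_p3o_loss`, `compute_kl`, and all hyperparameters (`kappa_growth`, `tol`, `kl_limit`) are illustrative assumptions, not the paper's Algorithm 2 or OmniSafe's actual API:

```python
import torch

def update_with_increasing_kappa(
    policy,              # hypothetical torch.nn.Module policy
    compute_p3o_loss,    # hypothetical fn(policy, kappa) -> scalar loss tensor
    compute_kl,          # hypothetical fn(policy) -> KL divergence to old policy
    kappa_init=1.0,
    kappa_growth=1.1,    # multiplicative growth of kappa per step (assumed)
    max_steps=80,
    tol=1e-6,            # stop when adjacent solutions are close enough
    kl_limit=0.02,       # stop when the policy leaves the trust region
):
    kappa = kappa_init
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
    prev_params = torch.nn.utils.parameters_to_vector(policy.parameters()).detach()

    for _ in range(max_steps):
        loss = compute_p3o_loss(policy, kappa)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Early stopping: the distance between solutions of two adjacent
        # steps is small enough, or the policy is out of the trust region.
        params = torch.nn.utils.parameters_to_vector(policy.parameters()).detach()
        if torch.norm(params - prev_params) < tol or compute_kl(policy) > kl_limit:
            break
        prev_params = params

        kappa *= kappa_growth  # increase kappa at every time step
    return policy
```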
Method 2 (from P3O paper)
In practice, we utilize the normalization trick that maps the advantage estimation to an approximate standard normal distribution regardless of the tasks themselves. We find this technique enables a fixed κ for general good results across different tasks.
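A rough sketch of how this normalization trick combines with a fixed κ, assuming a simplified (unclipped) P3O-style penalized objective; `p3o_loss_sketch`, its arguments, and the default `kappa=20.0` are illustrative placeholders, not OmniSafe's exact code:

```python
import torch
import torch.nn.functional as F

def normalize_advantages(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardize to roughly zero mean / unit variance, independent of task scale.
    return (adv - adv.mean()) / (adv.std() + eps)

def p3o_loss_sketch(ratio, adv_r, adv_c, cost_violation, kappa=20.0):
    # ratio: pi_new / pi_old; cost_violation: J_C(pi_k) - d (current cost minus limit).
    reward_loss = -(ratio * normalize_advantages(adv_r)).mean()
    # ReLU penalizes only when the cost constraint surrogate is violated;
    # since the advantages are normalized, a single fixed kappa transfers
    # across tasks with very different reward/cost scales.
    cost_surrogate = (ratio * normalize_advantages(adv_c)).mean() + cost_violation
    return reward_loss + kappa * F.relu(cost_surrogate)
```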
Authors' Statement (from P3O paper)
Experimental results show that both of the above algorithms work effectively and the learning processes are stable in a wide range of κ.
OmniSafe has adopted the second implementation method.
Questions
I didn't find any update process for kappa while studying the P3O algorithm. Is there an update process for it in /omnisafe/omnisafe/algorithms/on_policy/penalty_function/p3o.py?