PKU-Alignment / omnisafe

JMLR: OmniSafe is an infrastructural framework for accelerating SafeRL research.
https://www.omnisafe.ai
Apache License 2.0

[Question] why the form of IPO algorithm is not the same as original paper(the log form) #223

Closed stvsd1314 closed 2 months ago

stvsd1314 commented 1 year ago

Required prerequisites

Questions

About the IPO algorithm: why is its form not the same as in the original paper (the log form)? This confuses me. Also, I find that using IPO performs worse than simply using a penalty on the cost. What is the reason? Thanks!

calico-1226 commented 1 year ago

When implementing IPO as written in the paper, we encountered two issues with the logarithmic form. First, reinforcement learning estimates the expected sum of costs from samples, so even when starting from a feasible policy, the optimization often reaches points where the estimated cost sum exceeds the threshold. The argument of the logarithm then becomes non-positive, and the IPO objective is undefined, so the algorithm cannot proceed. Second, when the cost sum is very close to the threshold, the Hessian has a large condition number; this pathological curvature in gradient descent causes the gradient to explode.
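As a rough numeric illustration of both failure modes (the cost limit d = 25 and barrier weight 1/t = 0.01 below are made-up values, not OmniSafe defaults):

```python
import numpy as np

# Log barrier: (1/t) * log(d - Jc). It is undefined once Jc >= d, and its
# derivative w.r.t. Jc, 1/(t * (d - Jc)), blows up as Jc approaches d.
d, inv_t = 25.0, 0.01

for Jc in (20.0, 24.99, 25.5):
    barrier = inv_t * np.log(d - Jc) if Jc < d else float("nan")
    grad_mag = inv_t / (d - Jc) if Jc < d else float("inf")
    print(f"Jc={Jc:6.2f}  barrier={barrier:9.4f}  |d barrier / d Jc|={grad_mag:10.4f}")
```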

Therefore, we apply an equivalent transformation to the IPO objective. When the constraint is violated or the gradient would be too large, we cap the penalty coefficient at an artificially chosen maximum value. The specific theoretical analysis is as follows.

In IPO, the loss function is

$$\mathcal{L}(\theta) = J^R(\theta) + \sum_i \frac{1}{t}\log\bigl(d_i - J^{C_i}(\theta)\bigr),$$

where $J^R(\theta)$ is the reward return, $J^{C_i}(\theta)$ the $i$-th cost return, $d_i$ the corresponding threshold, and $1/t$ the barrier weight. Its gradient is

$$\nabla_\theta \mathcal{L}(\theta) = \nabla_\theta J^R(\theta) - \sum_i \underbrace{\frac{1}{t\bigl(d_i - J^{C_i}(\theta)\bigr)}}_{\lambda_i(\theta)}\,\nabla_\theta J^{C_i}(\theta).$$

Thus, we rewrite the loss function as

$$\mathcal{L}^\lambda(\theta) = J^R(\theta) - \sum_i \lambda_{i,\theta}\, J^{C_i}(\theta),$$

where the parameter $\lambda_{i,\theta}$ is defined as the minimum of $\lambda_i(\theta)$ and a maximum value $\lambda_\text{max}$ to avoid issues with ill-conditioning. It depends on $\theta$ but is detached when computing the gradient. Notably, $\mathcal{L}^\lambda(\theta)$ and $\mathcal{L}(\theta)$ have the same gradient when $\lambda_{i,\theta} = \lambda_i(\theta)$. Moreover, when IPO violates the constraints, we can still update the policy using $\mathcal{L}^\lambda(\theta)$ by simply setting $\lambda_{i,\theta} = \lambda_\text{max}$.
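For concreteness, here is a minimal PyTorch sketch of this clipped-lambda surrogate (a simplified, unclipped PPO-style surrogate; the names `kappa`, `cost_limit`, and `lambda_max` stand in for $1/t$, $d$, and $\lambda_\text{max}$ and are assumptions, not OmniSafe's exact API):

```python
import torch

def ipo_loss(
    ratio: torch.Tensor,    # importance ratio pi_theta(a|s) / pi_old(a|s)
    adv_r: torch.Tensor,    # reward advantage estimates
    adv_c: torch.Tensor,    # cost advantage estimates
    Jc: float,              # estimated cost return J^C
    cost_limit: float,      # threshold d
    kappa: float,           # barrier weight, playing the role of 1/t
    lambda_max: float,      # upper bound lambda_max
) -> torch.Tensor:
    gap = cost_limit - Jc
    # lambda_{i,theta}: clip at lambda_max, and fall back to lambda_max
    # when the constraint is violated (gap <= 0).
    lam = lambda_max if gap <= 0 else min(kappa / gap, lambda_max)
    # lam is a plain float, so it is effectively detached from the computation graph.
    surrogate = adv_r - lam * adv_c
    # Maximizing L^lambda(theta) == minimizing its negative.
    return -(ratio * surrogate).mean()
```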

zmsn-2077 commented 1 year ago

Feel free to ask to reopen this if you have more questions.

stvsd1314 commented 1 year ago

Thanks for your response! I understand what you mean. By the way, why does the function "compute_adv_surrogate" in IPO return "(adv_r - penalty * adv_c) / (1 + penalty)" instead of "adv_r"? And when there is more than one constraint, how can I modify this algorithm? Thanks a lot!

Xi-HHHM commented 3 months ago

Based on the equations mentioned above, lambda should be computed from the advantage A_c, but the code here uses J_c:

penalty = self._cfgs.algo_cfgs.kappa / (self._cfgs.algo_cfgs.cost_limit - Jc + 1e-8)

Could you please explain a little bit why? Thank you!

Gaiejj commented 3 months ago

This is an environment-specific adjustment to the IPO implementation. Since the safety cost in Safety-Gymnasium is an undiscounted, finite-horizon sum, the estimate of A_c often cannot accurately reflect how severely the agent violates the constraint. Therefore, we use J_c instead in the code implementation.
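A hedged sketch of how that J_c-based penalty coefficient is then used (the clamping to a maximum value mirrors the $\lambda_\text{max}$ discussion above; `penalty_max` and the function name are assumptions for illustration, not a verbatim copy of the OmniSafe source):

```python
def compute_adv_surrogate_sketch(adv_r, adv_c, Jc, cost_limit, kappa, penalty_max):
    """Blend reward and cost advantages using a penalty built from the episode cost Jc."""
    penalty = kappa / (cost_limit - Jc + 1e-8)
    # If the constraint is violated (penalty < 0) or the gap is tiny (penalty huge),
    # fall back to the maximum penalty, as in the clipped-lambda derivation above.
    if penalty < 0 or penalty > penalty_max:
        penalty = penalty_max
    return (adv_r - penalty * adv_c) / (1 + penalty)
```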

Andrewllab commented 2 months ago

Why divide by (1 + penalty) in the loss calculation: return (adv_r - penalty * adv_c) / (1 + penalty)? Thanks a lot.

Xi-HHHM commented 2 months ago

@Andrewllab You can find an explanation here: https://github.com/PKU-Alignment/omnisafe/issues/234
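For readers who do not follow the link, one hedged way to read the normalization (an algebraic observation, not necessarily the exact argument given in #234): dividing by (1 + penalty) turns the blend into a convex combination of adv_r and -adv_c, so the surrogate advantage stays on the same scale as the original advantages no matter how large the penalty grows.

```python
# (adv_r - p * adv_c) / (1 + p) == w * adv_r + (1 - w) * (-adv_c)  with  w = 1 / (1 + p)
for p in (0.0, 1.0, 10.0, 100.0):
    adv_r, adv_c = 1.0, 2.0
    blended = (adv_r - p * adv_c) / (1 + p)
    w = 1 / (1 + p)
    assert abs(blended - (w * adv_r + (1 - w) * (-adv_c))) < 1e-12
    print(f"p={p:6.1f}  blended advantage = {blended:7.3f}")
```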

Gaiejj commented 2 months ago

Since there has been no response for a long time, we will close this issue. Please feel free to reopen it if you encounter any new problems!