stvsd1314 closed this issue 1 year ago
Sorry, there was a typo in my question. What I meant to ask was why I had to divide by the term `(1 + penalty)`.
That is a pretty good question. The division is there because, when the penalty value is large, the update direction would otherwise be dominated by the cost term. Normalizing by `(1 + penalty)` makes the update interpolate between two extremes: when penalty = 0, the update is equivalent to classical reinforcement-learning algorithms such as Policy Gradient or PPO, and as penalty → $+\infty$, the update simply minimizes the cost. We will provide performance curves of IPO on multiple environments as soon as possible to validate these ideas with experimental results. We will also consider your suggestions and run experiments with the settings you provided. Thank you again for your feedback, and a Pull Request implementing your ideas is also welcome.
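A minimal sketch of this interpolation, assuming PyTorch tensors for the advantages (the function name mirrors `compute_adv_surrogate` from the discussion, but the body here is only an illustration of the formula, not the library's actual implementation):

```python
import torch


def compute_adv_surrogate(
    adv_r: torch.Tensor, adv_c: torch.Tensor, penalty: float
) -> torch.Tensor:
    """Blend reward and cost advantages, normalized by (1 + penalty).

    At penalty = 0 this reduces to adv_r (a plain PG/PPO update);
    as penalty -> +inf it approaches -adv_c (pure cost minimization),
    while the magnitude of the surrogate stays bounded instead of
    growing with the penalty coefficient.
    """
    return (adv_r - penalty * adv_c) / (1.0 + penalty)
```

Without the `(1 + penalty)` divisor, the surrogate's scale (and hence the effective step size) would grow with the penalty; the normalization keeps the gradient magnitude comparable across penalty values.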
It seems this issue has been resolved, so I am going to close it now. If you have any other questions, feel free to continue asking.
Required prerequisites
Questions
Why is the return of the function `compute_adv_surrogate` in IPO `(adv_r - penalty * adv_c) / (1 + penalty)` instead of `adv_r`? And when there is more than one constraint, how can I modify this algorithm? Thanks a lot!
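For the multi-constraint part of the question, one natural generalization (my own hedged sketch, not something confirmed by the maintainers) is to subtract each penalty-weighted cost advantage and normalize by one plus the sum of the penalties, so the single-constraint case above is recovered exactly:

```python
from typing import Sequence

import torch


def compute_adv_surrogate_multi(
    adv_r: torch.Tensor,
    adv_cs: Sequence[torch.Tensor],
    penalties: Sequence[float],
) -> torch.Tensor:
    """Multi-constraint variant: one penalty coefficient per cost advantage.

    Reduces to (adv_r - p * adv_c) / (1 + p) when there is a single
    constraint, and to pure cost minimization as the penalties grow.
    """
    blended = adv_r.clone()
    for penalty, adv_c in zip(penalties, adv_cs):
        blended = blended - penalty * adv_c
    return blended / (1.0 + sum(penalties))
```

Each constraint would keep its own penalty (e.g. its own barrier term in IPO), and the shared divisor again bounds the surrogate's magnitude as the penalties grow.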