stvsd1314 closed this issue 1 year ago
Sorry, there was a typo in my question. What I meant to ask was why I had to divide by the term `(1 + penalty)`.
That is a pretty good question. The division is there because, when the penalty value is large, the update direction would otherwise be dominated by the cost term. Normalizing by `(1 + penalty)` makes the update interpolate between two extremes: when penalty = 0, the update is equivalent to classical reinforcement-learning algorithms such as Policy Gradient or PPO, and as penalty → $+\infty$, the update simply minimizes the cost. We will provide performance curves of IPO on multiple environments as soon as possible to validate these ideas with experimental results. We will also consider your suggestions and run experiments with the settings you provided. Thank you again for your feedback, and a Pull Request implementing your ideas is also welcome.
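A minimal sketch of this interpolation, assuming PyTorch tensors for the advantages (the function name mirrors `compute_adv_surrogate` from the discussion, but the body here is only an illustration of the formula, not the library's actual implementation):

```python
import torch


def compute_adv_surrogate(
    adv_r: torch.Tensor, adv_c: torch.Tensor, penalty: float
) -> torch.Tensor:
    """Blend reward and cost advantages, normalized by (1 + penalty).

    At penalty = 0 this reduces to adv_r (a plain PG/PPO update);
    as penalty -> +inf it approaches -adv_c (pure cost minimization),
    while the magnitude of the surrogate stays bounded instead of
    growing with the penalty coefficient.
    """
    return (adv_r - penalty * adv_c) / (1.0 + penalty)
```

Without the `(1 + penalty)` divisor, the surrogate's scale (and hence the effective step size) would grow with the penalty; the normalization keeps the gradient magnitude comparable across penalty values.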
It seems this issue has been resolved, so I am going to close it now. If you have any other questions, feel free to continue asking.
Required prerequisites
Questions
Why is the return of the function `compute_adv_surrogate` in IPO `(adv_r - penalty * adv_c) / (1 + penalty)` instead of `adv_r`? And when there is more than one constraint, how can I modify this algorithm? Thanks a lot!
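For the multi-constraint part of the question, one natural generalization (my own hedged sketch, not something confirmed by the maintainers) is to subtract each penalty-weighted cost advantage and normalize by one plus the sum of the penalties, so the single-constraint case above is recovered exactly:

```python
from typing import Sequence

import torch


def compute_adv_surrogate_multi(
    adv_r: torch.Tensor,
    adv_cs: Sequence[torch.Tensor],
    penalties: Sequence[float],
) -> torch.Tensor:
    """Multi-constraint variant: one penalty coefficient per cost advantage.

    Reduces to (adv_r - p * adv_c) / (1 + p) when there is a single
    constraint, and to pure cost minimization as the penalties grow.
    """
    blended = adv_r.clone()
    for penalty, adv_c in zip(penalties, adv_cs):
        blended = blended - penalty * adv_c
    return blended / (1.0 + sum(penalties))
```

Each constraint would keep its own penalty (e.g. its own barrier term in IPO), and the shared divisor again bounds the surrogate's magnitude as the penalties grow.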