Open · tjruan opened 2 months ago
Does this equation represent an advantage function that combines rewards and costs?
Yes, this is an objective function that considers both reward advantage and cost advantage simultaneously.
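Roughly, the combined objective takes a form like the following (a sketch written from the description above; the exact normalization in the documentation's equation may differ):

$$
L(s, a) = \frac{A^{R}(s, a) - \lambda \, A^{C}(s, a)}{1 + \lambda}
$$

where $A^{R}$ and $A^{C}$ are the reward and cost advantages and $\lambda$ is the Lagrange multiplier.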
Is this L in PPOlag replacing A(s, a) in PPO?
Yes, that is correct.
If so, can you point me to where the clipping is in the PPOlag code?
The implementation of PPOLag simply replaces A(s, a) with the objective weighted by the Lagrange multiplier. The clipping operation can therefore be found in OmniSafe's PPO implementation, specifically in lines 69 to 76 of PPO's loss-calculation function:
```python
ratio = torch.exp(logp_ - logp)
ratio_cliped = torch.clamp(
    ratio,
    1 - self._cfgs.algo_cfgs.clip,
    1 + self._cfgs.algo_cfgs.clip,
)
loss = -torch.min(ratio * adv, ratio_cliped * adv).mean()
loss -= self._cfgs.algo_cfgs.entropy_coef * distribution.entropy().mean()
```
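For context: PPOLag reuses this clipped loss unchanged and only changes what `adv` contains. A minimal sketch of how the weighted advantage could be formed is below; the function name `combine_advantages` and the `1 + lambda` normalization are illustrative assumptions, not verbatim OmniSafe code:

```python
import torch


# Sketch only (not verbatim OmniSafe code): weight the cost advantage by the
# current Lagrange multiplier so the result can play the role of A(s, a) in
# the clipped PPO loss above. The (1 + lambda) normalization is an assumption.
def combine_advantages(
    adv_r: torch.Tensor,
    adv_c: torch.Tensor,
    lagrangian_multiplier: float,
) -> torch.Tensor:
    return (adv_r - lagrangian_multiplier * adv_c) / (1.0 + lagrangian_multiplier)
```

The tensor returned here is what would be passed as `adv` to the loss computation shown above.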
Questions
Hello OmniSafe team, thank you very much for your contribution. I have some confusion that I hope you can clear up for me; I would appreciate it! The original PPO algorithm uses the CLIP objective function. The documentation mentions that the surrogate loss function for the PPOlag algorithm is:
Does this equation represent an advantage function that combines rewards and costs? Is this L in PPOlag replacing A(s, a) in PPO?
If so, can you point me to where the clipping is in the PPOlag code? I would greatly appreciate it if you could answer my questions.