Open yaorong1996 opened 1 year ago
I noticed that the advantage computation in PPOAgent, from line 514 in grid/toy_grid_dag.py:
adv = r + vsp * (1-d) - vs
implements only the delta term (the one-step TD residual) from the original PPO paper. It is not the full advantage estimator, which sums discounted delta terms over the following time steps. Am I misunderstanding your code, or PPO?
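
To make the difference concrete, here is a minimal sketch of the two quantities as I understand them from the PPO paper. It is not taken from the repository. The function names `td_deltas` and `gae`, the per-step list layout, and the `gamma` and `lam` defaults are my own assumptions for illustration; `r`, `vs`, `vsp` and `d` are meant to mirror the variables in the line above. With `lam=0` the estimator reduces to the delta term, so the existing line would match the truncated estimator with lambda = 0 and gamma = 1.

```python
def td_deltas(r, vs, vsp, d, gamma=1.0):
    """One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) * (1 - d_t) - V(s_t).

    With gamma = 1 this is the same expression as `adv = r + vsp * (1-d) - vs`.
    """
    return [
        r_t + gamma * vsp_t * (1 - d_t) - vs_t
        for r_t, vs_t, vsp_t, d_t in zip(r, vs, vsp, d)
    ]


def gae(r, vs, vsp, d, gamma=1.0, lam=0.95):
    """Truncated advantage estimator: A_t = sum_l (gamma * lam)^l * delta_{t+l}.

    Assumes the inputs are ordered in time along one trajectory (hypothetical
    layout); the sum is cut at episode ends, where d_t = 1.
    """
    deltas = td_deltas(r, vs, vsp, d, gamma)
    adv = [0.0] * len(deltas)
    running = 0.0
    # Accumulate backwards so each step reuses the sum from the next step.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * (1 - d[t]) * running
        adv[t] = running
    return adv


# Toy three-step episode with made-up numbers.
r = [0.0, 0.0, 1.0]
vs = [0.5, 0.6, 0.7]
vsp = [0.6, 0.7, 0.0]
d = [0, 0, 1]

delta_only = td_deltas(r, vs, vsp, d)
full_adv = gae(r, vs, vsp, d, lam=1.0)
```

In this toy episode the delta term is roughly 0.1 at the first step, while the estimator with `lam=1.0` gives roughly 0.5 there, because it also adds the residuals of the later steps.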