In your implementation of the PPO loss, do you not need to collapse both `prob` and `old_prob` down to a single scalar per row, instead of a vector with a single non-zero entry? Otherwise, it seems that you flood the loss with negative numbers if the advantage is negative.

See below for a walkthrough of what I think is happening - I've split the implementation into a few extra variables and removed entropy for ease of explanation.
Let's say we have this input
Then I calculate the following:
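To make that concrete, here is a minimal sketch in plain Python. The numbers (one row, three actions, action 1 taken, advantage of -1), the clip range of 0.2, and every variable name other than `prob` and `old_prob` are assumptions for illustration, as is the final mean over all entries:

```python
# Hypothetical inputs: one row, three actions, action 1 was taken.
y_true = [0.0, 1.0, 0.0]     # one-hot encoding of the chosen action
y_pred = [0.2, 0.5, 0.3]     # probabilities under the new policy
old_pred = [0.3, 0.4, 0.3]   # probabilities under the old policy
advantage = -1.0
CLIP = 0.2                   # assumed clip range
EPS = 1e-10                  # guards against division by zero


def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


# Masking keeps a full vector with a single non-zero entry.
prob = [t * p for t, p in zip(y_true, y_pred)]          # [0.0, 0.5, 0.0]
old_prob = [t * p for t, p in zip(y_true, old_pred)]    # [0.0, 0.4, 0.0]

# The ratio is 0 for every action that was not taken.
ratio = [p / (o + EPS) for p, o in zip(prob, old_prob)]  # about [0, 1.25, 0]

unclipped = [r * advantage for r in ratio]
clipped = [clip(r, 1 - CLIP, 1 + CLIP) * advantage for r in ratio]

# For the zero entries, clipping lifts the ratio to 0.8, so the min picks
# 0.8 * advantage (negative) over 0 * advantage.
surrogate = [min(u, c) for u, c in zip(unclipped, clipped)]  # about [-0.8, -1.25, -0.8]

loss = -sum(surrogate) / len(surrogate)
print(surrogate, loss)
```

With these numbers the two actions that were never taken each contribute -0.8 to the surrogate, so the loss comes out at roughly 0.95 rather than the 1.25 that the chosen action alone would give.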
However, my understanding of the PPO loss is that only the chosen action should be used in the calculation. That is, we should collapse the vector early, so that we're working with a single scalar per row:
This way, we get the following:
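As a minimal sketch of the collapsed version in plain Python: the numbers (one row, three actions, action 1 taken, advantage of -1), the clip range of 0.2, and the variable names other than `prob` and `old_prob` are assumptions for illustration, not values taken from your code:

```python
# Hypothetical inputs: one row, three actions, action 1 was taken.
y_true = [0.0, 1.0, 0.0]     # one-hot encoding of the chosen action
y_pred = [0.2, 0.5, 0.3]     # probabilities under the new policy
old_pred = [0.3, 0.4, 0.3]   # probabilities under the old policy
advantage = -1.0
CLIP = 0.2                   # assumed clip range
EPS = 1e-10                  # guards against division by zero


def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


# Collapse early: summing over the action axis leaves only the
# probability of the action that was actually taken.
prob = sum(t * p for t, p in zip(y_true, y_pred))        # 0.5
old_prob = sum(t * p for t, p in zip(y_true, old_pred))  # 0.4

ratio = prob / (old_prob + EPS)                          # about 1.25

# One scalar surrogate per row, so untaken actions cannot contribute.
surrogate = min(ratio * advantage,
                clip(ratio, 1 - CLIP, 1 + CLIP) * advantage)

loss = -surrogate  # mean over a single row
print(surrogate, loss)
```

Here the row contributes a single surrogate value of roughly -1.25, giving a loss of about 1.25, with nothing added for the actions that were not taken.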
Is this correct? If not, what am I misunderstanding?
Thanks for your help!
David