ericyangyu / PPO-for-Beginners

A simple and well-styled PPO implementation, based on my Medium series: https://medium.com/@eyyu/coding-ppo-from-scratch-with-pytorch-part-1-4-613dfc1b14c8.
MIT License

Why is the critic's loss the mean squared error between the predicted values and the rewards-to-go? #8

Closed: Lumozz closed this issue 1 year ago

Lumozz commented 2 years ago

Thanks very much for the tutorial, but I have a question. From my understanding, the critic's loss should be (predicted value - true value)^2. In the code and the paper it is critic_loss = nn.MSELoss()(V, batch_rtgs), where V is the predicted value. But why can batch_rtgs be treated as the true value? Earlier it was used as the Q-value in the advantage function: A_k = batch_rtgs - V.detach()
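
To make the question concrete, here is a minimal PyTorch sketch of the two lines I am asking about (the tensor values are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Made-up values for illustration only: in the real code, V comes from the
# critic network evaluated on the batch observations, and batch_rtgs comes
# from the rewards observed during rollouts.
V = torch.tensor([1.0, 0.5, 2.0])           # critic's predicted values V(s)
batch_rtgs = torch.tensor([1.2, 0.4, 2.5])  # rewards-to-go used as targets

# Advantage estimate for the actor: observed return minus predicted value.
# V is detached so this term does not backpropagate into the critic.
A_k = batch_rtgs - V.detach()

# Critic loss: mean((V - batch_rtgs)^2) over the batch.
critic_loss = nn.MSELoss()(V, batch_rtgs)
print(A_k, critic_loss)
```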

ericyangyu commented 2 years ago

Hi Lumozz,

Sorry for the late response; I haven't been monitoring this repo lately. I used the MSE loss function from the OpenAI pseudocode that I linked, but you can probably use other loss functions as well.

batch_rtgs is a "true" value because it was obtained through observation rather than prediction. It is not just a Q-value but rather max_a q(s, a), or simply v(s). It will stabilize as training progresses, since epsilon decreases over time, making action selection more deterministic and, ideally, optimal, and PPO trains on a fresh batch of data every iteration.
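
For reference, the rewards-to-go come straight from the rewards observed during the rollouts, summed backwards with a discount through each episode. Here is a sketch of that computation (the function name and the gamma value are illustrative, not necessarily identical to the repo's code):

```python
import torch

def compute_rtgs(batch_rews, gamma=0.95):
    # batch_rews: list of episodes, each a list of observed per-step rewards.
    # Returns a flat tensor of discounted rewards-to-go, aligned with the
    # order in which the observations were collected.
    batch_rtgs = []
    # Walk episodes in reverse and insert at the front so the final tensor
    # preserves the original episode/timestep order.
    for ep_rews in reversed(batch_rews):
        discounted = 0.0
        for rew in reversed(ep_rews):
            # rtg(t) = r_t + gamma * rtg(t+1)
            discounted = rew + gamma * discounted
            batch_rtgs.insert(0, discounted)
    return torch.tensor(batch_rtgs, dtype=torch.float)

# Two short example episodes of observed rewards.
print(compute_rtgs([[1.0, 0.0, 1.0], [0.5, 0.5]]))
```

Since every entry is built only from rewards the environment actually returned, it serves as an observed target for V(s), which is why the critic's loss is the MSE between V and batch_rtgs.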