ericyangyu / PPO-for-Beginners

A simple and well-styled PPO implementation, based on my Medium series: https://medium.com/@eyyu/coding-ppo-from-scratch-with-pytorch-part-1-4-613dfc1b14c8.
MIT License

Why is the critic's loss the mean squared error between the predicted values and the rewards-to-go? #8

Closed: Lumozz closed this issue 1 year ago

Lumozz commented 2 years ago

Thanks very much for the tutorial, but I have a question. From my understanding, the critic's loss should be (predicted value - true value)^2. In the code and the paper it is critic_loss = nn.MSELoss()(V, batch_rtgs), where V is the predicted value. But why can batch_rtgs be treated as the true value? Earlier it was used as the Q-value in the advantage function: A_k = batch_rtgs - V.detach()
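
To make the question concrete, here is a minimal PyTorch sketch of the two lines I am asking about (the tensor values are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Made-up values for illustration only: in the real code, V comes from the
# critic network evaluated on the batch observations, and batch_rtgs comes
# from the rewards observed during rollouts.
V = torch.tensor([1.0, 0.5, 2.0])           # critic's predicted values V(s)
batch_rtgs = torch.tensor([1.2, 0.4, 2.5])  # rewards-to-go used as targets

# Advantage estimate for the actor: observed return minus predicted value.
# V is detached so this term does not backpropagate into the critic.
A_k = batch_rtgs - V.detach()

# Critic loss: mean((V - batch_rtgs)^2) over the batch.
critic_loss = nn.MSELoss()(V, batch_rtgs)
print(A_k, critic_loss)
```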

ericyangyu commented 2 years ago

Hi Lumozz,

Sorry for the late response; I haven't been monitoring this repo lately. I used the MSE loss function from the OpenAI pseudocode that I linked, but you can probably use other loss functions as well.

batch_rtgs is a "true" value because it was obtained through observation rather than prediction. It is not just a Q-value but rather max_a q(s, a), or simply v(s). It will stabilize as training progresses, since epsilon decreases over time, making action selection more deterministic and, ideally, optimal, and PPO trains on a fresh batch of data every iteration.
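
For reference, the rewards-to-go come straight from the rewards observed during the rollouts, summed backwards with a discount through each episode. Here is a sketch of that computation (the function name and the gamma value are illustrative, not necessarily identical to the repo's code):

```python
import torch

def compute_rtgs(batch_rews, gamma=0.95):
    # batch_rews: list of episodes, each a list of observed per-step rewards.
    # Returns a flat tensor of discounted rewards-to-go, aligned with the
    # order in which the observations were collected.
    batch_rtgs = []
    # Walk episodes in reverse and insert at the front so the final tensor
    # preserves the original episode/timestep order.
    for ep_rews in reversed(batch_rews):
        discounted = 0.0
        for rew in reversed(ep_rews):
            # rtg(t) = r_t + gamma * rtg(t+1)
            discounted = rew + gamma * discounted
            batch_rtgs.insert(0, discounted)
    return torch.tensor(batch_rtgs, dtype=torch.float)

# Two short example episodes of observed rewards.
print(compute_rtgs([[1.0, 0.0, 1.0], [0.5, 0.5]]))
```

Since every entry is built only from rewards the environment actually returned, it serves as an observed target for V(s), which is why the critic's loss is the MSE between V and batch_rtgs.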