Lumozz closed this issue 1 year ago
Hi Lumozz,
Sorry for the late response; I haven't been monitoring this repo lately. I used the MSE loss function from the OpenAI pseudocode that I linked, but you can probably use other loss functions as well.
batch_rtgs is a "true" value because it was obtained through observation rather than prediction. It is not just a Q-value, but rather max_a q(s,a), or just v(s). It will stabilize as training progresses, since epsilon will decrease over time, making action selection more deterministic and supposedly optimal, and PPO trains on a fresh batch of data every iteration.
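To make "obtained through observation" concrete, here is a minimal sketch of how rewards-to-go can be computed from the rewards actually collected during a rollout. The function name `compute_rtgs`, the nested-list layout of `batch_rews`, and the default `gamma` are assumptions for illustration, not necessarily what the repo uses. The point is that no value network is involved: every entry is a discounted sum of observed rewards.

```python
def compute_rtgs(batch_rews, gamma=0.95):
    """Discounted reward-to-go for every timestep in a batch of episodes.

    batch_rews: list of episodes, each a list of per-step rewards
                (assumed layout for this sketch).
    gamma:      discount factor (0.95 is an assumed default).
    """
    batch_rtgs = []
    for ep_rews in batch_rews:
        ep_rtgs = []
        discounted = 0.0
        # Walk the episode backwards so each step accumulates
        # its own reward plus the discounted rewards that follow it.
        for rew in reversed(ep_rews):
            discounted = rew + gamma * discounted
            ep_rtgs.append(discounted)
        ep_rtgs.reverse()  # restore chronological order
        batch_rtgs.extend(ep_rtgs)
    return batch_rtgs


if __name__ == "__main__":
    # One three-step episode with reward 1.0 at every step.
    print(compute_rtgs([[1.0, 1.0, 1.0]], gamma=0.5))
```

Each value is a single sampled return, so it is noisy, but it comes from the environment rather than from the critic, which is why it can serve as a regression target.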
Thanks very much for the tutorial, but I have a question. From my understanding, the critic's loss should be 'sqr(predicted value - true value)', but in the code and the paper it is
critic_loss = nn.MSELoss()(V, batch_rtgs)
V is the predicted value, but why can we treat 'batch_rtgs' as the true value? It was previously used as the Q-value in the advantage function: A_k = batch_rtgs - V.detach()
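One way to see how the same quantity can play both roles: a sampled return from state s_t after taking action a_t is a Monte Carlo sample of the return for that state-action pair, and the same number is also a sample of the return from s_t under the policy that chose the action. The sketch below uses plain Python lists instead of tensors so it runs without PyTorch. The function name `advantages_and_critic_loss` and the example numbers are hypothetical.

```python
def advantages_and_critic_loss(batch_rtgs, values):
    """Use the same sampled returns for the advantage and the critic target.

    batch_rtgs: sampled rewards-to-go, one per timestep.
    values:     critic predictions V(s_t) for the same timesteps.
    """
    # Advantage estimate: sampled return minus the critic's baseline.
    # (With tensors, V.detach() keeps this term from sending gradients
    # into the critic.)
    advantages = [rtg - v for rtg, v in zip(batch_rtgs, values)]

    # Critic loss: mean squared error between prediction and sampled return,
    # the plain-Python counterpart of nn.MSELoss()(V, batch_rtgs).
    critic_loss = sum((v - rtg) ** 2 for rtg, v in zip(batch_rtgs, values)) / len(values)
    return advantages, critic_loss


if __name__ == "__main__":
    adv, loss = advantages_and_critic_loss([1.75, 1.5, 1.0], [1.25, 1.5, 2.0])
    print(adv, loss)
```

In this setup the critic loss is just the mean of the squared advantages, so a critic that predicts the sampled returns well also drives the advantage estimates toward zero on average.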