Hi, thanks for watching.
Since training the reward model and training PPO with the reward model are two independent stages, I think there's no need to compare the reward model loss with the PPO value (critic) loss.
That said, normalization is a good idea; it usually makes the model converge faster. You can normalize the generated rewards with either of the following two methods:
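The two methods referenced above aren't spelled out in this thread, so here is a minimal sketch of two common choices (illustrative only, not necessarily the ones intended): per-batch whitening, and normalization with running statistics accumulated across batches. The class and function names are hypothetical.

```python
# Sketch of reward normalization before PPO training (illustrative, assumes
# rewards arrive as a 1-D tensor per batch).
#   1) per-batch whitening: zero mean, unit std within each batch
#   2) running statistics: mean/std tracked across all batches seen so far
import torch


def whiten(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Method 1: normalize rewards within the current batch."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


class RunningRewardNormalizer:
    """Method 2: normalize with mean/std accumulated over all batches."""

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's online algorithm)
        self.eps = eps

    def update(self, rewards: torch.Tensor) -> None:
        for r in rewards.flatten().tolist():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards: torch.Tensor) -> torch.Tensor:
        std = (self.m2 / self.count) ** 0.5 if self.count > 1 else 1.0
        return (rewards - self.mean) / (std + self.eps)


if __name__ == "__main__":
    raw = torch.tensor([12.3, 48.7, 35.1, 60.2])  # large raw reward-model outputs
    print(whiten(raw))

    normalizer = RunningRewardNormalizer()
    normalizer.update(raw)
    print(normalizer.normalize(raw))
```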
Thanks for answering.
I mean that without normalization the output of the reward model is very large, so the value loss ends up greater than the policy loss. I'll take your advice and normalize the rewards when training the PPO models.
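As a quick illustration of why this happens (my own example, not from the repo): the critic is trained with an MSE-style loss, which grows with the square of the return magnitude, while the clipped policy loss operates on probability ratios times (usually whitened) advantages and is far less sensitive to the reward scale.

```python
# Compare the critic's MSE loss against raw vs. normalized returns.
import torch
import torch.nn.functional as F

returns_raw = torch.tensor([52.0, 61.0, 47.0, 58.0])            # unnormalized returns
returns_norm = (returns_raw - returns_raw.mean()) / (returns_raw.std() + 1e-8)

values = torch.zeros_like(returns_raw)                          # an untrained critic's output

print(F.mse_loss(values, returns_raw))    # ~3000: dominates the total loss
print(F.mse_loss(values, returns_norm))   # ~0.75: comparable in scale to the policy loss
```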
Good Luck~ Looking forward to the result of your experiment ^_^
Hi, is there any need for normalization when training the reward model and when training PPO with the reward model? If not, it seems like the reward model loss will keep decreasing and the value model loss will be greater than the policy model loss.