HarderThenHarder / transformers_tasks

⭐️ NLP Algorithms with transformers lib. Supporting Text-Classification, Text-Generation, Information-Extraction, Text-Matching, RLHF, SFT etc.
https://www.zhihu.com/column/c_1451236880973426688

Need normalization for reward model? #3

Closed · ymr12 closed this 1 year ago

ymr12 commented 1 year ago

Hi, is there any need to normalize rewards when training the reward model and when training PPO with the reward model? Without normalization, it seems like the loss of the reward model keeps decreasing, and the loss of the value model is greater than that of the policy model.

HarderThenHarder commented 1 year ago

Hi, thanks for following the project.

Since training the reward model and training PPO with the reward model are two independent stages, I think there is no need to compare the reward model loss with the PPO value (critic) loss.

That said, normalization is a good idea, since it usually makes the model converge faster. You can normalize the generated reward with either of the following two methods (see the sketch after this list):

  1. Add a sigmoid layer at the end of the Reward Model (I haven't done this in my code, because I use a sentiment classifier as the reward model, whose output is already between 0 and 1).
  2. Feed in the sentence, get the reward from the Reward Model, and pass the reward through a sigmoid function (this is suitable when training your own Reward Model with the Ernie backbone in the example code).
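
For reference, here is a minimal PyTorch sketch of the two options above. The `RewardModel` class, the scalar value head, and the `nghuyong/ernie-3.0-base-zh` checkpoint name are illustrative assumptions, not the repo's actual code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class RewardModel(nn.Module):
    """Illustrative reward model: encoder backbone + scalar head (an assumption, not the repo's exact class)."""

    def __init__(self, model_name: str = "nghuyong/ernie-3.0-base-zh", use_sigmoid: bool = False):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.value_head = nn.Linear(self.encoder.config.hidden_size, 1)
        self.use_sigmoid = use_sigmoid  # Option 1: squash inside the model

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        reward = self.value_head(hidden[:, 0, :]).squeeze(-1)  # score from the [CLS] position
        if self.use_sigmoid:
            reward = torch.sigmoid(reward)  # bounded to (0, 1)
        return reward


tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-3.0-base-zh")
model = RewardModel(use_sigmoid=False)

inputs = tokenizer(["a generated response to score"], return_tensors="pt")
raw_reward = model(inputs["input_ids"], inputs["attention_mask"])

# Option 2: keep the model unbounded and squash the scalar afterwards,
# just before handing the reward to the PPO stage.
normalized_reward = torch.sigmoid(raw_reward)
```

Either way, the reward fed to PPO stays in a bounded range, which keeps the value (critic) loss on a comparable scale to the policy loss.
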
ymr12 commented 1 year ago

Thanks for answering.

I mean that the output of the reward model can be very large without normalization, hence the value loss ends up greater than the policy loss. I will take your advice and normalize the rewards when training the PPO models.

HarderThenHarder commented 1 year ago

Good Luck~ Looking forward to the result of your experiment ^_^