HarderThenHarder / transformers_tasks

⭐️ NLP Algorithms with transformers lib. Supporting Text-Classification, Text-Generation, Information-Extraction, Text-Matching, RLHF, SFT etc.
https://www.zhihu.com/column/c_1451236880973426688

Shouldn't the critic model in PPO use the reward model? #73


zhangjian94cn commented 1 year ago

The code uses a Value Head to implement the PPO critic, but the detach_value_head function it defines is never called. That means that during training, part of the capacity of the backbone network in front of the value head is also spent on estimating values. Is that reasonable? https://github.com/HarderThenHarder/transformers_tasks/blob/497811892704e81314da793523f03f5a8064417e/RLHF/trl/gpt2.py#L87
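For reference, a minimal sketch of what the unused detach_value_head is presumably meant to do, modeled on the trl-style ValueHead (the class layout and names here are simplified assumptions, not the repo's exact code): once the detach flag is flipped, hidden states are detached before the value head, so the value loss stops backpropagating into the shared GPT-2 trunk.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Scalar value head on top of the LM backbone (simplified sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.detach_head = False  # toggled by detach_value_head()
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.detach_head:
            # Cut the graph: the value loss no longer updates the backbone.
            hidden_states = hidden_states.detach()
        return self.summary(hidden_states)

# In GPT2HeadWithValueModel, detach_value_head() would just flip the flag:
#     def detach_value_head(self):
#         self.v_head.detach_head = True
# Since it is never called, value gradients keep flowing into the shared
# GPT-2 trunk during PPO updates.
```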

Could this line simply be replaced with a forward pass of a reward model?

https://github.com/HarderThenHarder/transformers_tasks/blob/497811892704e81314da793523f03f5a8064417e/RLHF/trl/gpt2.py#L120

In other words, when GPT2HeadWithValueModel is initialized, also pass in a reward model interface. Wouldn't that be more reasonable?

https://github.com/HarderThenHarder/transformers_tasks/blob/497811892704e81314da793523f03f5a8064417e/RLHF/trl/gpt2.py#L74
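A hedged sketch of the suggested alternative: give the wrapper a separate critic whose trunk is initialized from a reward model checkpoint, so value estimation no longer borrows capacity from the policy backbone. The class and argument names below (GPT2WithSeparateCritic, reward_model_name) are hypothetical illustrations, not the repo's API.

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Model

class GPT2WithSeparateCritic(nn.Module):
    """Sketch: policy and critic as two separate networks.

    The critic trunk starts from a reward model checkpoint (hypothetical
    name) and is then trained by the PPO value loss, so its gradient
    updates never touch the policy backbone.
    """

    def __init__(self, policy_name: str, reward_model_name: str):
        super().__init__()
        self.policy = GPT2LMHeadModel.from_pretrained(policy_name)
        # Critic backbone initialized from the reward model's weights.
        self.critic_backbone = GPT2Model.from_pretrained(reward_model_name)
        self.v_head = nn.Linear(self.critic_backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        lm_out = self.policy(input_ids, attention_mask=attention_mask)
        critic_hidden = self.critic_backbone(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        value = self.v_head(critic_hidden).squeeze(-1)  # per-token values
        return lm_out.logits, value
```

One design note: in the InstructGPT-style setup the critic is initialized from the reward model's weights but is still updated by the PPO value loss; calling a frozen reward model's forward directly would leave the value estimates fixed throughout training.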