HarderThenHarder / transformers_tasks

⭐️ NLP Algorithms with transformers lib. Supporting Text-Classification, Text-Generation, Information-Extraction, Text-Matching, RLHF, SFT etc.
https://www.zhihu.com/column/c_1451236880973426688

Shouldn't the critic model in PPO use the reward model? #73


zhangjian94cn commented 1 year ago

The code uses a Value Head to implement the PPO critic, but the detach_value_head function it defines is never called. That means that during training, part of the capacity of the backbone network in front of the value head is also spent on estimating values. Is that reasonable? https://github.com/HarderThenHarder/transformers_tasks/blob/497811892704e81314da793523f03f5a8064417e/RLHF/trl/gpt2.py#L87
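For reference, a minimal sketch of what the unused detach_value_head is presumably meant to do, modeled on the trl-style ValueHead (the class layout and names here are simplified assumptions, not the repo's exact code): once the detach flag is flipped, hidden states are detached before the value head, so the value loss stops backpropagating into the shared GPT-2 trunk.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Scalar value head on top of the LM backbone (simplified sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.detach_head = False  # toggled by detach_value_head()
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.detach_head:
            # Cut the graph: the value loss no longer updates the backbone.
            hidden_states = hidden_states.detach()
        return self.summary(hidden_states)

# In GPT2HeadWithValueModel, detach_value_head() would just flip the flag:
#     def detach_value_head(self):
#         self.v_head.detach_head = True
# Since it is never called, value gradients keep flowing into the shared
# GPT-2 trunk during PPO updates.
```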

Could this line simply be replaced with a forward pass of a reward model?

https://github.com/HarderThenHarder/transformers_tasks/blob/497811892704e81314da793523f03f5a8064417e/RLHF/trl/gpt2.py#L120

In other words, when GPT2HeadWithValueModel is initialized, also pass in a reward model interface. Wouldn't that be more reasonable?

https://github.com/HarderThenHarder/transformers_tasks/blob/497811892704e81314da793523f03f5a8064417e/RLHF/trl/gpt2.py#L74
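A hedged sketch of the suggested alternative: give the wrapper a separate critic whose trunk is initialized from a reward model checkpoint, so value estimation no longer borrows capacity from the policy backbone. The class and argument names below (GPT2WithSeparateCritic, reward_model_name) are hypothetical illustrations, not the repo's API.

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Model

class GPT2WithSeparateCritic(nn.Module):
    """Sketch: policy and critic as two separate networks.

    The critic trunk starts from a reward model checkpoint (hypothetical
    name) and is then trained by the PPO value loss, so its gradient
    updates never touch the policy backbone.
    """

    def __init__(self, policy_name: str, reward_model_name: str):
        super().__init__()
        self.policy = GPT2LMHeadModel.from_pretrained(policy_name)
        # Critic backbone initialized from the reward model's weights.
        self.critic_backbone = GPT2Model.from_pretrained(reward_model_name)
        self.v_head = nn.Linear(self.critic_backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        lm_out = self.policy(input_ids, attention_mask=attention_mask)
        critic_hidden = self.critic_backbone(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        value = self.v_head(critic_hidden).squeeze(-1)  # per-token values
        return lm_out.logits, value
```

One design note: in the InstructGPT-style setup the critic is initialized from the reward model's weights but is still updated by the PPO value loss; calling a frozen reward model's forward directly would leave the value estimates fixed throughout training.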