Closed mickelliu closed 1 year ago
In my opinion, imitation learning resembles conventional supervised fine-tuning (SFT) more, since human-generated demonstrations fit the term "expert" better than model-generated responses do. I think the performance "upper bound" is defined by the "expert" samples, while in RRHF the "upper bound" is defined by the performance of the reward model; RRHF can help the policy outperform the "expert" data. I partly agree that RRHF and imitation learning have similarities, but I would not refer to the top-ranked responses as "expert", especially at the early stage of tuning.
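For context, the part of RRHF that goes beyond plain SFT is its ranking objective. A minimal sketch, based on my reading of the RRHF paper: given a length-normalized log-probability `p_i` for each sampled response under the policy and a reward-model score `r_i`, the ranking loss penalizes any pair where a lower-reward response is more likely than a higher-reward one. The function below is an illustrative stand-in, not the authors' implementation.

```python
def rrhf_rank_loss(log_probs, rewards):
    """Sketch of the RRHF pairwise ranking loss.

    log_probs: length-normalized log-probabilities of each response
               under the current policy.
    rewards:   scalar reward-model scores for the same responses.

    For every pair where rewards[i] < rewards[j], the loss adds
    max(0, log_probs[i] - log_probs[j]), i.e. it penalizes the
    lower-reward response being more likely than the higher-reward one.
    """
    loss = 0.0
    n = len(log_probs)
    for i in range(n):
        for j in range(n):
            if rewards[i] < rewards[j]:
                loss += max(0.0, log_probs[i] - log_probs[j])
    return loss
```

When the policy already ranks responses in the same order as the reward model, the loss is zero; otherwise it grows with how far the likelihoods are inverted.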
Sorry, I kind of forgot my manners. We really appreciate the discussion and your comments.
Thanks for following up on the discussion, @Yuanhy1997. Apologies if I started this discussion all of a sudden; perhaps I should have at least opened with a greeting! There is no need to be defensive, as I am not your paper's reviewer, just here for learning :)
tbh I'm unfamiliar with all this stuff, and I'm not sure what the reward model is comprised of. I have a couple of questions if you don't mind: does the Dahoas/gptj-rm-static ckpt return a scalar value? I have never worked with this model.

For the first question, you are right: the reward model, including the Dahoas/gptj-rm-static ckpt we used, returns a scalar value for each response, and we rank the responses according to that scalar value.
For the second question, I think your "human feedback" refers to the human-written responses, and they are not always assigned the maximum reward score. The reasons, in my opinion, are: first, the reward model is not a gold standard, so it may not always identify the best response; second, the policy model is capable of generating responses that are better than the human annotations. Also, in case there is a misunderstanding, our "human feedback" refers to the reward model, which serves as a proxy for human feedback, since involving real humans is really expensive.
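The scoring-and-ranking step described above can be sketched as follows. The real scorer would be the reward-model checkpoint (e.g. Dahoas/gptj-rm-static) mapping a (prompt, response) pair to a scalar; here `score` is a placeholder stand-in so the example is self-contained, and `rank_responses` is a hypothetical helper name.

```python
def score(prompt: str, response: str) -> float:
    # Placeholder: a real reward model returns a learned scalar score
    # for the (prompt, response) pair. Here we simply prefer longer
    # responses, purely for illustration.
    return float(len(response))

def rank_responses(prompt, responses):
    # Sort candidate responses from highest to lowest reward score;
    # the top-ranked response need not be the human-written one.
    return sorted(responses, key=lambda r: score(prompt, r), reverse=True)

prompt = "Explain RRHF in one sentence."
candidates = [
    "A ranking loss.",
    "RRHF aligns a model by ranking sampled responses with a reward model.",
]
ranked = rank_responses(prompt, candidates)
```

The point the answer makes is visible here: whichever candidate the (proxy) reward model scores highest ends up on top, whether it was written by a human or generated by the policy.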
Thanks for the information.
I think RRHF shares many similarities with the concept of imitation learning: