GanjinZero / RRHF

[NIPS2023] RRHF & Wombat

[MiniDiscussion] RRHF is similar to imitation learning #4

Closed · mickelliu closed this issue 1 year ago

mickelliu commented 1 year ago

I think RRHF possesses many similarities to the concept of imitation learning:

  1. $L_{rank}$: find the highest-ranked policy or "player" and compute the distance to it, so it is effectively the distance to the "expert".
  2. $L_{ft}$: a cross-entropy loss measuring how likely the model is to generate the expert's response (a rough sketch of both terms follows below).
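
To make my reading concrete, here is a rough PyTorch sketch of the two terms as I understand them from the paper. The function name, the argument layout, and the assumption that per-response log-probabilities are already computed are mine, not from this repo:

```python
import torch

def rrhf_loss(seq_logprob_sums, seq_lengths, rewards):
    """Rough sketch of the RRHF objective for one query with k candidate responses.

    seq_logprob_sums: (k,) summed token log-probs of each response under the policy.
    seq_lengths:      (k,) token counts of each response, for length normalization.
    rewards:          (k,) scalar reward-model scores, used only to rank responses.
    """
    # Length-normalized conditional log-probability p_i of each response.
    p = seq_logprob_sums / seq_lengths

    # L_rank: for every pair where response i is ranked below response j,
    # penalize the policy if it gives i a higher normalized likelihood than j.
    diff = p.unsqueeze(1) - p.unsqueeze(0)                      # diff[i, j] = p_i - p_j
    lower_ranked = rewards.unsqueeze(1) < rewards.unsqueeze(0)  # r_i < r_j
    l_rank = torch.clamp(diff, min=0)[lower_ranked].sum()

    # L_ft: cross-entropy (negative log-likelihood) on the top-ranked response,
    # i.e. the "expert" in the imitation-learning reading above.
    best = torch.argmax(rewards)
    l_ft = -seq_logprob_sums[best]

    return l_rank + l_ft
```
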
Yuanhy1997 commented 1 year ago

In my opinion, imitation learning resembles conventional supervised fine-tuning (SFT) more closely, since human-generated demonstrations fit the term "expert" better than model-generated responses do. In imitation learning the performance "upper bound" is set by the "expert" samples, while in RRHF the "upper bound" is set by the reward model, so RRHF can help the policy outperform the "expert" data. I partly agree that RRHF and imitation learning have similarities, but I wouldn't refer to the top-ranked responses as the "expert", especially at the early stage of tuning.

Yuanhy1997 commented 1 year ago

Sorry, I kind of forgot my manners. We really appreciate the discussion and your comments.

mickelliu commented 1 year ago

Thanks for following up on the discussion, @Yuanhy1997. Apologies if I started this discussion all of a sudden; perhaps I should have at least opened with a greeting! There is no need to be defensive, as I am not your paper's reviewer, just here to learn :)

tbh I'm unfamiliar with all this stuff, so I'm not sure what the reward model consists of. I have a couple of questions, if you don't mind:

  1. How is this "average reward score" calculated? Does the Dahoas/gptj-rm-static checkpoint return a scalar value? I have never worked with this model.
  2. Is the human feedback always assigned the max score, as I interpreted from Figure 1?

Yuanhy1997 commented 1 year ago

For the first question, you are right: the reward model (including the Dahoas/gptj-rm-static checkpoint we used) returns a scalar value for each response, and we rank the responses according to that scalar.
For the second question, I think your "human feedback" refers to the human-written responses, and they are not always assigned the max reward score. The reasons, in my opinion, are: first, the reward model is not a gold standard, so it may not always pick out the best response; second, the policy model is capable of generating responses that are better than the human annotations. Also, in case there is a misunderstanding, our "human feedback" refers to the reward model, which is a proxy for human feedback, since involving real humans is really expensive.
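
To make the first point concrete, here is a rough sketch of the scoring and ranking step. It treats the reward model generically as a sequence-classification head that outputs a single scalar logit; the actual Dahoas/gptj-rm-static checkpoint may need its own loading/wrapper code, so take this as an illustration rather than our exact pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative only: treat the reward model as a 1-label sequence classifier.
# The real checkpoint may require a custom model class to load correctly.
rm_name = "Dahoas/gptj-rm-static"
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)

def rank_responses(query, responses):
    """Score each (query, response) pair with a scalar reward and rank by it."""
    scores = []
    for resp in responses:
        inputs = tokenizer(query + resp, return_tensors="pt", truncation=True)
        with torch.no_grad():
            score = reward_model(**inputs).logits.squeeze().item()  # one scalar per pair
        scores.append(score)
    # Higher scalar reward means higher rank.
    order = sorted(range(len(responses)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

The responses are then ordered by these scalar scores before computing $L_{rank}$.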

mickelliu commented 1 year ago

Thanks for the information.