Open allanj opened 10 months ago
I'm really confused: when you run PPO without SFT, for example on NarrativeQA, how does the (quite small) model know it should generate an answer?
For example, suppose we ask the model to generate a program rather than a simple continuation. If we do not fine-tune the model first, I believe RL does not even know what it should generate.
Do you have more thoughts on this?