Open allanj opened 10 months ago
I'm really confused: when you run PPO without SFT, for example on NarrativeQA, how does the (quite small) model know it should generate an answer?
For example, suppose we ask the model to generate a program rather than a simple continuation. If we do not fine-tune the model first, I believe RL does not even know what it should generate.
Do you have more thoughts on this?