Closed eric-mitchell closed 1 year ago
Hi! You're right: to finetune only on responses, one has to pass a tuple of (prompt, output) instead of a single string, as is done in this script. However, the base model https://huggingface.co/Dahoas/pythia-6B-static-sft used for https://huggingface.co/reciprocate/ppo_hh_pythia-6B was also trained with a masked loss https://github.com/Dahoas/reward-modeling/blob/main/configs/base_configs/gptneox.yaml, so that's perhaps not the main reason the model might perform worse than your baseline. Also, have you found empirically that finetuning only on chosen responses performs better under GPT-4 eval than finetuning on whole samples? We could change the training code here if that's the case.
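For reference, the usual way to "finetune only on responses" with a Hugging Face-style causal-LM loss is to concatenate prompt and response but mask the prompt positions in the labels with -100, which the loss ignores. This is a minimal sketch, not trlX's actual collator; the token ids and helper name are made up for illustration:

```python
# Sketch of response-only label masking (assumption: HF-style trainers
# skip label positions equal to -100 when computing cross-entropy).
IGNORE_INDEX = -100

def build_labels(prompt_ids, response_ids):
    """Labels for a (prompt, response) pair: prompt positions are masked,
    so the loss maximizes log p(response | prompt) only."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy token ids standing in for a tokenized dialogue + chosen response.
prompt = [101, 7592, 2129]     # "Human: ... Assistant:"
response = [2003, 1037, 102]   # the chosen assistant reply

input_ids, labels = build_labels(prompt, response)
assert len(input_ids) == len(labels)
assert labels[:len(prompt)] == [IGNORE_INDEX] * len(prompt)
assert labels[len(prompt):] == response
```

Passing a single concatenated string instead would leave every position unmasked, which is the behavior being discussed in this issue.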
It's also worth pointing out that in their paper Anthropic claims fine-tuning on both prompt and response is about as good as fine-tuning on the response alone for this dataset. Additionally, the dialogue trees for the helpful dataset are constructed by continuing from the preferred response, so I don't think any fine-tuning is being done on rejected responses.
🐛 Describe the bug
Please correct me if I'm wrong, but it looks like SFT for Anthropic simply maximizes log p(x) on the entire dialogue history, rather than only maximizing log p(y|x), where x is the dialogue history and y is the final assistant response.
See here, where the concatenation of prompt and response is passed to the trainer.
Since the "chosen" label is only meaningful for the final assistant response, if my interpretation is correct, SFT is fine-tuning mostly on unvetted (and possibly bad) examples. I observed this after evaluating the pre-trained PPO model
model = transformers.AutoModelForCausalLM.from_pretrained('reciprocate/ppo_hh_pythia-6B')
using GPT-4 as the proxy human, and found its win rate to be worse than a simple baseline that fine-tunes only on the chosen response (not the whole history).
Which trlX version are you using?
No response
Additional system and package information
No response