I'm not totally sure, but here are a few thoughts off the top of my head:
You could replace the dataset-loading code in `preference_datasets.py` with the following snippet to load just the helpful data. Read more about the splits here.
h1 = datasets.load_dataset('Anthropic/hh-rlhf', data_dir='helpful-base', split=split, cache_dir=cache_dir)
h2 = datasets.load_dataset('Anthropic/hh-rlhf', data_dir='helpful-online', split=split, cache_dir=cache_dir)
h3 = datasets.load_dataset('Anthropic/hh-rlhf', data_dir='helpful-rejection-sampled', split=split, cache_dir=cache_dir)
dataset = datasets.concatenate_datasets([h1, h2, h3])
Another thing to check is `loss.beta`. What are you using currently? If `beta` is too low, the generations can start to become much lower quality (see the sketch at the end of this comment for how `beta` enters the DPO loss).

Ideally, the preference data you use for DPO should be from the same data distribution as the SFT data. If you did SFT on your own dataset (which might be very different from the Anthropic data) and are then doing DPO on the Anthropic data, the results can be unpredictable. The best case would be to take your SFT model, generate response pairs for a dataset of prompts, collect new preference labels on that data, and then run DPO on that. Then you should definitely improve on the SFT model.
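For context on the `beta` point, here is a minimal sketch of the DPO objective (not this repo's exact implementation; the function and tensor names are illustrative). `beta` scales the log-probability ratios between the policy and the frozen reference model, so a small `beta` weakens the pull toward the reference/SFT policy and lets generations drift further from it.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """Minimal DPO loss sketch. Each input is a (batch,) tensor of summed
    log-probabilities of the chosen/rejected responses under the policy or
    the frozen reference (SFT) model."""
    # Implicit rewards: beta-scaled log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # -log sigmoid of the reward margin: pushes chosen above rejected.
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return losses.mean(), chosen_rewards.detach(), rejected_rewards.detach()
```

As a rule of thumb, a larger `beta` keeps the model closer to the SFT/reference policy, while a smaller `beta` allows it to deviate more in exchange for fitting the preferences.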
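And for the last suggestion, here is a rough, hypothetical sketch of collecting on-policy response pairs from your SFT model with Hugging Face transformers (the model path, sampling settings, and helper name are placeholders, not part of this repo); you would then collect preference labels over each pair and run DPO on that data.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_model_path = "path/to/your-sft-model"  # placeholder for your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
model = AutoModelForCausalLM.from_pretrained(sft_model_path, device_map="auto")

def sample_response_pair(prompt, max_new_tokens=256):
    """Sample two candidate responses for one prompt; the chosen/rejected
    labels are collected afterwards (by humans or a reward model)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        top_p=0.95,
        num_return_sequences=2,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
```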
@Gridnn just checking in: any progress on this issue?
@Gridnn I will close this issue for now, but please feel free to follow up/re-open if you have any additional info or questions.
I trained my custom model with DPO on the HH dataset. Since the HH dataset covers both helpfulness and harmlessness, it contains many responses that refuse or deflect the question, such as "I'm not sure.", "Do you mean ......", and "I'm sorry, I'm unable to help with that.".
It's fine to answer illegal questions with these responses, but when I tried some examples, the model used them to answer all of the questions.
I give two examples here, where "old answer" is the response from the custom model without DPO.
Could you give me some advice on how to avoid this problem? Thanks a lot.