eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)
Apache License 2.0

Question about average_log_prob #48

Open LSX-Sneakerprogrammer opened 1 year ago

LSX-Sneakerprogrammer commented 1 year ago

Hi, I see there is a bool argument in _get_batch_logps of trainers.py that controls whether to return the average log probability or not, and I have two questions.

  1. Did you do experiments on this to see which one performs better?
  2. If I choose to use the average log probability, I assume the pad_to_length function needs to be turned off. Is that right?

Hope you can help me with these questions, thanks a lot!

alex-ht commented 1 year ago
> 2. If I choose to use the average log probability, I assume the pad_to_length function needs to be turned off. Is that right?

No. Padded tokens are not counted. See here: https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py#L96C5-L104
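
For reference, the relevant logic looks roughly like this (a condensed sketch of _get_batch_logps, paraphrased rather than copied verbatim from the source):

```python
import torch

def _get_batch_logps(logits, labels, average_log_prob=False):
    """Sketch: per-sequence log-probs with padded positions masked out."""
    # Shift so that tokens < t predict token t; positions labeled -100 are padding.
    labels = labels[:, 1:].clone()
    logits = logits[:, :-1, :]
    loss_mask = labels != -100

    # Dummy index for padded positions; they are zeroed out by the mask below.
    labels[labels == -100] = 0
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)

    if average_log_prob:
        # Mean over non-padded tokens only.
        return (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
    else:
        # Sum over non-padded tokens.
        return (per_token_logps * loss_mask).sum(-1)
```

Either way, the loss mask already excludes padding, so pad_to_length does not need to be turned off.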

alex-ht commented 1 year ago
> 1. Did you do experiments on this to see which one performs better?

~~I have already tried specifying average_log_prob=True, but the beta value needs adjustment. For example, if the summed log probabilities are divided by an average length of roughly 100 tokens, then beta needs to be increased by about 100x.~~

average_log_prob=False is better
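
For intuition on the (struck-through) beta remark, here is a toy illustration with made-up numbers (a hypothetical 5-nat summed margin over roughly 100 tokens); it is not code from the repo:

```python
import torch
import torch.nn.functional as F

# Hypothetical numbers for illustration only.
T = 100                                   # rough average response length
summed_margin = torch.tensor(5.0)         # summed per-token log-ratio margin
averaged_margin = summed_margin / T       # same preference, averaged per token

# The DPO loss is -log sigmoid(beta * margin), so shrinking the margin by ~T
# (by averaging) requires scaling beta up by ~T to keep the loss comparable.
loss_summed   = -F.logsigmoid(0.1 * summed_margin)          # beta = 0.1
loss_averaged = -F.logsigmoid((0.1 * T) * averaged_margin)  # beta = 10
print(loss_summed.item(), loss_averaged.item())  # equal by construction
```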

LSX-Sneakerprogrammer commented 11 months ago
> > 1. Did you do experiments on this to see which one performs better?
>
> ~~I have already tried specifying average_log_prob=True, but the beta value needs adjustment. For example, if the summed log probabilities are divided by an average length of roughly 100 tokens, then beta needs to be increased by about 100x.~~
>
> average_log_prob=False is better

Thanks for your reply! I tried average_log_prob=False, and it seems the model is more likely to generate long responses compared to the original. I wanted to avoid this problem, so I tried average_log_prob=True, but after training the model tends to generate repeated words. Have you met this problem, and do you know how to solve it? Thanks a lot!

dblakely commented 10 months ago

I ran into the same issue a few months ago and didn't have any success with average_log_prob=True -- the model's generations became very degenerate. Ultimately I left average_log_prob=False and had to add some extra tricks to keep DPO from teaching the model to write very long responses.

longbowzhang commented 10 months ago

Hi all, could somebody please explain why average_log_prob=False makes the model generate longer responses? Any hints/clarifications are appreciated.

yata0 commented 8 months ago

> Hi all, could somebody please explain why average_log_prob=False makes the model generate longer responses? Any hints/clarifications are appreciated.

I've noticed that the model tends to generate longer responses as training progresses. I suspect that setting average_log_prob=True might slow down this process compared to when it's set to False. @longbowzhang, have you not encountered this issue when you've set it to True?

yata0 commented 8 months ago

> I ran into the same issue a few months ago and didn't have any success with average_log_prob=True -- the model's generations became very degenerate. Ultimately I left average_log_prob=False and had to add some extra tricks to keep DPO from teaching the model to write very long responses.

@dblakely Could you please share the "extra tricks"?

dblakely commented 8 months ago

Hey @yata0, the author mentioned some ideas here and I tried each of those 4 suggestions. All of them helped to some extent. To "normalize" the data lengths, I simply dropped a number of the longest positive examples from my dataset to bring the length distributions of the positives and negatives closer together (a big part of the problem in my case was simply that the positive examples in my dataset were, on average, a fair amount longer than the negatives, and DPO was over-optimizing for that trait).
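
A simplified sketch of that kind of length filtering (hypothetical dataset schema, tokenizer, and threshold; not the actual code used above):

```python
import numpy as np

def filter_long_chosen(dataset, tokenizer, quantile=0.9):
    """Drop preference pairs whose 'chosen' response is much longer than
    typical 'rejected' responses.

    Assumes each example is a dict with 'chosen' and 'rejected' response
    strings and that the tokenizer has an HF-style .encode() method.
    """
    chosen_lens = np.array([len(tokenizer.encode(ex["chosen"])) for ex in dataset])
    rejected_lens = np.array([len(tokenizer.encode(ex["rejected"])) for ex in dataset])

    # Cap chosen length at, e.g., the 90th percentile of rejected lengths,
    # which pulls the two length distributions closer together.
    max_len = np.quantile(rejected_lens, quantile)
    return [ex for ex, length in zip(dataset, chosen_lens) if length <= max_len]
```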

yata0 commented 8 months ago

> Hey @yata0, the author mentioned some ideas here and I tried each of those 4 suggestions. All of them helped to some extent. To "normalize" the data lengths, I simply dropped a number of the longest positive examples from my dataset to bring the length distributions of the positives and negatives closer together (a big part of the problem in my case was simply that the positive examples in my dataset were, on average, a fair amount longer than the negatives, and DPO was over-optimizing for that trait).

Thanks!