RifleZhang / LLaVA-Hound-DPO


Why add SFT loss during DPO training? #7

Closed sty-yyj closed 5 months ago

sty-yyj commented 5 months ago

Thanks for sharing this great work!

I'm curious why an SFT loss is added when computing the loss in the DPO trainer, since this term does not exist in TRL's implementation?

RifleZhang commented 5 months ago

Thanks for the comment!

We monitor the SFT loss during training just to ensure the generation capability doesn't degrade.

In the next line, self.gamma is set to zero in our experiments, so no SFT gradient is used to update the model parameters. Technically, you could also combine the two losses with a nonzero weight, but we found it was not useful in our experiments.
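
For context, here is a minimal sketch of what such a combined objective typically looks like. The function and variable names (`dpo_with_sft_loss`, `gamma`, `policy_chosen_logps`, etc.) are illustrative assumptions, not the exact identifiers used in this repo:

```python
import torch
import torch.nn.functional as F

def dpo_with_sft_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen) per example
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected) per example
    ref_chosen_logps: torch.Tensor,       # log p_ref(chosen) per example
    ref_rejected_logps: torch.Tensor,     # log p_ref(rejected) per example
    beta: float = 0.1,
    gamma: float = 0.0,  # 0.0 -> SFT term is only logged, contributes no gradient
):
    # Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    dpo_loss = -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

    # SFT (negative log-likelihood) loss on the chosen responses
    sft_loss = -policy_chosen_logps.mean()

    # With gamma = 0, the SFT term adds nothing to the gradient,
    # but it can still be logged to watch for generation-quality degradation.
    total_loss = dpo_loss + gamma * sft_loss
    return total_loss, dpo_loss.detach(), sft_loss.detach()
```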