Closed sty-yyj closed 5 months ago
Thanks for the comment!
We monitor the SFT loss during training just to ensure the generation capability doesn't degrade.
On the next line, self.gamma is set to zero in our experiments, so no SFT gradient is used to update the model parameters. Technically, you could also combine the two losses, but we found it was not useful in our experiments.
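To make the setup concrete, here is a minimal scalar sketch (hypothetical function and parameter names, not the repo's actual code) of a DPO loss with a gamma-weighted SFT term. With `gamma=0` the SFT term contributes nothing to the combined loss, so it can be logged for monitoring without affecting the update:

```python
import math

def dpo_sft_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp,
                 sft_logp, beta=0.1, gamma=0.0):
    """Toy per-example DPO loss plus an optional SFT (NLL) term.

    All *_logp arguments are summed log-probabilities of a response
    under the policy or the frozen reference model.
    """
    # DPO: beta-scaled margin of policy-vs-reference log-prob gaps.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    dpo_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

    # SFT term: negative log-likelihood of the chosen response.
    sft_loss = -sft_logp

    # gamma = 0 -> pure DPO objective; sft_loss is still returned
    # separately so it can be monitored during training.
    return dpo_loss + gamma * sft_loss, sft_loss
```

With `gamma=0` the combined loss equals the plain DPO loss, which matches the behavior described above: the SFT loss is tracked but does not drive parameter updates.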
Thanks for sharing this great work!
I'm just curious why an SFT loss is added when computing the loss in the DPO trainer, which doesn't exist in trl?