The first loss is not exactly the DPO loss

Hi, the loss described in the paper differs slightly from the one implemented in the code:

https://github.com/YiyangZhou/POVID/blob/5d55ce605230f5ad3889701a894a98ddca6e1534/tool/dpo_trainer.py#L616

I understand why you do this, but I'm wondering which loss you actually used to train the model, since many of the arguments in run_dpo.sh and run_povid.sh do not match the arguments used to train the published checkpoints. Does the published config/code differ in any major way from the config/code you used to train those checkpoints? We are preparing a survey paper on the different alignment methods used in VLMs, and we want to make sure our comparison is fair.
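For reference, by "DPO loss" I mean the standard objective from the DPO paper. Below is a minimal per-example sketch of that objective in plain Python. It is not the code from dpo_trainer.py; the function and argument names are hypothetical, and `beta=0.1` is only an illustrative default, not necessarily the value used for the published checkpoints.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value (assumption)
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the chosen and rejected log-ratios.
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

When the policy equals the reference, the margin is zero and the loss is log 2; any term added to or removed from this expression changes the objective being compared.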


the first loss is not exactly DPO loss #3