Thank you for your impressive work.
I can't reproduce the results of DPO-SD 1.5.
We train on 8 NVIDIA A100 GPUs with a local batch size of 1 pair per GPU and 256 gradient accumulation steps; all other experimental settings are the same as in the paper.
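For reference, this is how I work out the effective batch size from that setup (the numbers below just restate the configuration above):

```python
# Effective batch size per optimizer step for the setup described above.
num_gpus = 8            # NVIDIA A100 GPUs
pairs_per_gpu = 1       # local batch size (preference pairs per GPU)
grad_accum_steps = 256  # gradient accumulation steps

effective_batch_pairs = num_gpus * pairs_per_gpu * grad_accum_steps
print(effective_batch_pairs)  # -> 2048 preference pairs per optimizer update
```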
Here are some of the results I sampled during training.
There is also something strange about the loss curve during training. The whole training run took about nine hours.
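To make the loss discussion concrete, here is a minimal sketch of the pairwise DPO-style objective as I understand it from the paper; the function and argument names (`dpo_diffusion_loss`, `model_err_w`, etc.) are my own placeholders rather than the repository's code, and I have left out any timestep weighting:

```python
import torch
import torch.nn.functional as F

def dpo_diffusion_loss(model_err_w: torch.Tensor,
                       model_err_l: torch.Tensor,
                       ref_err_w: torch.Tensor,
                       ref_err_l: torch.Tensor,
                       beta: float) -> torch.Tensor:
    """Sketch of a pairwise DPO-style loss on per-sample denoising errors.

    model_err_* / ref_err_* are the per-sample MSEs between predicted and true
    noise for the preferred (w) and dispreferred (l) images; beta should match
    the paper's setting. Timestep weighting is omitted here.
    """
    model_diff = model_err_w - model_err_l   # trained model: preferred vs. dispreferred
    ref_diff = ref_err_w - ref_err_l         # frozen reference model: same comparison
    inside = -beta * (model_diff - ref_diff)
    # At initialization the trained model equals the reference, so inside == 0
    # and the loss should start near -log(sigmoid(0)) = log(2) ≈ 0.693.
    return -F.logsigmoid(inside).mean()
```

Based on this, I would expect the logged loss to start around log(2) ≈ 0.693, which is part of why the curve I see looks strange to me.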
Can you give me some advice?