AssertionError: The parameter 447 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

YiyangZhou / POVID

[Arxiv] Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

Apache License 2.0


AssertionError: The parameter 447 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported #6

Open zkysss11235 opened 6 months ago

zkysss11235 commented 6 months ago

When I run the first-stage DPO training, I get the error in the title during backpropagation. I use the POVID/scripts/run_dpo.sh script and changed only the model and dataset paths. The line that raises the error is self.accelerator.backward(loss). Can anyone help me with this? Many thanks.

YiyangZhou commented 6 months ago

Hello, could you please provide more detailed information? I suspect the current error might be due to library version issues.
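Since the suspicion is a library version mismatch, one way to gather the requested details is to print the installed versions of the relevant packages and paste the output into the thread. This is a minimal sketch. The package list is an assumption about what matters for this error, not something taken from the repository, so adjust it to match the project's requirements.

```python
from importlib import metadata

# Assumed list of packages worth reporting; edit to match the project's requirements.
PACKAGES = ["torch", "deepspeed", "accelerate", "transformers", "trl", "peft"]


def collect_versions(packages=PACKAGES):
    """Return a {package: version} dict, marking packages that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Print one "name: version" line per package, ready to paste into the issue.
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Run it inside the same environment used for training so the reported versions match the ones that produced the error.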

Swetha5 commented 5 months ago

Hello, thanks a lot for releasing the source code and dataset. I have the same issue in stage 1 training, using the library versions specified.

Changing from zero2 to zero3 fixed it.

Can you please confirm whether you use zero2, zero3, or zero3_offload for training?
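For anyone trying the zero2 to zero3 switch mentioned above: in a DeepSpeed JSON config, the ZeRO stage is set by the `stage` field under `zero_optimization`. The sketch below derives a stage 3 config from a stage 2 one. The example dictionary is illustrative and is not the repository's actual config file. A real ZeRO-3 config usually carries additional stage-3-specific settings, so prefer a ready-made zero3 config if the project ships one.

```python
import copy
import json


def to_zero3(zero2_config):
    """Return a copy of a DeepSpeed config dict with the ZeRO stage set to 3."""
    config = copy.deepcopy(zero2_config)  # leave the caller's dict untouched
    zero = config.setdefault("zero_optimization", {})
    zero["stage"] = 3
    return config


# Illustrative ZeRO-2 config (assumed values, not taken from the POVID repository).
example_zero2 = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

zero3 = to_zero3(example_zero2)
print(json.dumps(zero3, indent=2))
```

The resulting JSON can be saved to a file and passed to the training script wherever it currently expects the ZeRO-2 config path.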

AlanWangpku commented 2 months ago

Same issue

shipengai commented 2 months ago

Same issue