shipengai closed this issue 2 months ago.
The reason for this problem is that multiple forward passes are used when computing the DPO loss. The problem disappears if you use ZeRO-3 instead of ZeRO-2 for DeepSpeed training.
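For context, a minimal sketch of why ZeRO-2 trips on this, assuming a standard DPO setup; the helper and variable names here (sequence_logprob, policy_model, ref_model, chosen_batch, rejected_batch, beta) are illustrative and not taken from llava/train/train_dpo.py. The trainable policy model is run forward twice (once on chosen responses, once on rejected ones) before a single backward call, and with gradient checkpointing under ZeRO-2 that can re-trigger gradient reduction for the same parameter partition, which is exactly what the assertion in the traceback below complains about.

```python
# Illustrative sketch only, not the repo's actual implementation.
import torch
import torch.nn.functional as F

def sequence_logprob(model, batch):
    # Sum of per-token log-probabilities of the next-token labels.
    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    logp = torch.log_softmax(out.logits[:, :-1], dim=-1)
    labels = batch["input_ids"][:, 1:]
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum(-1)

def dpo_loss(policy_model, ref_model, chosen_batch, rejected_batch, beta=0.1):
    # Two forward passes through the *same* trainable policy model.
    policy_chosen = sequence_logprob(policy_model, chosen_batch)
    policy_rejected = sequence_logprob(policy_model, rejected_batch)

    # The reference model is frozen, so its passes build no autograd graph.
    with torch.no_grad():
        ref_chosen = sequence_logprob(ref_model, chosen_batch)
        ref_rejected = sequence_logprob(ref_model, rejected_batch)

    logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```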
Try setting --gradient_checkpointing False in run_dpo.sh; it works for me.
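If it helps to see where that flag lands: --gradient_checkpointing maps onto the Hugging Face TrainingArguments field of the same name. A minimal sketch with placeholder values (only gradient_checkpointing=False is the actual workaround; the paths and batch size are hypothetical, not the repo's settings):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints/dpo",      # placeholder output path
    per_device_train_batch_size=1,
    gradient_checkpointing=False,        # the workaround: disable activation checkpointing
    deepspeed="./scripts/zero2.json",    # hypothetical DeepSpeed config path
)
```

Note that disabling activation checkpointing keeps all activations in memory, so GPU memory use goes up (see the out-of-memory report below).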
I ran out of GPU memory when setting --gradient_checkpointing False (single-GPU training). Have you encountered this problem?
This needs at least 48 GB of GPU memory; I set ZeRO-2 and a batch size of 1.
Hello, when I run the code I get the following error:
File "llava/train/train_dpo.py", line 1041, in <module> train() File "llava/train/train_dpo.py", line 1019, in train trainer.train() File "/home//conda/envs/llava16/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in train return inner_training_loop( File "/home//conda/envs/llava16/lib/python3.10/site-packages/transformers/trainer.py", line 1835, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) File "/home//conda/envs/llava16/lib/python3.10/site-packages/transformers/trainer.py", line 2690, in training_step self.accelerator.backward(loss) File "/home//conda/envs/llava16/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward self.deepspeed_engine_wrapped.backward(loss, **kwargs) File "/home//conda/envs/llava16/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward self.engine.backward(loss, **kwargs) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2019, in backward self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward scaled_loss.backward(retain_graph=retain_graph) File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply return user_fn(self, *args) File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward torch.autograd.backward(outputs_with_grad, args_with_grad) File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 865, in reduce_partition_and_remove_grads self.reduce_ready_partitions_and_remove_grads(param, i) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1377, in reduce_ready_partitions_and_remove_grads self.reduce_independent_p_g_buckets_and_remove_grads(param, i) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 900, in reduce_independent_p_g_buckets_and_remove_grads assert self.params_already_reduced[param_id] == False, \ AssertionError: The parameter 447 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported