shipengai closed this issue 2 months ago.
The reason for this problem is that multiple forward passes are used when computing the DPO loss. The problem disappears if you use ZeRO-3 instead of ZeRO-2 for DeepSpeed training.
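For context, a minimal sketch of why ZeRO-2 trips on this, assuming a standard DPO setup; the helper and variable names here (sequence_logprob, policy_model, ref_model, chosen_batch, rejected_batch, beta) are illustrative and not taken from llava/train/train_dpo.py. The trainable policy model is run forward twice (once on chosen responses, once on rejected ones) before a single backward call, and with gradient checkpointing under ZeRO-2 that can re-trigger gradient reduction for the same parameter partition, which is exactly what the assertion in the traceback below complains about.

```python
# Illustrative sketch only, not the repo's actual implementation.
import torch
import torch.nn.functional as F

def sequence_logprob(model, batch):
    # Sum of per-token log-probabilities of the next-token labels.
    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    logp = torch.log_softmax(out.logits[:, :-1], dim=-1)
    labels = batch["input_ids"][:, 1:]
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum(-1)

def dpo_loss(policy_model, ref_model, chosen_batch, rejected_batch, beta=0.1):
    # Two forward passes through the *same* trainable policy model.
    policy_chosen = sequence_logprob(policy_model, chosen_batch)
    policy_rejected = sequence_logprob(policy_model, rejected_batch)

    # The reference model is frozen, so its passes build no autograd graph.
    with torch.no_grad():
        ref_chosen = sequence_logprob(ref_model, chosen_batch)
        ref_rejected = sequence_logprob(ref_model, rejected_batch)

    logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```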
Try setting --gradient_checkpointing False in run_dpo.sh; it works for me.
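If it helps to see where that flag lands: --gradient_checkpointing maps onto the Hugging Face TrainingArguments field of the same name. A minimal sketch with placeholder values (only gradient_checkpointing=False is the actual workaround; the paths and batch size are hypothetical, not the repo's settings):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints/dpo",      # placeholder output path
    per_device_train_batch_size=1,
    gradient_checkpointing=False,        # the workaround: disable activation checkpointing
    deepspeed="./scripts/zero2.json",    # hypothetical DeepSpeed config path
)
```

Note that disabling activation checkpointing keeps all activations in memory, so GPU memory use goes up (see the out-of-memory report below).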
I ran out of GPU memory when setting --gradient_checkpointing False (single-GPU training). Have you encountered this problem?
This needs at least 48 GB of GPU memory; I set ZeRO-2 and a batch size of 1.
Hello, when I run the code I get the following error:
File "llava/train/train_dpo.py", line 1041, in <module> train() File "llava/train/train_dpo.py", line 1019, in train trainer.train() File "/home//conda/envs/llava16/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in train return inner_training_loop( File "/home//conda/envs/llava16/lib/python3.10/site-packages/transformers/trainer.py", line 1835, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) File "/home//conda/envs/llava16/lib/python3.10/site-packages/transformers/trainer.py", line 2690, in training_step self.accelerator.backward(loss) File "/home//conda/envs/llava16/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward self.deepspeed_engine_wrapped.backward(loss, **kwargs) File "/home//conda/envs/llava16/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward self.engine.backward(loss, **kwargs) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(*args, **kwargs) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward self.optimizer.backward(loss, retain_graph=retain_graph) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2019, in backward self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward scaled_loss.backward(retain_graph=retain_graph) File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply return user_fn(self, *args) File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward torch.autograd.backward(outputs_with_grad, args_with_grad) File "/home//conda/envs/llava16/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 865, in reduce_partition_and_remove_grads self.reduce_ready_partitions_and_remove_grads(param, i) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1377, in reduce_ready_partitions_and_remove_grads self.reduce_independent_p_g_buckets_and_remove_grads(param, i) File "/home//conda/envs/llava16/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 900, in reduce_independent_p_g_buckets_and_remove_grads assert self.params_already_reduced[param_id] == False, \ AssertionError: The parameter 447 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported