LlamaFamily / Llama-Chinese

Llama Chinese community. The Llama3 online demo and fine-tuned models are now available, the latest Llama3 learning resources are collected in real time, and all code has been updated for Llama3. Building the best Chinese Llama LLM, fully open source and commercially usable.
https://llama.family

During fine-tuning: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn #242

Open Dagoli opened 1 year ago

Dagoli commented 1 year ago

[INFO|trainer.py:1712] 2023-10-19 09:44:55,247 >> ***** Running training *****
[INFO|trainer.py:1713] 2023-10-19 09:44:55,247 >>   Num examples = 9,861
[INFO|trainer.py:1714] 2023-10-19 09:44:55,247 >>   Num Epochs = 10
[INFO|trainer.py:1715] 2023-10-19 09:44:55,247 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1718] 2023-10-19 09:44:55,247 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1719] 2023-10-19 09:44:55,247 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:1720] 2023-10-19 09:44:55,247 >>   Total optimization steps = 1,540
[INFO|trainer.py:1721] 2023-10-19 09:44:55,252 >>   Number of trainable parameters = 19,988,480
  0%|          | 0/1540 [00:00<?, ?it/s]
[WARNING|logging.py:305] 2023-10-19 09:44:55,318 >> use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/usr/local/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
(the two warnings above are emitted once per process, eight times in total)

Traceback (most recent call last):
  File "/home/xxx/Llama2-Chinese/train/sft/finetune_clm_lora.py", line 690, in <module>
    main()
  File "/home/xxx/Llama2-Chinese/train/sft/finetune_clm_lora.py", line 651, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1979, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1895, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1902, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
(the same traceback is raised by each of the eight distributed processes)

  0%|          | 0/1540 [00:00<?, ?it/s]
[2023-10-19 09:44:57,212] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092520
[2023-10-19 09:44:57,257] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092521
[2023-10-19 09:44:57,413] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092522
[2023-10-19 09:44:57,590] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092523
[2023-10-19 09:44:57,632] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092524
[2023-10-19 09:44:57,650] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092525
[2023-10-19 09:44:57,650] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092526
[2023-10-19 09:44:57,668] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092527

abulice commented 1 year ago

Email received. (automatic reply)

Dagoli commented 1 year ago

@abulice Solved. Open /usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py in vim and, at line 1902 (just before the self.loss_scaler.backward(...) call shown in the traceback), add loss.requires_grad_().
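For anyone who would rather not edit the installed package by hand: going by the traceback above (line 1902 is the self.loss_scaler.backward(...) call inside DeepSpeedZeroOptimizer.backward), the same workaround can be applied as a small monkey-patch from the training script. This is only a sketch of the edit described in the previous comment, not an official fix; it simply forces the loss tensor to require gradients so that backward() no longer raises.

```python
# Sketch of the workaround above, applied as a monkey-patch instead of editing
# the installed stage_1_and_2.py. Class and method names are taken from the
# traceback; adjust if your DeepSpeed version differs.
from deepspeed.runtime.zero.stage_1_and_2 import DeepSpeedZeroOptimizer

_original_backward = DeepSpeedZeroOptimizer.backward

def _patched_backward(self, loss, retain_graph=False):
    if not loss.requires_grad:
        loss.requires_grad_()  # equivalent to the manual edit at line 1902
    return _original_backward(self, loss, retain_graph=retain_graph)

DeepSpeedZeroOptimizer.backward = _patched_backward
```

Note that this only makes the backward call succeed; if the loss is detached from the LoRA parameters because of gradient checkpointing, the root cause is better addressed by the suggestion in the next comment.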

forever22777 commented 11 months ago

try this https://github.com/huggingface/peft/issues/137#issuecomment-1445912413
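The linked PEFT issue covers the same error: with gradient checkpointing enabled and the base model frozen (as in LoRA fine-tuning), none of the checkpointed inputs require gradients, so the recomputed graph is detached and the loss ends up without a grad_fn. The usual remedy discussed there is to make the embedding outputs require gradients before training starts. A minimal sketch, assuming the base model is loaded with from_pretrained() in finetune_clm_lora.py before it is wrapped with PEFT and handed to the Trainer:

```python
# Sketch: make checkpointed activations participate in autograd when the base
# model is frozen (LoRA) and gradient_checkpointing is enabled. Call this on the
# base model right after from_pretrained(), before get_peft_model()/Trainer.
def ensure_inputs_require_grad(model):
    if hasattr(model, "enable_input_require_grads"):
        # Built-in helper on recent transformers PreTrainedModel versions.
        model.enable_input_require_grads()
    else:
        # Fallback for older versions: hook the input embeddings so their
        # output requires grad, which keeps the checkpointed graph attached.
        def make_inputs_require_grad(module, inputs, output):
            output.requires_grad_(True)

        model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
```

Both branches do the same thing; enable_input_require_grads() just registers an equivalent forward hook internally.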