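I ran the run_telechat_lora.sh script without modifying any other files.
Data processing completed without problems, but the following error was raised once fine-tuning started: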
UnboundLocalError: cannot access local variable 'dim' where it is not associated with a value
Running training
Beginning of Epoch 1/1, Total Micro Batches 1000
Traceback (most recent call last):
File "/root/XBY/Telechat/deepspeed-telechat/sft/main.py", line 405, in <module>
main()
File "/root/XBY/Telechat/deepspeed-telechat/sft/main.py", line 359, in main
model.backward(loss)
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2135, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/cuda/amp/autocast_mode.py", line 122, in decorate_bwd
return bwd(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py", line 97, in backward
if dim > 2:
^^^
UnboundLocalError: cannot access local variable 'dim' where it is not associated with a value
[2024-06-25 14:32:38,467] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 438681
[2024-06-25 14:32:38,468] [ERROR] [launch.py:321:sigkill_handler] ['/home/usr/anaconda3/envs/telechat/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'datas/data_files', '--model_name_or_path', '/root/XBY/telechat-7B', '--with_loss_mask', '--per_device_train_batch_size', '1', '--max_seq_len', '4096', '--learning_rate', '3e-5', '--weight_decay', '0.0001', '--num_train_epochs', '1', '--gradient_accumulation_steps', '4', '--lr_scheduler_type', 'cosine', '--precision', 'fp16', '--warmup_proportion', '0.1', '--gradient_checkpointing', '--seed', '42', '--zero_stage', '3', '--save_steps', '10', '--deepspeed', '--lora_dim', '8', '--mark_only_lora_as_trainable', '--lora_module_name', 'self_attention.', '--output_dir', 'telechat-lora-test'] exits with return code = 1
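For anyone reading the traceback: an `UnboundLocalError` at `if dim > 2:` usually means `dim` is assigned in one conditional branch and read in another. The sketch below reproduces that general pattern in plain Python. It is a hypothetical stand-in, not the actual code in `deepspeed/runtime/zero/linear.py`; the function name `backward_sketch` and its flags are assumptions made for illustration. Whether this is what happens in this run (for example because `--mark_only_lora_as_trainable` freezes some weights while a bias still requires a gradient) is a guess that would need to be checked against the installed DeepSpeed source.

```python
def backward_sketch(needs_weight_grad, needs_bias_grad, ndim):
    """Hypothetical stand-in for a backward function; NOT DeepSpeed code."""
    grad_weight = grad_bias = None
    if needs_weight_grad:
        dim = ndim  # 'dim' is only bound inside this branch
        grad_weight = "reshaped matmul" if dim > 2 else "plain matmul"
    if needs_bias_grad:
        # If the branch above was skipped, 'dim' was never assigned and
        # reading it here raises UnboundLocalError.
        if dim > 2:
            grad_bias = "sum over leading dims"
        else:
            grad_bias = "sum over dim 0"
    return grad_weight, grad_bias


def error_name(*args):
    """Return the exception class name raised by backward_sketch, or None."""
    try:
        backward_sketch(*args)
    except UnboundLocalError as exc:
        return type(exc).__name__
    return None


if __name__ == "__main__":
    print(backward_sketch(True, True, 3))   # both branches run, no error
    print(error_name(False, True, 3))       # weight branch skipped
```

If the installed code does follow this pattern, the usual remedy is to compute the value before either branch so both can read it; checking whether a newer DeepSpeed release already does so would be a reasonable first step.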