Tele-AI / Telechat

UnboundLocalError: cannot access local variable 'dim' where it is not associated with a value #47

Open Cathelloya opened 6 days ago

Cathelloya commented 6 days ago

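I ran the run_telechat_lora.sh script without modifying any other files. Data processing completed without problems, but once fine-tuning started, the following error was raised: UnboundLocalError: cannot access local variable 'dim' where it is not associated with a value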

Running training
Beginning of Epoch 1/1, Total Micro Batches 1000
Traceback (most recent call last):
  File "/root/XBY/Telechat/deepspeed-telechat/sft/main.py", line 405, in <module>
    main()
  File "/root/XBY/Telechat/deepspeed-telechat/sft/main.py", line 359, in main
    model.backward(loss)
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2135, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/torch/cuda/amp/autocast_mode.py", line 122, in decorate_bwd
    return bwd(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/anaconda3/envs/telechat/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py", line 97, in backward
    if dim > 2:
       ^^^
UnboundLocalError: cannot access local variable 'dim' where it is not associated with a value
[2024-06-25 14:32:38,467] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 438681
[2024-06-25 14:32:38,468] [ERROR] [launch.py:321:sigkill_handler] ['/home/usr/anaconda3/envs/telechat/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'datas/data_files', '--model_name_or_path', '/root/XBY/telechat-7B', '--with_loss_mask', '--per_device_train_batch_size', '1', '--max_seq_len', '4096', '--learning_rate', '3e-5', '--weight_decay', '0.0001', '--num_train_epochs', '1', '--gradient_accumulation_steps', '4', '--lr_scheduler_type', 'cosine', '--precision', 'fp16', '--warmup_proportion', '0.1', '--gradient_checkpointing', '--seed', '42', '--zero_stage', '3', '--save_steps', '10', '--deepspeed', '--lora_dim', '8', '--mark_only_lora_as_trainable', '--lora_module_name', 'self_attention.', '--output_dir', 'telechat-lora-test'] exits with return code = 1

Ricardo-Ping commented 3 days ago

I'm running into the same problem. Has it been solved?