PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
11.98k stars 2.92k forks

[Question]: uie-base model training error on an Ascend server #8354

Closed wangyu1984 closed 4 months ago

wangyu1984 commented 4 months ago

Please ask your question

The NPU build of the Paddle framework passed the installation check:

FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')

I0430 15:37:53.522773 32875 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60423
I0430 15:38:17.834956 32959 tcp_store.cc:293] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 8 npus.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

Code branch: develop
Docker image used to build the Paddle framework: registry.baidubce.com/device/paddle-npu:cann80T2-910B-ubuntu18-aarch64
npu-info:

+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.0                   Version: 23.0.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 94.4        39                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 91.6        37                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 92.3        38                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 92.6        39                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3315 / 65536         |

Model training error log:

Traceback (most recent call last):
  File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 262, in <module>
    main()
  File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 193, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 888, in train
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 1024, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 2544, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in distributed_concat
    output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in <listcomp>
    output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
  File "/opt/py39/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/opt/py39/lib/python3.9/site-packages/paddle/base/wrapped_decorator.py", line 26, in impl
    return wrapped_func(*args, **kwargs)
  File "/opt/py39/lib/python3.9/site-packages/paddle/utils/inplace_utils.py", line 45, in impl
    return func(*args, **kwargs)
  File "/opt/py39/lib/python3.9/site-packages/paddle/tensor/manipulation.py", line 4635, in reshape_
    out = _C_ops.reshape(x, shape)
OSError: (External) ACL error, the error code is : 100000. (at /work/PaddleCustomDevice/backends/npu/kernels/funcs/npu_op_runner.cc:223)
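For readers following the traceback, the frame that fails is the list comprehension in `distributed_concat`, which appears to turn rank-0 (scalar) tensors into one-element tensors before the gathered losses are averaged. The following is only a plain-Python stand-in for that logic, with lists in place of tensors. The names `promote_scalars` and `concat_gathered` and the sample loss values are hypothetical, and the sketch does not call Paddle or reproduce the ACL error.

```python
def promote_scalars(values):
    """Wrap scalar entries in a one-element list; leave list entries unchanged.

    Stand-in for: [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
    """
    return [v if isinstance(v, list) else [v] for v in values]


def concat_gathered(values):
    """Concatenate the promoted entries into one flat list."""
    flat = []
    for item in promote_scalars(values):
        flat.extend(item)
    return flat


# Hypothetical per-device scalar losses, one per NPU.
gathered = [0.52, 0.48, 0.50, 0.47]
merged = concat_gathered(gathered)
mean_loss = sum(merged) / len(merged)
print(merged, mean_loss)
```

On the real tensors it is the reshape step, not the averaging, that raises the `OSError` in the log above.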

wangyu1984 commented 4 months ago

Launch script:

/bin/bash

export finetuned_model=./checkpoint/model_best
nohup python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 21 \
    --per_device_train_batch_size 32 \
    --num_train_epochs 50 \
    --learning_rate 1e-2 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1 >nohup.out &
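The command selects devices in two places: `--gpus "0,1,2,3"` for the launcher and `--device gpu` for `finetune.py`. Since the report concerns Ascend NPUs, it may be worth confirming what the script actually requests. Below is a minimal sketch that pulls those two options out of a command string using only the standard library. The helper name `device_options` is hypothetical, the command is shortened for illustration, and nothing in this thread establishes that these values are related to the ACL error.

```python
import shlex


def device_options(command, flags=("--gpus", "--device")):
    """Return the value that follows each of the given flags in a shell command."""
    tokens = shlex.split(command)
    found = {}
    # Stop one token early so tokens[i + 1] always exists.
    for i, token in enumerate(tokens[:-1]):
        if token in flags:
            found[token] = tokens[i + 1]
    return found


# Shortened copy of the launch command from the script above.
cmd = (
    'python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py '
    "--device gpu --logging_steps 10"
)
opts = device_options(cmd)
print(opts)
```

Comparing the reported values against the hardware listed by `npu-smi` is a quick sanity check before digging into the kernel-level error.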