Closed wangyu1984 closed 4 months ago
Launch script:
export finetuned_model=./checkpoint/model_best
nohup python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 21 \
    --per_device_train_batch_size 32 \
    --num_train_epochs 50 \
    --learning_rate 1e-2 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1 >nohup.out &
Please ask your question
The NPU build of the Paddle framework passes the installation check:
FLAGS(name='FLAGS_allocator_strategy', current_value='naive_best_fit', default_value='auto_growth')
I0430 15:37:53.522773 32875 tcp_utils.cc:130] Successfully connected to 127.0.0.1:60423
I0430 15:38:17.834956 32959 tcp_store.cc:293] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 8 npus.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
Code branch: develop
Docker image used to build the Paddle framework: registry.baidubce.com/device/paddle-npu:cann80T2-910B-ubuntu18-aarch64
npu-info:
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.0                   Version: 23.0.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 94.4        39                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 91.6        37                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 92.3        38                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 92.6        39                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3315 / 65536         |
Model training error log:
Traceback (most recent call last):
File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 262, in <module>
main()
File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 193, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 888, in train
self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 1024, in _maybe_log_save_evaluate
tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 2544, in _nested_gather
tensors = distributed_concat(tensors)
File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in distributed_concat
output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in <listcomp>
output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
File "/opt/py39/lib/python3.9/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/opt/py39/lib/python3.9/site-packages/paddle/base/wrapped_decorator.py", line 26, in impl
return wrapped_func(*args, **kwargs)
File "/opt/py39/lib/python3.9/site-packages/paddle/utils/inplace_utils.py", line 45, in impl
return func(*args, **kwargs)
File "/opt/py39/lib/python3.9/site-packages/paddle/tensor/manipulation.py", line 4635, in reshape_
out = _C_ops.reshape_(x, shape)
OSError: (External) ACL error, the error code is : 100000. (at /work/PaddleCustomDevice/backends/npu/kernels/funcs/npu_op_runner.cc:223)
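
The traceback ends in an in-place reshape of a tensor to shape [-1] inside distributed_concat, so one way to narrow this down might be to run that single operation outside the trainer. The sketch below is only an assumption about how to isolate it: `try_inplace_reshape` is a hypothetical helper (not part of PaddleNLP), "npu:0" assumes the PaddleCustomDevice NPU plugin is installed, and it is not confirmed that this reproduces ACL error 100000.

```python
def try_inplace_reshape(device="npu:0"):
    """Run the in-place reshape from the traceback on a scalar tensor.

    Hypothetical helper for isolating the failing op; returns a status
    string instead of raising so the outcome is easy to paste into a report.
    """
    try:
        import paddle  # imported lazily so the script still runs without Paddle
    except ImportError:
        return "paddle not installed"
    try:
        paddle.set_device(device)   # assumption: "npu:0" is the first NPU
        t = paddle.to_tensor(1.0)   # scalar tensor, similar to the gathered loss
        t.reshape_([-1])            # same in-place call as in helper.py line 41
        return f"ok: shape={t.shape}"
    except Exception as exc:        # e.g. the OSError raised by the NPU backend
        return f"failed: {type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print(try_inplace_reshape("npu:0"))
```

If this call fails on the NPU but the same call with "cpu" succeeds, the problem would appear to lie in the NPU reshape kernel rather than in the training script.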