[Bug]: Error when training the uie-base model on Ascend NPU

wangyu1984 commented 6 months ago

Software environment

- paddle-custom-npu   0.0.0
- paddle2onnx         1.0.5
- paddlefsl           1.1.0
- paddlenlp           2.6.1
- paddlepaddle        0.0.0 (built from source on the develop branch; image: registry.baidubce.com/device/paddle-npu:cann80T2-910B-ubuntu18-aarch64)
(py39) λ user /work/PaddleNLP/model_zoo/uie {develop} npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.0                   Version: 23.0.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 94.4        39                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 91.6        37                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 92.3        38                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 92.6        39                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3315 / 65536

Duplicate issues

[X] I have searched the existing issues

Error description

Error log:
Traceback (most recent call last):
  File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 262, in <module>
    main()
  File "/work/PaddleNLP/model_zoo/uie/finetune.py", line 193, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 888, in train
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 1024, in _maybe_log_save_evaluate
    tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/trainer.py", line 2544, in _nested_gather
    tensors = distributed_concat(tensors)
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in distributed_concat
    output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
  File "/opt/py39/lib/python3.9/site-packages/paddlenlp/trainer/utils/helper.py", line 41, in <listcomp>
    output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in output_tensors]
  File "/opt/py39/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/opt/py39/lib/python3.9/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/opt/py39/lib/python3.9/site-packages/paddle/utils/inplace_utils.py", line 45, in __impl__
    return func(*args, **kwargs)
  File "/opt/py39/lib/python3.9/site-packages/paddle/tensor/manipulation.py", line 4635, in reshape_
    out = _C_ops.reshape_(x, shape)
OSError: (External)  ACL error, the error code is : 100000.  (at /work/PaddleCustomDevice/backends/npu/kernels/funcs/npu_op_runner.cc:223)
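
For readers following the traceback: the failing frame is the list comprehension in `distributed_concat` that promotes 0-D tensors to 1-D with the in-place `reshape_`. The sketch below only illustrates that promotion step with an out-of-place `reshape` instead. Whether this avoids the ACL error on NPU is an untested assumption, not a confirmed fix. `promote_scalars` and `StubTensor` are hypothetical names; the stub stands in for `paddle.Tensor` so the snippet runs without Paddle or NPU hardware.

```python
class StubTensor:
    """Hypothetical stand-in for paddle.Tensor: only `shape` and `reshape`."""

    def __init__(self, values, shape):
        self.values = list(values)
        self.shape = list(shape)

    def reshape(self, shape):
        # Out-of-place: returns a new object and leaves `self` untouched.
        # Only the [-1] (flatten) case is modelled here.
        if list(shape) != [-1]:
            raise NotImplementedError("stub only supports reshape([-1])")
        return StubTensor(self.values, [len(self.values)])


def promote_scalars(tensors):
    """Return tensors with every 0-D entry reshaped to 1-D, out of place."""
    return [t if len(t.shape) > 0 else t.reshape([-1]) for t in tensors]


loss = StubTensor([0.25], [])        # 0-D, like a scalar training loss
batch = StubTensor([1.0, 2.0], [2])  # already 1-D, passed through unchanged
promoted = promote_scalars([loss, batch])
print([t.shape for t in promoted])
```

With a real `paddle.Tensor`, the only difference from the line in the traceback would be calling `t.reshape([-1])` rather than `t.reshape_([-1])`.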

Steps to reproduce & code

Launch script: python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune.py --device gpu --logging_steps 10 --save_steps 100 --eval_steps 100 --seed 42 --model_name_or_path uie-base --output_dir $finetuned_model --train_path data/train.txt --dev_path data/dev.txt --max_seq_length 512 --per_device_eval_batch_size 21 --per_device_train_batch_size 32 --num_train_epochs 50 --learning_rate 1e-2 --label_names "start_positions" "end_positions" --do_train --do_eval --do_export --export_model_dir $finetuned_model --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 --load_best_model_at_end True --save_total_limit 1

w5688414 commented 6 months ago

Hello, we have limited manpower and do not have the hardware to reproduce this. Contributions from developers are welcome.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

PaddlePaddle / PaddleNLP