PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
12.1k stars 2.94k forks

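[Bug]: Multi-class text classification training in multi-process CPU mode keeps raising an error at evaluation time, after the model has been saved #8743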

Closed 1205469665 closed 1 month ago

1205469665 commented 4 months ago

Software environment

- paddlepaddle: 2.6.1
- paddlenlp:  2.8.1

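Duplicate issues

Error description

Running multi-class text classification training in multi-process CPU mode: once the model has been saved, the evaluation step keeps raising an error.

Steps to reproduce & code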

Run the multi-class text classification training. The training script is at: https://bgithub.xyz/PaddlePaddle/PaddleNLP/blob/release/2.8/applications/text_classification/multi_class/train.py

Command used:

python -m paddle.distributed.launch --nproc_per_node=8 --backend=gloo train.py \
    --do_train \
    --do_eval \
    --do_export \
    --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
    --output_dir checkpoint \
    --device cpu \
    --num_train_epochs 100 \
    --early_stopping True \
    --early_stopping_patience 5 \
    --learning_rate 3e-5 \
    --max_length 128 \
    --per_device_eval_batch_size 32 \
    --per_device_train_batch_size 32 \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --logging_steps 5 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 1

After the model has been trained, running the evaluation code keeps raising an error.

Executing output = trainer.predict(test_ds) keeps raising an error. The error log is as follows:

STDOUT: File "/home/pycharm_project/pycharm_project_464/text_classify/train/ernie/train.py", line 237, in <module>
STDOUT: main()
STDOUT: File "/home/pycharm_project/pycharm_project_464/text_classify/train/ernie/train.py", line 198, in main
STDOUT: output = trainer.predict(test_ds)
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 2865, in predict
STDOUT: output = eval_loop(
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 2738, in evaluation_loop
STDOUT: losses = self._nested_gather(paddle.tile(loss, repeat_times=[batch_size, 1]))
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 3027, in _nested_gather
STDOUT: tensors = distributed_concat(tensors)
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddlenlp/trainer/utils/helper.py", line 49, in distributed_concat
STDOUT: dist.all_gather(output_tensors, tensor)
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddle/distributed/communication/all_gather.py", line 68, in all_gather
STDOUT: return stream.all_gather(tensor_list, tensor, group, sync_op)
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddle/distributed/communication/stream/all_gather.py", line 180, in all_gather
STDOUT: return _all_gather_in_dygraph(
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddle/distributed/communication/stream/all_gather.py", line 55, in _all_gather_in_dygraph
STDOUT: task = group.process_group.all_gather(tensor_list, tensor, sync_op)
STDOUT: RuntimeError: [/paddle/third_party/gloo/gloo/transport/tcp/pair.cc:587] TIMEOUT self_rank = 0 pair_rank = 1 peer_str = [192.168.0.101]:19626
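The traceback ends in a Gloo TIMEOUT inside all_gather, which is a collective operation: every rank has to enter the call before any of them can return. The sketch below is not Paddle or Gloo code and is not a confirmed diagnosis of this issue. It uses threading.Barrier from the Python standard library as a stand-in to illustrate how a single rank that never reaches the collective shows up as a timeout on the ranks that did arrive. The function name run_collective and its parameters are hypothetical.

```python
import threading


def run_collective(world_size, missing_ranks, timeout_s=0.2):
    """Toy model of a collective call: returns {rank: status}.

    Ranks in `missing_ranks` never enter the rendezvous (hypothetical
    stand-in for a rank that is stuck or took a different code path).
    """
    barrier = threading.Barrier(world_size)
    results = {}
    lock = threading.Lock()

    def worker(rank):
        if rank in missing_ranks:
            status = "skipped"  # this rank never calls the collective
        else:
            try:
                # Every participating rank blocks here, like all_gather.
                barrier.wait(timeout=timeout_s)
                status = "ok"
            except threading.BrokenBarrierError:
                # Raised when the wait times out or the barrier is broken
                # by another waiter timing out.
                status = "timeout"
        with lock:
            results[rank] = status

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_collective(world_size=4, missing_ranks={1}))
    print(run_collective(world_size=4, missing_ranks=set()))
```

In this toy model the ranks that report the timeout are the ones that behaved correctly, so the rank named in the error message (self_rank = 0 in the log above) is not necessarily the one at fault; checking the per-rank worker logs for what the peer rank was doing at that time may be more informative.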

1205469665 commented 4 months ago

# Evaluate and test the model
if training_args.do_eval:
    if data_args.debug:
        # This is the call that raises the error
        output = trainer.predict(test_ds)
        log_metrics_debug(output, id2label, test_ds, data_args.bad_case_path)
    else:
        eval_metrics = trainer.evaluate()
        trainer.log_metrics("eval", eval_metrics)

1205469665 commented 4 months ago

The failing line is: output = trainer.predict(test_ds)

1205469665 commented 4 months ago

Run with this command:

python -m paddle.distributed.launch --nproc_per_node=4 \
    --backend=gloo \
    train.py \
    --do_train \
    --do_eval \
    --debug \
    --do_export \
    --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
    --output_dir checkpoint \
    --device cpu \
    --num_train_epochs 3 \
    --early_stopping True \
    --early_stopping_patience 5 \
    --learning_rate 3e-5 \
    --max_length 128 \
    --per_device_eval_batch_size 32 \
    --per_device_train_batch_size 32 \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --logging_steps 5 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 1
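One way to narrow the problem down, offered as an untested assumption rather than a confirmed fix, is to separate evaluation from distributed training: train with the launcher as above, then run the --do_eval --debug step in a single process, where no gather across ranks should be needed. The sketch below only assembles such a command from flags that already appear in this issue. The helper name single_process_eval_command is hypothetical, and pointing --model_name_or_path at the saved checkpoint directory is an assumption about how train.py loads weights.

```python
import shlex


def single_process_eval_command(checkpoint_dir="checkpoint"):
    """Build an argv list for a non-distributed evaluation run (hypothetical helper)."""
    return [
        "python", "train.py",  # no paddle.distributed.launch: one process only
        "--do_eval",
        "--debug",
        # Assumption: the directory written by --output_dir during training
        # can be loaded back through --model_name_or_path.
        "--model_name_or_path", checkpoint_dir,
        "--output_dir", checkpoint_dir,
        "--device", "cpu",
        "--max_length", "128",
        "--per_device_eval_batch_size", "32",
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before being run by hand.
    print(shlex.join(single_process_eval_command()))
```

If the single-process run completes, that would suggest the failure is tied to cross-rank communication over Gloo rather than to the model or the test data.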

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.