Closed 1205469665 closed 1 month ago
    # Evaluate and tests model
    if training_args.do_eval:
        if data_args.debug:
            output = trainer.predict(test_ds)  # <-- the call that fails
            log_metrics_debug(output, id2label, test_ds, data_args.bad_case_path)
        else:
            eval_metrics = trainer.evaluate()
            trainer.log_metrics("eval", eval_metrics)
The line in question is: output = trainer.predict(test_ds)

Run with this command:

    python -m paddle.distributed.launch --nproc_per_node=4 \
        --backend=gloo \
        train.py \
        --do_train \
        --do_eval \
        --debug \
        --do_export \
        --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
        --output_dir checkpoint \
        --device cpu \
        --num_train_epochs 3 \
        --early_stopping True \
        --early_stopping_patience 5 \
        --learning_rate 3e-5 \
        --max_length 128 \
        --per_device_eval_batch_size 32 \
        --per_device_train_batch_size 32 \
        --metric_for_best_model accuracy \
        --load_best_model_at_end \
        --logging_steps 5 \
        --evaluation_strategy epoch \
        --save_strategy epoch \
        --save_total_limit 1
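Two launch commands are quoted in this issue (this one, and the one in the reproduction steps), and it is easy to miss where they differ. The following is a minimal sketch, using only the standard library, that parses both commands and lists the flags whose values differ. The helper names `parse_flags` and `diff_flags` are hypothetical, and the command strings are copied from the issue with the line-continuation backslashes removed.

```python
import shlex


def parse_flags(command):
    """Map each --flag in a command line to its value (True for bare switches)."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            if "=" in token:
                # Form: --key=value
                key, value = token.split("=", 1)
                flags[key] = value
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                # Form: --key value
                flags[token] = tokens[i + 1]
                i += 1
            else:
                # Bare switch such as --do_train
                flags[token] = True
        i += 1
    return flags


def diff_flags(first, second):
    """Return {flag: (value_in_first, value_in_second)} for flags that differ."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k)) for k in keys if first.get(k) != second.get(k)}


# Flags that appear identically in both commands quoted in the issue.
SHARED = (
    "--do_export --model_name_or_path ernie-3.0-tiny-medium-v2-zh "
    "--output_dir checkpoint --device cpu --early_stopping True "
    "--early_stopping_patience 5 --learning_rate 3e-5 --max_length 128 "
    "--per_device_eval_batch_size 32 --per_device_train_batch_size 32 "
    "--metric_for_best_model accuracy --load_best_model_at_end "
    "--logging_steps 5 --evaluation_strategy epoch --save_strategy epoch "
    "--save_total_limit 1"
)

debug_run = (
    "python -m paddle.distributed.launch --nproc_per_node=4 --backend=gloo "
    "train.py --do_train --do_eval --debug --num_train_epochs 3 " + SHARED
)
issue_run = (
    "python -m paddle.distributed.launch --nproc_per_node=8 --backend=gloo "
    "train.py --do_train --do_eval --num_train_epochs 100 " + SHARED
)

differences = diff_flags(parse_flags(debug_run), parse_flags(issue_run))
for flag, (first, second) in differences.items():
    print(f"{flag}: {first!r} vs {second!r}")
```

The two commands differ only in `--nproc_per_node`, `--num_train_epochs`, and `--debug`. The last one presumably sets `data_args.debug`, which selects the `trainer.predict(test_ds)` branch in the snippet above.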
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Software environment
Duplicate issues
Error description
Steps to reproduce & code
Run the multi-class text classification training. The training code is at the following link: https://bgithub.xyz/PaddlePaddle/PaddleNLP/blob/release/2.8/applications/text_classification/multi_class/train.py

Run command:

    python -m paddle.distributed.launch --nproc_per_node=8 --backend=gloo train.py \
        --do_train \
        --do_eval \
        --do_export \
        --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
        --output_dir checkpoint \
        --device cpu \
        --num_train_epochs 100 \
        --early_stopping True \
        --early_stopping_patience 5 \
        --learning_rate 3e-5 \
        --max_length 128 \
        --per_device_eval_batch_size 32 \
        --per_device_train_batch_size 32 \
        --metric_for_best_model accuracy \
        --load_best_model_at_end \
        --logging_steps 5 \
        --evaluation_strategy epoch \
        --save_strategy epoch \
        --save_total_limit 1

After the model finishes training, running the evaluation code always fails with an error.
Executing the line output = trainer.predict(test_ds) always fails with an error. The error log is as follows:

STDOUT: File "/home/pycharm_project/pycharm_project_464/text_classify/train/ernie/train.py", line 237, in
STDOUT: main()
STDOUT: File "/home/pycharm_project/pycharm_project_464/text_classify/train/ernie/train.py", line 198, in main
STDOUT: output = trainer.predict(test_ds)
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 2865, in predict
STDOUT: output = eval_loop(
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 2738, in evaluation_loop
STDOUT: losses = self._nested_gather(paddle.tile(loss, repeat_times=[batch_size, 1]))
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 3027, in _nested_gather
STDOUT: tensors = distributed_concat(tensors)
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddlenlp/trainer/utils/helper.py", line 49, in distributed_concat
STDOUT: dist.all_gather(output_tensors, tensor)
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddle/distributed/communication/all_gather.py", line 68, in all_gather
STDOUT: return stream.all_gather(tensor_list, tensor, group, sync_op)
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddle/distributed/communication/stream/all_gather.py", line 180, in all_gather
STDOUT: return _all_gather_in_dygraph(
STDOUT: File "/root/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddle/distributed/communication/stream/all_gather.py", line 55, in _all_gather_in_dygraph
STDOUT: task = group.process_group.all_gather(tensor_list, tensor, sync_op)
STDOUT: RuntimeError: [/paddle/third_party/gloo/gloo/transport/tcp/pair.cc:587] TIMEOUT self_rank = 0 pair_rank = 1 peer_str = [192.168.0.101]:19626
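The last line of the traceback carries the details needed to see which peer the collective was waiting on. The following is a minimal sketch, using only the standard library, that extracts those fields from a log line of this shape. The function name `parse_gloo_timeout` is hypothetical, and the regular expression assumes the exact message layout shown above. Other gloo versions may format it differently.

```python
import re

# Pattern for the gloo TCP timeout message in the traceback above.
# The field names (self_rank, pair_rank, peer_str) are copied from the log line.
GLOO_TIMEOUT = re.compile(
    r"TIMEOUT self_rank = (?P<self_rank>\d+) "
    r"pair_rank = (?P<pair_rank>\d+) "
    r"peer_str = \[(?P<host>[^\]]+)\]:(?P<port>\d+)"
)


def parse_gloo_timeout(line):
    """Return the ranks and peer address from a gloo TIMEOUT line, or None."""
    match = GLOO_TIMEOUT.search(line)
    if match is None:
        return None
    return {
        "self_rank": int(match["self_rank"]),
        "pair_rank": int(match["pair_rank"]),
        "host": match["host"],
        "port": int(match["port"]),
    }


log_line = (
    "STDOUT: RuntimeError: [/paddle/third_party/gloo/gloo/transport/tcp/pair.cc:587] "
    "TIMEOUT self_rank = 0 pair_rank = 1 peer_str = [192.168.0.101]:19626"
)
info = parse_gloo_timeout(log_line)
print(info)
```

Read this way, the log suggests that rank 0 timed out while waiting on rank 1 at 192.168.0.101:19626 during the `all_gather` call. The log alone does not show why rank 1 was unresponsive.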