PaddlePaddle / Paddle

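PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle 『飞桨』 core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)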
http://www.paddlepaddle.org/
Apache License 2.0
22.05k stars 5.54k forks

[operator < read > error] occurs when training a model #27238

Closed lagelanhai closed 1 year ago

lagelanhai commented 4 years ago

2020-09-10 16:26:35,460-INFO: {'Global': {'debug': False, 'algorithm': 'CRNN', 'use_gpu': False, 'epoch_num': 1000, 'log_smooth_window': 20, 'print_batch_step': 10, 'save_model_dir': './output/rec_CRNN', 'save_epoch_step': 300, 'eval_batch_step': 500, 'train_batch_size_per_card': 256, 'test_batch_size_per_card': 256, 'image_shape': [3, 32, 100], 'max_text_length': 25, 'character_type': 'ch', 'use_space_char': True, 'loss_type': 'ctc', 'distort': True, 'character_dict_path': './ppocr/utils/ic15_dict.txt', 'reader_yml': './configs/rec/rec_icdar15_reader.yml', 'pretrain_weights': './pretrain_models/rec_mv3_none_bilstm_ctc/best_accuracy', 'checkpoints': None, 'save_inference_dir': None, 'infer_img': None}, 'Architecture': {'function': 'ppocr.modeling.architectures.rec_model,RecModel'}, 'Backbone': {'function': 'ppocr.modeling.backbones.rec_mobilenet_v3,MobileNetV3', 'scale': 0.5, 'model_name': 'large'}, 'Head': {'function': 'ppocr.modeling.heads.rec_ctc_head,CTCPredict', 'encoder_type': 'rnn', 'SeqRNN': {'hidden_size': 96}}, 'Loss': {'function': 'ppocr.modeling.losses.rec_ctc_loss,CTCLoss'}, 'Optimizer': {'function': 'ppocr.optimizer,AdamDecay', 'base_lr': 0.0005, 'beta1': 0.9, 'beta2': 0.999, 'decay': {'function': 'cosine_decay', 'step_each_epoch': 20, 'total_epoch': 1000}}, 'TrainReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'num_workers': 1, 'img_set_dir': './train_data/zhengTest', 'label_file_path': './train_data/zhengTest/rec_gt_train.txt'}, 'EvalReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'img_set_dir': './train_data/zhengTest', 'label_file_path': './train_data/zhengTest/rec_gt_test.txt'}, 'TestReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader'}}
2020-09-10 16:26:35,980-INFO: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000000] in Optimizer will not take effect, and it will only be applied to other Parameters!
2020-09-10 16:26:37,497-INFO: Distort operation can only support in GPU.Distort will be set to False.
2020-09-10 16:26:37,498-INFO: places would be ommited when DataLoader is not iterable
2020-09-10 16:26:37,498-INFO: Distort operation can only support in GPU.Distort will be set to False.
2020-09-10 16:26:37,728-INFO: Loading parameters from ./pretrain_models/rec_mv3_none_bilstm_ctc/best_accuracy...
2020-09-10 16:26:37,782-WARNING: variable ctc_fc_b_attr not used
2020-09-10 16:26:37,782-WARNING: variable ctc_fc_w_attr not used
2020-09-10 16:26:37,818-INFO: Finish initing model from ./pretrain_models/rec_mv3_none_bilstm_ctc/best_accuracy
!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list. CPU_NUM indicates that how many CPUPlace are used in the current task. And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.

export CPU_NUM=8 # for example, set CPU_NUM as number of physical CPU core which is 8.

!!! The default number of CPU_NUM=1.
W0910 16:26:37.854447 23655 build_strategy.cc:170] fusion_group is not enabled for Windows/MacOS now, and only effective when running with CUDA GPU.
Process Process-1:
2020-09-10 16:26:38,041-WARNING: Your reader has raised an exception!
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, self._kwargs)
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/reader/decorator.py", line 556, in _read_into_queue
    six.reraise(sys.exc_info())
  File "/usr/local/python3/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/reader/decorator.py", line 549, in _read_into_queue
    for sample in reader():
  File "/root/PaddleORC/PaddleOCR/ppocr/data/rec/dataset_traversal.py", line 324, in batch_iter_reader
    for outs in sample_iter_reader():
  File "/root/PaddleORC/PaddleOCR/ppocr/data/rec/dataset_traversal.py", line 286, in sample_iter_reader
    self.num_workers))
Exception: The number of the whole data (8) is smaller than the batch_size devices_num num_workers (256)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/python3/lib/python3.6/threading.py", line 864, in run
    self._target(self._args, self._kwargs)
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/reader.py", line 1145, in thread_main__
    six.reraise(*sys.exc_info())
  File "/usr/local/python3/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/reader.py", line 1125, in thread_main
    for tensors in self._tensor_reader():
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/reader.py", line 1195, in tensor_reader_impl
    for slots in paddle_reader():
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/data_feeder.py", line 506, in reader_creator__
    for item in reader():
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/reader/decorator.py", line 572, in queue_reader
    raise ValueError("multiprocess reader raises an exception")
ValueError: multiprocess reader raises an exception

/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train.py", line 123, in main()
  File "tools/train.py", line 100, in main
    program.train_eval_rec_run(config, exe, train_info_dict, eval_info_dict)
  File "/root/PaddleORC/PaddleOCR/tools/program.py", line 345, in train_eval_rec_run
    return_numpy=False)
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/python3/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
    return_merged=return_merged)
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const, int)
2 paddle::operators::reader::BlockingQueue<std::vector<paddle::framework::LoDTensor, std::allocator > >::Receive(std::vector<paddle::framework::LoDTensor, std::allocator >)
3 paddle::operators::reader::PyReader::ReadNext(std::vector<paddle::framework::LoDTensor, std::allocator >)
4 std::_Function_handler<std::unique_ptr<std::future_base::_Result_base, std::future_base::_Result_base::_Deleter> (), std::future_base::_Task_setter<std::unique_ptr<std::future_base::_Result, std::future_base::_Result_base::_Deleter>, unsigned long> >::_M_invoke(std::_Any_data const&)
5 std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
6 ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const


Python Call Stacks (More useful to users):

  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2610, in append_op
    attrs=kwargs.get("attrs", None))
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/reader.py", line 1080, in _init_non_iterable
    attrs={'drop_last': self._drop_last})
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/reader.py", line 978, in init
    self._init_non_iterable()
  File "/usr/local/python3/lib/python3.6/site-packages/paddle/fluid/reader.py", line 620, in from_generator
    iterable, return_list, drop_last)
  File "/root/PaddleORC/PaddleOCR/ppocr/modeling/architectures/rec_model.py", line 135, in create_feed
    iterable=False)
  File "/root/PaddleORC/PaddleOCR/ppocr/modeling/architectures/rec_model.py", line 185, in call
    image, labels, loader = self.create_feed(mode)
  File "/root/PaddleORC/PaddleOCR/tools/program.py", line 170, in build
    dataloader, outputs = model(mode=mode)
  File "tools/train.py", line 50, in main
    config, train_program, startup_program, mode='train')
  File "tools/train.py", line 123, in main()


Error Message Summary:

Error: Blocking queue is killed because the data reader raises an exception [Hint: Expected killed != true, but received killed:1 == true:1.] at (/paddle/paddle/fluid/operators/reader/blocking_queue.h:141) [operator < read > error]

I'm not sure what the problem is here.

yaoxuefeng6 commented 4 years ago

Which example code were you running when this error occurred? Could you provide code to reproduce it?

lagelanhai commented 4 years ago

The problem occurs when I train on my own data. With the official icdar2015 data there is no problem.

lagelanhai commented 4 years ago

I didn't run any code of my own; I just followed the steps in this document: https://github.com/PaddlePaddle/PaddleOCR/blob/develop/doc/doc_ch/recognition.md

yaoxuefeng6 commented 4 years ago

It looks like the generated input data is in the wrong format. Please first check that the generated data is correct.
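
One way to carry out that check is sketched below. It assumes, based on the recognition document linked earlier in the thread, that each line of the label file holds an image path and its transcription separated by a single tab; that format, the function name `check_label_lines`, and the optional `img_root` argument are all assumptions made for illustration and are not part of PaddleOCR.

```python
import os


def check_label_lines(lines, img_root=None):
    """Return a list of (line_number, problem) for lines that look malformed.

    Assumed format (not confirmed by this thread): "<image path>\t<label>".
    If img_root is given, also verify that each image file exists under it.
    """
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            problems.append((lineno, "expected exactly one tab between path and label"))
            continue
        path, label = parts
        if not path.strip() or not label:
            problems.append((lineno, "empty path or label"))
            continue
        if img_root is not None and not os.path.isfile(os.path.join(img_root, path)):
            problems.append((lineno, "image file not found"))
    return problems


if __name__ == "__main__":
    # Hypothetical sample lines; replace with the contents of your own label file.
    sample = ["img_1.jpg\thello\n", "img_2.jpg hello\n"]
    for lineno, problem in check_label_lines(sample):
        print(f"line {lineno}: {problem}")
```

To check a real file, the lines could be read with `open(path, encoding="utf-8").readlines()` and passed in together with the image directory from the config (`img_set_dir`).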

lagelanhai commented 4 years ago

These are the contents of the train and test gt.txt files, respectively: 1 2. They are consistent with what the documentation shows.

lagelanhai commented 4 years ago

This is the directory structure: image image

lagelanhai commented 4 years ago

This is the content of the config file: image image
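
For readers hitting the same error: the reader exception in the original log states that the whole dataset (8 samples) is smaller than the product of batch size, device count, and worker count (256). The logged config has `train_batch_size_per_card` set to 256 and `num_workers` set to 1, and the log reports the default CPU_NUM=1. The sketch below reproduces that arithmetic as a pre-flight check. The function names are hypothetical, and the exact comparison the reader performs is inferred from the wording of the exception message rather than from the PaddleOCR source.

```python
def required_samples(batch_size_per_card, devices_num=1, num_workers=1):
    """Smallest dataset size the reader appears to accept (inferred from the log)."""
    return batch_size_per_card * devices_num * num_workers


def dataset_is_large_enough(num_samples, batch_size_per_card, devices_num=1, num_workers=1):
    """True if the dataset has at least as many samples as one full step needs."""
    return num_samples >= required_samples(batch_size_per_card, devices_num, num_workers)


def largest_usable_batch_size(num_samples, devices_num=1, num_workers=1):
    """Largest per-card batch size that still passes the size check."""
    return num_samples // (devices_num * num_workers)


if __name__ == "__main__":
    # Values taken from the log in this issue: 8 samples, batch size 256,
    # one CPU place, one reader worker.
    print(required_samples(256, 1, 1))
    print(dataset_is_large_enough(8, 256, 1, 1))
    print(largest_usable_batch_size(8, 1, 1))
```

Under these assumptions, a dataset of 8 labeled images would need `train_batch_size_per_card` lowered to 8 or less, or more training samples added, before the reader can produce a batch.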