PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

PaddleOCR model training fails with: paddle.fluid.core_avx.EnforceNotMet: [unhandled cuda error] at (/paddle/paddle/fluid/platform/nccl_helper.h:113) #743

Closed chros425 closed 3 years ago

chros425 commented 4 years ago

Hardware and software configuration: CentOS 7.8, paddlepaddle-gpu=1.7.2.post107, python=3.6.4, cuda=10.0, cudnn=7.6.5, GPU=NVIDIA V100. The machine has two V100 cards with 16 GB of memory each.

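I ran the official icdar15 detection and recognition training code and got the error below. I also tried installing the latest paddlepaddle-gpu 1.8+ package and got the same error.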

[chh@hs-10-20-33-3 paddle_ocr]$ python3 tools/train.py -c configs/rec/rec_icdar15_train.yml
2020-09-17 15:39:13,231-INFO: {'Global': {'debug': False, 'algorithm': 'CRNN', 'use_gpu': True, 'epoch_num': 1000, 'log_smooth_window': 20, 'print_batch_step': 10, 'save_model_dir': './output/rec_CRNN', 'save_epoch_step': 300, 'eval_batch_step': 500, 'train_batch_size_per_card': 64, 'test_batch_size_per_card': 64, 'image_shape': [3, 32, 100], 'max_text_length': 25, 'character_type': 'en', 'loss_type': 'ctc', 'distort': True, 'reader_yml': './configs/rec/rec_icdar15_reader.yml', 'pretrain_weights': './pretrain_models/rec_mv3_none_bilstm_ctc/best_accuracy', 'checkpoints': None, 'save_inference_dir': None, 'infer_img': None}, 'Architecture': {'function': 'ppocr.modeling.architectures.rec_model,RecModel'}, 'Backbone': {'function': 'ppocr.modeling.backbones.rec_mobilenet_v3,MobileNetV3', 'scale': 0.5, 'model_name': 'large'}, 'Head': {'function': 'ppocr.modeling.heads.rec_ctc_head,CTCPredict', 'encoder_type': 'rnn', 'SeqRNN': {'hidden_size': 96}}, 'Loss': {'function': 'ppocr.modeling.losses.rec_ctc_loss,CTCLoss'}, 'Optimizer': {'function': 'ppocr.optimizer,AdamDecay', 'base_lr': 0.0005, 'beta1': 0.9, 'beta2': 0.999, 'decay': {'function': 'cosine_decay', 'step_each_epoch': 20, 'total_epoch': 1000}}, 'TrainReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'num_workers': 4, 'img_set_dir': './train_data/ic15_data', 'label_file_path': './train_data/ic15_data/rec_gt_train.txt'}, 'EvalReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'img_set_dir': './train_data/ic15_data', 'label_file_path': './train_data/ic15_data/rec_gt_test.txt'}, 'TestReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader'}}
2020-09-17 15:39:14,921-INFO: places would be ommited when DataLoader is not iterable
W0917 15:39:16.041615 181513 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.0, Runtime API Version: 10.0
W0917 15:39:16.046483 181513 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-09-17 15:39:18,023-INFO: Loading parameters from ./pretrain_models/rec_mv3_none_bilstm_ctc/best_accuracy...
2020-09-17 15:39:18,124-INFO: Finish initing model from ./pretrain_models/rec_mv3_none_bilstm_ctc/best_accuracy
I0917 15:39:18.158005 181513 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 2 cards are used, so 2 programs are executed in parallel.
/home/chh/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py:789: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train.py", line 123, in <module>
    main()
  File "tools/train.py", line 100, in main
    program.train_eval_rec_run(config, exe, train_info_dict, eval_info_dict)
  File "/home/chh/train/paddle_ocr/tools/program.py", line 345, in train_eval_rec_run
    return_numpy=False)
  File "/home/chh/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 790, in run
    six.reraise(*sys.exc_info())
  File "/home/chh/anaconda3/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/chh/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 785, in run
    use_program_cache=use_program_cache)
  File "/home/chh/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py", line 840, in _run_impl
    program._compile(scope, self.place)
  File "/home/chh/anaconda3/lib/python3.6/site-packages/paddle/fluid/compiler.py", line 434, in _compile
    places=self._places)
  File "/home/chh/anaconda3/lib/python3.6/site-packages/paddle/fluid/compiler.py", line 387, in _compile_data_parallel
    self._exec_strategy, self._build_strategy, self._graph)
paddle.fluid.core_avx.EnforceNotMet:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const>(char const&&, char const, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const, int)
2 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<paddle::platform::Place, std::allocator > const&, ncclUniqueId, unsigned long, unsigned long)
3 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector<paddle::platform::Place, std::allocator > const&, std::vector<ncclUniqueId, std::allocator<ncclUniqueId> > const&, unsigned long, unsigned long)
4 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope, paddle::framework::details::BuildStrategy const&)
5 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope, paddle::framework::details::BuildStrategy)
6 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<paddle::platform::Place, std::allocator > const&, std::vector<std::string, std::allocator > const&, std::string const&, paddle::framework::Scope, std::vector<paddle::framework::Scope, std::allocator<paddle::framework::Scope> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph)


Error Message Summary:

Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.

2020-09-17 15:39:22,632-INFO: Warning in ppocr/data/rec/img_tools.py: Wrong data type.Excepted string with length between 1 and 25, but got '0'. Label is '...'
2020-09-17 15:39:23,118-INFO: Warning in ppocr/data/rec/img_tools.py: Wrong data type.Excepted string with length between 1 and 25, but got '0'. Label is '...'
terminate called without an active exception
W0917 15:39:23.702219 181689 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0917 15:39:23.702270 181689 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0917 15:39:23.702281 181689 init.cc:214] The detail failure signal is:

W0917 15:39:23.702317 181689 init.cc:217] Aborted at 1600328363 (unix time) try "date -d @1600328363" if you are using GNU date
W0917 15:39:23.705432 181689 init.cc:217] PC: @ 0x0 (unknown)
W0917 15:39:23.705549 181689 init.cc:217] SIGABRT (@0x3e80002c509) received by PID 181513 (TID 0x7fd2a582f700) from PID 181513; stack trace:
W0917 15:39:23.708392 181689 init.cc:217] @ 0x7fd387ff0630 (unknown)
W0917 15:39:23.711290 181689 init.cc:217] @ 0x7fd387c49387 __GI_raise
W0917 15:39:23.714597 181689 init.cc:217] @ 0x7fd387c4aa78 __GI_abort
W0917 15:39:23.735095 181689 init.cc:217] @ 0x7fd379266b39 __gnu_cxx::__verbose_terminate_handler()
W0917 15:39:23.741262 181689 init.cc:217] @ 0x7fd3792651fb __cxxabiv1::__terminate()
W0917 15:39:23.745777 181689 init.cc:217] @ 0x7fd379265234 std::terminate()
W0917 15:39:23.748178 181689 init.cc:217] @ 0x7fd379264ef9 __gxx_personality_v0
W0917 15:39:23.761819 181689 init.cc:217] @ 0x7fd378fcb628 _Unwind_ForcedUnwind_Phase2
W0917 15:39:23.764176 181689 init.cc:217] @ 0x7fd378fcb8ed _Unwind_ForcedUnwind
W0917 15:39:23.766969 181689 init.cc:217] @ 0x7fd387fef362 __GI___pthread_unwind
W0917 15:39:23.769738 181689 init.cc:217] @ 0x7fd387fe9ef7 pthread_exit
W0917 15:39:23.770498 181689 init.cc:217] @ 0x555f280f9289 PyThread_exit_thread
W0917 15:39:23.770753 181689 init.cc:217] @ 0x555f27f8b47a PyEval_RestoreThread.cold.736
W0917 15:39:23.773448 181689 init.cc:217] @ 0x7fd26d934889 pybind11::gil_scoped_release::~gil_scoped_release()
W0917 15:39:23.773794 181689 init.cc:217] @ 0x7fd26d8e10e4 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL22pybind11_init_core_avxERNS_6moduleEEUlRNS2_9operators6reader22LoDTensorBlockingQueueERKSt6vectorINS2_9framework9LoDTensorESaISC_EEE60_bIS9_SG_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4FUNESY
W0917 15:39:23.775674 181689 init.cc:217] @ 0x7fd26d952c41 pybind11::cpp_function::dispatcher()
W0917 15:39:23.776223 181689 init.cc:217] @ 0x555f2803afd4 _PyCFunction_FastCallDict
W0917 15:39:23.776592 181689 init.cc:217] @ 0x555f280c8d3e call_function
W0917 15:39:23.777132 181689 init.cc:217] @ 0x555f280ed19a _PyEval_EvalFrameDefault
W0917 15:39:23.777647 181689 init.cc:217] @ 0x555f280c38c8 PyEval_EvalCodeEx
W0917 15:39:23.777999 181689 init.cc:217] @ 0x555f280c4456 function_call
W0917 15:39:23.778533 181689 init.cc:217] @ 0x555f2803adde PyObject_Call
W0917 15:39:23.779072 181689 init.cc:217] @ 0x555f280ee994 _PyEval_EvalFrameDefault
W0917 15:39:23.779418 181689 init.cc:217] @ 0x555f280c27db fast_function
W0917 15:39:23.779765 181689 init.cc:217] @ 0x555f280c8cc5 call_function
W0917 15:39:23.780292 181689 init.cc:217] @ 0x555f280ed19a _PyEval_EvalFrameDefault
W0917 15:39:23.780633 181689 init.cc:217] @ 0x555f280c27db fast_function
W0917 15:39:23.780995 181689 init.cc:217] @ 0x555f280c8cc5 call_function
W0917 15:39:23.781527 181689 init.cc:217] @ 0x555f280ed19a _PyEval_EvalFrameDefault
W0917 15:39:23.782022 181689 init.cc:217] @ 0x555f280c2e4b _PyFunction_FastCallDict
W0917 15:39:23.782505 181689 init.cc:217] @ 0x555f2803b39f _PyObject_FastCallDict
W0917 15:39:23.782984 181689 init.cc:217] @ 0x555f2803fff3 _PyObject_Call_Prepend
Aborted (core dumped)

Running fluid.install_check.run_check() also reported an error:

import paddle.fluid as fluid
fluid.install_check.run_check()
Running Verify Paddle Program ...
W0917 15:38:40.319144 181148 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.0, Runtime API Version: 10.0
W0917 15:38:40.323881 181148 device_context.cc:245] device: 0, cuDNN Version: 7.6.
Your Paddle works well on SINGLE GPU or CPU.
I0917 15:38:42.026564 181148 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 2 cards are used, so 2 programs are executed in parallel.
/home/chh/anaconda3/lib/python3.6/site-packages/paddle/fluid/executor.py:789: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
2020-09-17 15:38:44,634-WARNING: Your Paddle has some problem with multiple GPU. This may be caused by:

  1. There is only 1 or 0 GPU visible on your Device;
  2. No.1 or No.2 GPU or both of them are occupied now
  3. Wrong installation of NVIDIA-NCCL2, please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html

Original Error is:


C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const>(char const&&, char const, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const, int)
2 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<paddle::platform::Place, std::allocator > const&, ncclUniqueId, unsigned long, unsigned long)
3 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector<paddle::platform::Place, std::allocator > const&, std::vector<ncclUniqueId, std::allocator<ncclUniqueId> > const&, unsigned long, unsigned long)
4 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope, paddle::framework::details::BuildStrategy const&)
5 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope, paddle::framework::details::BuildStrategy)
6 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<paddle::platform::Place, std::allocator > const&, std::vector<std::string, std::allocator > const&, std::string const&, paddle::framework::Scope, std::vector<paddle::framework::Scope, std::allocator<paddle::framework::Scope> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph)


Error Message Summary:

Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.

Your Paddle is installed successfully ONLY for SINGLE GPU or CPU! Let's start deep Learning with Paddle now

Does this mean multi-GPU training cannot be used?
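Since run_check reports that Paddle works on a single GPU, one possible stopgap (not a fix for the multi-GPU failure) is to expose only one card to the training process through the CUDA_VISIBLE_DEVICES environment variable. The sketch below is only an illustration: the helper names single_gpu_env and train_command are hypothetical, and the config path is the one from the command in the log above.

```python
import os
import sys


def single_gpu_env(gpu_id=0, base_env=None):
    """Copy an environment and restrict CUDA to one device (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA only enumerates the devices listed in this variable.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def train_command(config="configs/rec/rec_icdar15_train.yml"):
    """Build the training command from the log above as an argument list."""
    return [sys.executable, "tools/train.py", "-c", config]


# Possible usage from the PaddleOCR checkout (not executed here):
#   import subprocess
#   subprocess.run(train_command(), env=single_gpu_env(0), check=True)
```

With only one visible device, the multi-card initialization that fails in the stack above should not be reached, at the cost of training on a single V100.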

littletomatodonkey commented 4 years ago

This looks like an NCCL installation problem. Running on multiple GPUs requires the NCCL library.
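A rough way to check whether an NCCL shared library is visible to the dynamic loader at all is to try loading it with ctypes and asking for its version. This is a minimal sketch under assumptions: the library names libnccl.so.2 and libnccl.so are guesses at common install names, nccl_version is a hypothetical helper, and a successful load here does not prove that Paddle picks up the same library or that multi-GPU communication works. The nccl-tests suite linked in the run_check output is the more thorough check.

```python
import ctypes
import ctypes.util


def nccl_version():
    """Return the NCCL version code as an int, or None if no usable library loads."""
    # Candidate names are assumptions; adjust to the local installation.
    candidates = [ctypes.util.find_library("nccl"), "libnccl.so.2", "libnccl.so"]
    for name in candidates:
        if not name:
            continue
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue  # not on the loader path, or not loadable
        version = ctypes.c_int(0)
        try:
            # ncclGetVersion(int*) returns 0 (ncclSuccess) on success.
            status = lib.ncclGetVersion(ctypes.byref(version))
        except AttributeError:
            continue  # symbol missing in this library
        if status == 0:
            return version.value
    return None


if __name__ == "__main__":
    print("NCCL version code:", nccl_version())
```

A result of None suggests that no NCCL library is on the loader path (for example, not in LD_LIBRARY_PATH), which would be consistent with the installation problem suggested above.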