PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Error when running OCR after building and installing paddle develop #20444

Closed kinghuin closed 5 years ago

kinghuin commented 5 years ago

In-house environment: CUDA 9, V100, CentOS 7.
Paddle version: the paddle develop branch with https://github.com/guoshengCS/Paddle/tree/fix-gru-doc pulled in, compiled inside nvidia-docker.
Model: https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/ocr_recognition, with no changes to the model code.
Error:


```
/usr/local/lib/python2.7/dist-packages/paddle/fluid/evaluator.py:72: Warning: The EditDistance is deprecated, because maintain a modified program inside evaluator cause bug easily, please use fluid.metrics.EditDistance instead.
  % (self.__class__.__name__, self.__class__.__name__), Warning)
finish batch shuffle
W1010 14:12:12.288261  6818 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 9.0
W1010 14:12:12.293643  6818 device_context.cc:243] device: 0, cuDNN Version: 7.4.
W1010 14:12:12.293686  6818 device_context.cc:269] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.5, but CUDNN version in your machine is 7.4, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
I1010 14:12:14.791474  6818 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I1010 14:12:14.804628  6818 build_strategy.cc:363] SeqOnlyAllReduceOps:0, num_trainers:1
I1010 14:12:14.816233  6818 parallel_executor.cc:285] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I1010 14:12:14.824354  6818 parallel_executor.cc:368] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py:774: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 245, in <module>
    main()
  File "train.py", line 241, in main
    train(args)
  File "train.py", line 172, in train
    results = train_one_batch(data)
  File "train.py", line 128, in train_one_batch
    feed=get_feeder_data(data, place))
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 311, in run
    return_numpy=return_numpy)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 775, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 770, in run
    use_program_cache=use_program_cache)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 829, in _run_impl
    return_numpy=return_numpy)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 669, in _run_parallel
    tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
```


C++ Call Stacks (More useful to developers):

```
0   std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > paddle::platform::GetTraceBackString<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, int)
2   paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
3   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::RuntimeContext*) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const
6   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&)
7   paddle::framework::details::ComputationOpHandle::RunImpl()
8   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue<unsigned long> > const&, unsigned long*)
10  std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
11  std::thread::_Impl<std::_Bind_simple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run()
```


Python Call Stacks (More useful to users):

File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/framework.py", line 2313, in append_op attrs=kwargs.get("attrs", None)) File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/layer_helper.py", line 43, in append_op return self.main_program.current_block().append_op(*args, **kwargs) File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/layers/nn.py", line 4180, in batch_norm "use_global_stats": use_global_stats File "/paddle/guowei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 49, in conv_bn_pool is_test=is_test) File "/paddle/guowei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 87, in ocr_convs use_cudnn=use_cudnn) File "/paddle/guowei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 126, in encoder_net use_cudnn=use_cudnn) File "/paddle/guowei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 200, in ctc_train_net use_cudnn=True if args.use_gpu else False) File "train.py", line 81, in train args, data_shape, num_classes) File "train.py", line 241, in main train(args) File "train.py", line 245, in main()


Error Message Summary:

PaddleCheckError: CUDNN_STATUS_BAD_PARAM at [/paddle/kuke/Paddle/paddle/fluid/operators/batch_norm_op.cu:174] [operator < batch_norm > error]

How should I resolve this?
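For reference, a minimal standalone batch_norm run like the sketch below (assuming the fluid 1.x static-graph API; the shapes are arbitrary) can show whether the op alone reproduces CUDNN_STATUS_BAD_PARAM on this machine, independent of the OCR network:

```python
import numpy as np
import paddle.fluid as fluid

# Standalone batch_norm forward pass on GPU, to see whether the op alone
# hits CUDNN_STATUS_BAD_PARAM outside the OCR model.
main_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    # NCHW feature map; fluid.layers.data adds the leading -1 batch dim
    x = fluid.layers.data(name='x', shape=[8, 32, 32], dtype='float32')
    y = fluid.layers.batch_norm(input=x, is_test=False)

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(startup_prog)
out, = exe.run(main_prog,
               feed={'x': np.random.rand(4, 8, 32, 32).astype('float32')},
               fetch_list=[y])
print(out.shape)
```

If this snippet runs cleanly, the failure is more likely related to how the patched branch was built than to cuDNN itself.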

kinghuin commented 5 years ago

The git log of the compiled paddle:

[screenshot: git log output of the compiled Paddle source]
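For reference, the same information can also be printed from Python inside the docker image (a sketch; the attribute names are assumed from the build-time generated paddle/version.py):

```python
import paddle
import paddle.version as pv  # generated when the wheel is built

# Installed wheel version and the git commit it was built from
# (attribute names assumed; check paddle/version.py in site-packages).
print(paddle.__version__)
print(pv.commit)
```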

wanghaoshuang commented 5 years ago

Paddle version: the paddle develop branch with https://github.com/guoshengCS/Paddle/tree/fix-gru-doc pulled in, compiled inside nvidia-docker

If you do not pull guoshengCS's branch, does it run normally? And does the error occur on the very first batch?

wanghaoshuang commented 5 years ago

@guoshengCS could you please take a look at this issue?

guoshengCS commented 5 years ago

Could you first try the develop branch without the seq2seq API? The https://github.com/guoshengCS/Paddle/tree/fix-gru-doc PR fixes a bug in BasicGRUUnit, which is used by the GRUCell in the seq2seq API (its backward pass could not run before the fix). Has GRUCell been verified in any other networks?
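For what it's worth, GRUCell can be exercised outside the OCR model with a tiny forward-plus-backward run such as the sketch below (assuming the develop-branch seq2seq API exposes fluid.layers.GRUCell and fluid.layers.rnn; shapes and hyperparameters are arbitrary):

```python
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as layers

# Tiny forward + backward pass through GRUCell via layers.rnn,
# to check the cell independently of the OCR network.
main_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    # [batch, seq_len, feature]; fluid.layers.data adds the leading -1 batch dim
    x = fluid.layers.data(name='x', shape=[10, 16], dtype='float32')
    cell = layers.GRUCell(hidden_size=32)
    outputs, _ = layers.rnn(cell=cell, inputs=x)
    loss = layers.reduce_mean(outputs)
    fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(startup_prog)
loss_val, = exe.run(main_prog,
                    feed={'x': np.random.rand(4, 10, 16).astype('float32')},
                    fetch_list=[loss])
print(loss_val)
```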

kinghuin commented 5 years ago

Thanks @guoshengCS @wanghaoshuang, recompiling has resolved the problem.