[Closed] kinghuin closed this issue 5 years ago
What is the git log of the Paddle you compiled?
Paddle version: the paddle develop branch with https://github.com/guoshengCS/Paddle/tree/fix-gru-doc pulled in, compiled inside nvidia-docker.
Does it run normally if you do not pull guoshengCS's branch? Does the error occur on the very first batch?
Could @guoshengCS please take a look at this issue?
Could you first try the develop-branch version that does not use the seq2seq API? The https://github.com/guoshengCS/Paddle/tree/fix-gru-doc PR fixes a bug in BasicGRUUnit, which is used by the GRUCell in the seq2seq API; its backward pass could not run before the fix. Has GRUCell been verified in any other networks? A standalone check is sketched below.
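As a minimal sketch of such a standalone check (assuming the develop-branch fluid.layers.GRUCell / fluid.layers.rnn API; the shapes, names and sizes below are made up for illustration), something like this should exercise GRUCell's forward and backward on a single random batch:

import numpy as np
import paddle.fluid as fluid

batch_size, seq_len, input_dim, hidden_dim = 4, 8, 16, 32

# [batch, seq, input]; fluid.layers.data prepends the batch dimension
x = fluid.layers.data(name='x', shape=[seq_len, input_dim], dtype='float32')
cell = fluid.layers.GRUCell(hidden_size=hidden_dim)     # uses BasicGRUUnit internally
outputs, _ = fluid.layers.rnn(cell, inputs=x)           # [batch, seq, hidden]
loss = fluid.layers.reduce_mean(outputs)
fluid.optimizer.SGD(learning_rate=0.01).minimize(loss)  # appends the backward pass

exe = fluid.Executor(fluid.CUDAPlace(0))                # or fluid.CPUPlace()
exe.run(fluid.default_startup_program())
feed = {'x': np.random.random((batch_size, seq_len, input_dim)).astype('float32')}
print(exe.run(feed=feed, fetch_list=[loss]))            # one forward/backward step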
Thanks @guoshengCS @wanghaoshuang, the problem was resolved after recompiling.
Environment: in-house cluster, CUDA 9, V100, CentOS 7. Paddle version: the paddle develop branch with https://github.com/guoshengCS/Paddle/tree/fix-gru-doc pulled in, compiled inside nvidia-docker. Model: https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/ocr_recognition, with no changes to the model code. Error:
/usr/local/lib/python2.7/dist-packages/paddle/fluid/evaluator.py:72: Warning: The EditDistance is deprecated, because maintain a modified program inside evaluator cause bug easily, please use fluid.metrics.EditDistance instead.
  % (self.__class__.__name__, self.__class__.__name__), Warning)
finish batch shuffle
W1010 14:12:12.288261  6818 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 9.0
W1010 14:12:12.293643  6818 device_context.cc:243] device: 0, cuDNN Version: 7.4.
W1010 14:12:12.293686  6818 device_context.cc:269] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.5, but CUDNN version in your machine is 7.4, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
I1010 14:12:14.791474  6818 parallel_executor.cc:421] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I1010 14:12:14.804628  6818 build_strategy.cc:363] SeqOnlyAllReduceOps:0, num_trainers:1
I1010 14:12:14.816233  6818 parallel_executor.cc:285] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I1010 14:12:14.824354  6818 parallel_executor.cc:368] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py:774: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "train.py", line 245, in <module>
main()
File "train.py", line 241, in main
train(args)
File "train.py", line 172, in train
results = train_one_batch(data)
File "train.py", line 128, in train_one_batch
feed=get_feeder_data(data, place))
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 311, in run
return_numpy=return_numpy)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 775, in run
six.reraise(*sys.exc_info())
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 770, in run
use_program_cache=use_program_cache)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 829, in _run_impl
return_numpy=return_numpy)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 669, in _run_parallel
tensors = exe.run(fetch_var_names)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:
C++ Call Stacks (More useful to developers):
0   std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > paddle::platform::GetTraceBackString<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, int)
2   paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
3   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::RuntimeContext*) const
5   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const
6   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&)
7   paddle::framework::details::ComputationOpHandle::RunImpl()
8   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
9   paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp(paddle::framework::details::OpHandleBase*, std::shared_ptr<paddle::framework::BlockingQueue<unsigned long> > const&, unsigned long*)
10  std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
11 std::thread::_Impl<std::_Bind_simple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run()
Python Call Stacks (More useful to users):
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/framework.py", line 2313, in append_op attrs=kwargs.get("attrs", None)) File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/layer_helper.py", line 43, in append_op return self.main_program.current_block().append_op(*args, **kwargs) File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/layers/nn.py", line 4180, in batch_norm "use_global_stats": use_global_stats File "/paddle/guowei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 49, in conv_bn_pool is_test=is_test) File "/paddle/guowei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 87, in ocr_convs use_cudnn=use_cudnn) File "/paddle/guowei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 126, in encoder_net use_cudnn=use_cudnn) File "/paddle/guowei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 200, in ctc_train_net use_cudnn=True if args.use_gpu else False) File "train.py", line 81, in train args, data_shape, num_classes) File "train.py", line 241, in main train(args) File "train.py", line 245, in
main()
Error Message Summary:
PaddleCheckError: CUDNN_STATUS_BAD_PARAM at [/paddle/kuke/Paddle/paddle/fluid/operators/batch_norm_op.cu:174] [operator < batch_norm > error]
How should this be resolved?
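For what it's worth, the warning in the log above already flags the mismatch that the recompile later fixed: this Paddle build was compiled with cuDNN 7.5 while the machine provides cuDNN 7.4. As a hedged sketch (the library SONAME below is an assumption and may differ per system), the cuDNN version actually loaded at runtime can be double-checked from Python via ctypes and compared against the version in the warning:

# Report the cuDNN version that is loadable at runtime, to compare against
# the "compiled with CUDNN 7.5" warning above.
# The library name/path is system dependent; adjust it for your machine.
import ctypes

libcudnn = ctypes.CDLL("libcudnn.so.7")            # assumed SONAME; may differ
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
raw = libcudnn.cudnnGetVersion()                   # e.g. 7401 for cuDNN 7.4.1
print("runtime cuDNN: %d.%d.%d" % (raw // 1000, (raw % 1000) // 100, raw % 100))

If the reported version does not match the one Paddle was compiled with, recompiling or reinstalling Paddle against the locally installed cuDNN, as the warning suggests, is what resolved the issue for the reporter above.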