Error running PaddleNLP/neural_machine_translation/transformer/infer.py

yuzunrui commented 5 years ago

PaddleNLP/neural_machine_translation/transformer/infer.py fails with the error below. What could be the cause? Thanks.

memory_optimize is deprecated. Use CompiledProgram and Executor
W0614 19:29:48.624078 50964 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0614 19:29:48.762323 50964 device_context.cc:269] device: 0, cuDNN Version: 7.0.
W0614 19:29:48.762387 50964 device_context.cc:293] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
*** Aborted at 1560511794 (unix time) try "date -d @1560511794" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x38) received by PID 50964 (TID 0x7fd56a55c740) from PID 56; stack trace: ***
    @     0x7fd56a3216d0 (unknown)
    @     0x7fd56a539d56 _dl_relocate_object
    @     0x7fd56a5427ac dl_open_worker
    @     0x7fd56a53d914 _dl_catch_error
    @     0x7fd56a541ccb _dl_open
    @     0x7fd569979082 do_dlopen
    @     0x7fd56a53d914 _dl_catch_error
    @     0x7fd569979142 __GI___libc_dlopen_mode
    @     0x7fd569950b45 init
    @     0x7fd56a31ee70 __GI___pthread_once
    @     0x7fd569950c5c __GI___backtrace
    @     0x7fd537168b68 paddle::platform::EnforceNotMet::Init<>()
    @     0x7fd537168eb7 paddle::platform::EnforceNotMet::EnforceNotMet()
    @     0x7fd538eb3a86 paddle::platform::GpuMaxChunkSize()
    @     0x7fd538e88422 _ZSt16__once_call_implISt12_Bind_simpleIFZN6paddle6memory6legacy20GetGPUBuddyAllocatorEiEUlvE_vEEEvv
    @     0x7fd56a31ee70 __GI___pthread_once
    @     0x7fd538e87acd paddle::memory::legacy::GetGPUBuddyAllocator()
    @     0x7fd538e888f3 paddle::memory::legacy::Alloc<>()
    @     0x7fd538e88d35 paddle::memory::allocation::LegacyAllocator::AllocateImpl()
    @     0x7fd538eadfeb paddle::memory::allocation::Allocator::Allocate()
    @     0x7fd538e7c933 paddle::memory::allocation::AllocatorFacade::Alloc()
    @     0x7fd538e7ca51 paddle::memory::allocation::AllocatorFacade::AllocShared()
    @     0x7fd538aba920 paddle::memory::AllocShared()
    @     0x7fd538e4ecea paddle::framework::Tensor::mutable_data()
    @     0x7fd537e27451 paddle::operators::FillConstantKernel<>::Compute()
    @     0x7fd537e2a5f3 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators18FillConstantKernelIfEENSA_IdEENSA_IlEENSA_IiEENSA_INS7_7float16EEEEEclEPKcSJ_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
    @     0x7fd538dfa6f6 paddle::framework::OperatorWithKernel::RunImpl()
    @     0x7fd538dfae64 paddle::framework::OperatorWithKernel::RunImpl()
    @     0x7fd538df878c paddle::framework::OperatorBase::Run()
    @     0x7fd5372dd8be paddle::framework::Executor::RunPreparedContext()
    @     0x7fd5372de6ff paddle::framework::Executor::Run()
    @     0x7fd53715835e _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL18pybind11_init_coreERNS_6moduleEEUlRNS2_9framework8ExecutorERKNS6_11ProgramDescEPNS6_5ScopeEibbRKSt6vectorISsSaISsEEE97_vIS8_SB_SD_ibbSI_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES10_
Segmentation fault (core dumped)
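
The log above warns that Paddle was built against one cuDNN version while the machine has another. The thread does not establish that this mismatch causes the segfault, but when comparing machines it can help to pull both versions out of a log programmatically. A minimal sketch, assuming the warning keeps the wording shown above; `find_cudnn_mismatch` and `version_tuple` are hypothetical helper names, not part of Paddle:

```python
import re

# Matches the wording of the device_context.cc warning quoted in this thread.
_CUDNN_WARNING = re.compile(
    r"compiled with CUDNN (?P<built>\d+(?:\.\d+)*), "
    r"but CUDNN version in your machine is (?P<found>\d+(?:\.\d+)*)"
)


def find_cudnn_mismatch(log_text):
    """Return (built, found) version strings from the warning, or None if absent."""
    match = _CUDNN_WARNING.search(log_text)
    if match is None:
        return None
    return match.group("built"), match.group("found")


def version_tuple(version):
    """Turn '7.3' into (7, 3) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    sample = (
        "W0614 19:29:48.762387 50964 device_context.cc:293] WARNING: device: 0. "
        "The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in "
        "your machine is 7.0, which may cause serious incompatible bug."
    )
    built, found = find_cudnn_mismatch(sample)
    if version_tuple(found) < version_tuple(built):
        print("machine cuDNN %s is older than build cuDNN %s" % (found, built))
```

Running the same check on logs from a working and a failing machine gives a quick side-by-side of their cuDNN versions.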

guoshengCS commented 5 years ago

This does not look like a problem with the model itself. Can you reproduce it consistently? Do training or other models run normally?
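
One way to answer the "does it reproduce consistently" question is to run the failing command several times and count how many runs die from SIGSEGV. A minimal sketch, assuming a POSIX system, where `subprocess` reports death by signal as a negative return code; `count_segfaults` is a hypothetical helper, and the child process here is a stand-in that deliberately crashes rather than the real infer.py invocation:

```python
import signal
import subprocess
import sys


def count_segfaults(command, runs=3):
    """Run `command` `runs` times and count exits caused by SIGSEGV (POSIX only)."""
    crashes = 0
    for _ in range(runs):
        returncode = subprocess.call(command)
        # On POSIX, a child killed by signal N is reported as return code -N.
        if returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes


if __name__ == "__main__":
    # Stand-in child that sends SIGSEGV to itself; replace with the real command.
    crashing_child = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
    ]
    print("segfaults:", count_segfaults(crashing_child, runs=3))
```

To use it on the actual problem, pass the infer.py command line as a list of arguments in place of `crashing_child`.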

yuzunrui commented 5 years ago

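> This does not look like a problem with the model itself. Can you reproduce it consistently? Do training or other models run normally?

It reproduces consistently. I have not yet tried training or other models.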

yuzunrui commented 5 years ago

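> This does not look like a problem with the model itself. Can you reproduce it consistently? Do training or other models run normally?

Running infer.py on another development machine does not produce this error, but a different one appears: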

/home/dqa/.jumbo/lib/python2.7/site-packages/sklearn/externals/joblib/_multiprocessing_helpers.py:38: UserWarning: This platform lacks a functioning sem_open implementation, therefore, the required synchronization primitives needed will not function, see issue 3770.. joblib will operate in serial mode
  warnings.warn('%s. joblib will operate in serial mode' % (e,))
W0617 10:15:23.914328 46063 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 35, Driver API Version: 9.2, Runtime API Version: 9.0
W0617 10:15:23.914397 46063 device_context.cc:271] device: 0, cuDNN Version: 5.0.
W0617 10:15:23.914408 46063 device_context.cc:295] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 5.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
Traceback (most recent call last):
  File "infer.py", line 325, in <module>
    fast_infer(args)
  File "infer.py", line 233, in fast_infer
    if isinstance(var, fluid.framework.Parameter)
  File "/home/dqa/.jumbo/lib/python2.7/site-packages/paddle/fluid/io.py", line 607, in load_vars
    executor.run(load_prog)
  File "/home/dqa/.jumbo/lib/python2.7/site-packages/paddle/fluid/executor.py", line 525, in run
    use_program_cache=use_program_cache)
  File "/home/dqa/.jumbo/lib/python2.7/site-packages/paddle/fluid/executor.py", line 591, in _run
    exe.run(program.desc, scope, 0, True, True, fetch_var_name)
paddle.fluid.core.EnforceNotMet: Invoke operator load error.
Python Callstacks:
  File "/home/dqa/.jumbo/lib/python2.7/site-packages/paddle/fluid/framework.py", line 1317, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/dqa/.jumbo/lib/python2.7/site-packages/paddle/fluid/io.py", line 593, in load_vars
    attrs={'file_path': os.path.join(dirname, new_var.name)})
  File "infer.py", line 233, in fast_infer
    if isinstance(var, fluid.framework.Parameter)
  File "infer.py", line 325, in <module>
    fast_infer(args)
C++ Callstacks:
Cannot open file base_models/iter_100000.infer.model/fc_74.w_0 for load op at [/paddle/paddle/fluid/operators/load_op.cc:39]
PaddlePaddle Call Stacks:
0 0x7fdc44705e85p void paddle::platform::EnforceNotMet::Init<char const>(char const, char const, int) + 357
1 0x7fdc44706209p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const, int) + 137
2 0x7fdc44dcb1d2p paddle::operators::LoadOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void> const&) const + 1234
3 0x7fdc461e4045p paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void, boost::detail::variant::void> const&) + 341
4 0x7fdc44825912p paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext, paddle::framework::Scope, bool, bool, bool) + 226
5 0x7fdc4482785fp paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator > const&, bool) + 143
6 0x7fdc446f648ep
7 0x7fdc447317aep
8 0x7fdcdb12a3d4p PyEval_EvalFrameEx + 25956
9 0x7fdcdb12b120p PyEval_EvalCodeEx + 2240
10 0x7fdcdb129491p PyEval_EvalFrameEx + 22049
11 0x7fdcdb12b120p PyEval_EvalCodeEx + 2240
12 0x7fdcdb129491p PyEval_EvalFrameEx + 22049
13 0x7fdcdb12b120p PyEval_EvalCodeEx + 2240
14 0x7fdcdb129491p PyEval_EvalFrameEx + 22049
15 0x7fdcdb129c46p PyEval_EvalFrameEx + 24022
16 0x7fdcdb12b120p PyEval_EvalCodeEx + 2240
17 0x7fdcdb12b232p PyEval_EvalCode + 50
18 0x7fdcdb14561cp
19 0x7fdcdb1456f0p PyRun_FileExFlags + 144
20 0x7fdcdb146bfcp PyRun_SimpleFileExFlags + 220
21 0x7fdcdb1584bcp Py_Main + 3164
22 0x318ae1ecddp __libc_start_main + 253
23 0x400659p

guoshengCS commented 5 years ago

The error below is most likely caused by a file that cannot be found. Please double-check that the model path is correct: Cannot open file base_models/iter_100000.infer.model/fc_74.w_0 for load op
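
The Python callstack in the error shows `load_vars` building each file path as `os.path.join(dirname, new_var.name)`, so every parameter is expected as its own file directly under the model directory, and a relative directory is resolved against the current working directory. A minimal sketch for checking this by hand; `missing_param_files` is a hypothetical helper, and the single parameter name used below is only the one that appears in the error message, not the full list the model needs:

```python
import os


def missing_param_files(model_dir, param_names):
    """Return the names in param_names that have no file directly under model_dir."""
    return [
        name
        for name in param_names
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


if __name__ == "__main__":
    # Directory and parameter name copied from the error message in this thread.
    model_dir = "base_models/iter_100000.infer.model"
    print("resolved against cwd:", os.path.abspath(model_dir))
    print("missing:", missing_param_files(model_dir, ["fc_74.w_0"]))
```

Printing the absolute path makes it easy to spot a run started from a different working directory than intended.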

yuzunrui commented 5 years ago

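> The error below is most likely caused by a file that cannot be found. Please double-check that the model path is correct: Cannot open file base_models/iter_100000.infer.model/fc_74.w_0 for load op

The path is correct. Could this be caused by a GPU memory problem?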

yuzunrui commented 5 years ago

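> This does not look like a problem with the model itself. Can you reproduce it consistently? Do training or other models run normally?

Running the training script train.py produces a similar error. Could you please help resolve this? Thanks.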

Training command:

/mnt/dqa/lihongyu04/anaconda2/bin/python -u train.py \
  --src_vocab_fpath /mnt/dqa/yuzunrui/neural_machine_translation/transformer/wmt16_ende_data_bpe_clean/vocab_all.bpe.32000 \
  --trg_vocab_fpath /mnt/dqa/yuzunrui/neural_machine_translation/transformer/wmt16_ende_data_bpe_clean/vocab_all.bpe.32000 \
  --special_token '<s>' '<e>' '<unk>' \
  --train_file_pattern /mnt/dqa/yuzunrui/neural_machine_translation/transformer/wmt16_ende_data_bpe_clean/train.tok.clean.bpe.32000.en-de \
  --token_delimiter ' ' \
  --use_token_batch True \
  --batch_size 4 \
  --sort_type pool \
  --pool_size 200000

Error output:

2019-06-17 11:03:23,909-INFO: before adam
memory_optimize is deprecated. Use CompiledProgram and Executor
2019-06-17 11:03:46,205-INFO: local start_up:
2019-06-17 11:03:46,207-INFO: init fluid.framework.default_startup_program
W0617 11:03:47.075270 92774 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0617 11:03:47.082429 92774 device_context.cc:269] device: 0, cuDNN Version: 7.0.
W0617 11:03:47.082461 92774 device_context.cc:293] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
*** Aborted at 1560740631 (unix time) try "date -d @1560740631" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x38) received by PID 92774 (TID 0x7faa2b2b6740) from PID 56; stack trace: ***
    @     0x7faa2b07b6d0 (unknown)
    @     0x7faa2b293d56 _dl_relocate_object
    @     0x7faa2b29c7ac dl_open_worker
    @     0x7faa2b297914 _dl_catch_error
    @     0x7faa2b29bccb _dl_open
    @     0x7faa2a6d3082 do_dlopen
    @     0x7faa2b297914 _dl_catch_error
    @     0x7faa2a6d3142 __GI___libc_dlopen_mode
    @     0x7faa2a6aab45 init
    @     0x7faa2b078e70 __GI___pthread_once
    @     0x7faa2a6aac5c __GI___backtrace
    @     0x7fa9f7ec0b68 paddle::platform::EnforceNotMet::Init<>()
    @     0x7fa9f7ec0eb7 paddle::platform::EnforceNotMet::EnforceNotMet()
    @     0x7fa9f9c0ba86 paddle::platform::GpuMaxChunkSize()
    @     0x7fa9f9be0422 _ZSt16__once_call_implISt12_Bind_simpleIFZN6paddle6memory6legacy20GetGPUBuddyAllocatorEiEUlvE_vEEEvv
    @     0x7faa2b078e70 __GI___pthread_once
    @     0x7fa9f9bdfacd paddle::memory::legacy::GetGPUBuddyAllocator()
    @     0x7fa9f9be08f3 paddle::memory::legacy::Alloc<>()
    @     0x7fa9f9be0d35 paddle::memory::allocation::LegacyAllocator::AllocateImpl()
    @     0x7fa9f9c05feb paddle::memory::allocation::Allocator::Allocate()
    @     0x7fa9f9bd4933 paddle::memory::allocation::AllocatorFacade::Alloc()
    @     0x7fa9f9bd4a51 paddle::memory::allocation::AllocatorFacade::AllocShared()
    @     0x7fa9f9812920 paddle::memory::AllocShared()
    @     0x7fa9f9ba6cea paddle::framework::Tensor::mutable_data()
    @     0x7fa9f8b7f451 paddle::operators::FillConstantKernel<>::Compute()
    @     0x7fa9f8b825f3 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators18FillConstantKernelIfEENSA_IdEENSA_IlEENSA_IiEENSA_INS7_7float16EEEEEclEPKcSJ_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
    @     0x7fa9f9b526f6 paddle::framework::OperatorWithKernel::RunImpl()
    @     0x7fa9f9b52e64 paddle::framework::OperatorWithKernel::RunImpl()
    @     0x7fa9f9b5078c paddle::framework::OperatorBase::Run()
    @     0x7fa9f80358be paddle::framework::Executor::RunPreparedContext()
    @     0x7fa9f80366ff paddle::framework::Executor::Run()
    @     0x7fa9f7eb035e _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL18pybind11_init_coreERNS_6moduleEEUlRNS2_9framework8ExecutorERKNS6_11ProgramDescEPNS6_5ScopeEibbRKSt6vectorISsSaISsEEE97_vIS8_SB_SD_ibbSI_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES10_
Segmentation fault (core dumped)

guoshengCS commented 5 years ago

Please try inference or training on CPU by adding use_gpu False to the run command. This looks like an environment problem.

yuzunrui commented 5 years ago

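> Please try inference or training on CPU by adding use_gpu False to the run command. This looks like an environment problem.

On CPU, training runs, but inference still fails with a similar segmentation fault. On another machine, both training and inference work on GPU.

How should I go about fixing the environment on this machine? Thanks.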

yuzunrui commented 5 years ago

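> The error below is most likely caused by a file that cannot be found. Please double-check that the model path is correct: Cannot open file base_models/iter_100000.infer.model/fc_74.w_0 for load op

The path is correct. I re-downloaded the model and the problem persists.

Inference works normally with a model I trained myself.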

JiaXiao243 commented 5 years ago

Under Python 2.7, running python -u infer.py --src_vocab_fpath wmt16_ende_data_bpe_clean/vocab_all.bpe.32000 --trg_vocab_fpath wmt16_ende_data_bpe_clean/vocab_all.bpe.32000 --test_file_pattern wmt16_ende_data_bpe_clean/newstest2014.tok.bpe.32000.en-de --batch_size 32 model_path base_model/iter_100000.infer.model produces the following error:

PaddlePaddle / models

Error running PaddleNLP/neural_machine_translation/transformer/infer.py #2414