Open yuzunrui opened 5 years ago
This does not look like it is caused by the model. Can you reproduce it consistently? Can you run training or other models normally?
It reproduces consistently. I have not yet tried training or other models.
Running infer.py on another development machine did not produce this error, but a different one appeared:
/home/dqa/.jumbo/lib/python2.7/site-packages/sklearn/externals/joblib/_multiprocessing_helpers.py:38: UserWarning: This platform lacks a functioning sem_open implementation, therefore, the required synchronization primitives needed will not function, see issue 3770.. joblib will operate in serial mode
warnings.warn('%s. joblib will operate in serial mode' % (e,))
W0617 10:15:23.914328 46063 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 35, Driver API Version: 9.2, Runtime API Version: 9.0
W0617 10:15:23.914397 46063 device_context.cc:271] device: 0, cuDNN Version: 5.0.
W0617 10:15:23.914408 46063 device_context.cc:295] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 5.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
Traceback (most recent call last):
File "infer.py", line 325, in
The error below is most likely because the file cannot be found. Please double-check that the model path is correct.
Cannot open file base_models/iter_100000.infer.model/fc_74.w_0 for load op
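One way to double-check the path is to test for the parameter file directly, from the same working directory in which infer.py is launched. The sketch below is a hypothetical helper and not part of PaddleNLP: the function name `missing_param_files` and the list of expected names are assumptions for illustration, and only `fc_74.w_0` and the directory name come from the error message above.

```python
import os


def missing_param_files(model_dir, expected_names):
    """Return the paths of expected parameter files that are absent from model_dir."""
    missing = []
    for name in expected_names:
        path = os.path.join(model_dir, name)
        if not os.path.isfile(path):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Relative paths are resolved against the current working directory,
    # so run this from the directory where infer.py is started.
    model_dir = "base_models/iter_100000.infer.model"
    print("cwd: %s" % os.getcwd())
    print("missing: %s" % missing_param_files(model_dir, ["fc_74.w_0"]))
```

An empty list means the file exists at that relative path. In that case the load failure would have to come from something other than a wrong path.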
The path is correct. Could this be caused by a GPU memory problem?
Running the training program train.py also reports a similar error. Please help resolve this, thanks.
Training command: /mnt/dqa/lihongyu04/anaconda2/bin/python -u train.py --src_vocab_fpath /mnt/dqa/yuzunrui/neural_machine_translation/transformer/wmt16_ende_data_bpe_clean/vocab_all.bpe.32000 --trg_vocab_fpath /mnt/dqa/yuzunrui/neural_machine_translation/transformer/wmt16_ende_data_bpe_clean/vocab_all.bpe.32000 --special_token '<s>' '<e>' '<unk>' --train_file_pattern /mnt/dqa/yuzunrui/neural_machine_translation/transformer/wmt16_ende_data_bpe_clean/train.tok.clean.bpe.32000.en-de --token_delimiter ' ' --use_token_batch True --batch_size 4 --sort_type pool --pool_size 200000
Error message:
2019-06-17 11:03:23,909-INFO: before adam
memory_optimize is deprecated. Use CompiledProgram and Executor
2019-06-17 11:03:46,205-INFO: local start_up:
2019-06-17 11:03:46,207-INFO: init fluid.framework.default_startup_program
W0617 11:03:47.075270 92774 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0617 11:03:47.082429 92774 device_context.cc:269] device: 0, cuDNN Version: 7.0.
W0617 11:03:47.082461 92774 device_context.cc:293] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
Aborted at 1560740631 (unix time) try "date -d @1560740631" if you are using GNU date
PC: @ 0x0 (unknown)
SIGSEGV (@0x38) received by PID 92774 (TID 0x7faa2b2b6740) from PID 56; stack trace:
@ 0x7faa2b07b6d0 (unknown)
@ 0x7faa2b293d56 _dl_relocate_object
@ 0x7faa2b29c7ac dl_open_worker
@ 0x7faa2b297914 _dl_catch_error
@ 0x7faa2b29bccb _dl_open
@ 0x7faa2a6d3082 do_dlopen
@ 0x7faa2b297914 _dl_catch_error
@ 0x7faa2a6d3142 GI_libc_dlopen_mode
@ 0x7faa2a6aab45 init
@ 0x7faa2b078e70 GI_pthread_once
@ 0x7faa2a6aac5c GI___backtrace
@ 0x7fa9f7ec0b68 paddle::platform::EnforceNotMet::Init<>()
@ 0x7fa9f7ec0eb7 paddle::platform::EnforceNotMet::EnforceNotMet()
@ 0x7fa9f9c0ba86 paddle::platform::GpuMaxChunkSize()
@ 0x7fa9f9be0422 _ZSt16once_call_implISt12_Bind_simpleIFZN6paddle6memory6legacy20GetGPUBuddyAllocatorEiEUlvE_vEEEvv
@ 0x7faa2b078e70 __GI___pthread_once
@ 0x7fa9f9bdfacd paddle::memory::legacy::GetGPUBuddyAllocator()
@ 0x7fa9f9be08f3 paddle::memory::legacy::Alloc<>()
@ 0x7fa9f9be0d35 paddle::memory::allocation::LegacyAllocator::AllocateImpl()
@ 0x7fa9f9c05feb paddle::memory::allocation::Allocator::Allocate()
@ 0x7fa9f9bd4933 paddle::memory::allocation::AllocatorFacade::Alloc()
@ 0x7fa9f9bd4a51 paddle::memory::allocation::AllocatorFacade::AllocShared()
@ 0x7fa9f9812920 paddle::memory::AllocShared()
@ 0x7fa9f9ba6cea paddle::framework::Tensor::mutable_data()
@ 0x7fa9f8b7f451 paddle::operators::FillConstantKernel<>::Compute()
@ 0x7fa9f8b825f3 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators18FillConstantKernelIfEENSA_IdEENSA_IlEENSA_IiEENSA_INS7_7float16EEEEEclEPKcSJ_iEUlS4_E_E9_M_invokeERKSt9_AnydataS4
@ 0x7fa9f9b526f6 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7fa9f9b52e64 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7fa9f9b5078c paddle::framework::OperatorBase::Run()
@ 0x7fa9f80358be paddle::framework::Executor::RunPreparedContext()
@ 0x7fa9f80366ff paddle::framework::Executor::Run()
@ 0x7fa9f7eb035e _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL18pybind11_init_coreERNS_6moduleEEUlRNS2_9framework8ExecutorERKNS6_11ProgramDescEPNS6_5ScopeEibbRKSt6vectorISsSaISsEEE97_vIS8_SB_SD_ibbSI_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4FUNES10
Segmentation fault (core dumped)
Please try prediction or training on CPU by adding use_gpu False to the run command. This looks like an environment problem.
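On CPU, training runs, but prediction still fails with a similar segmentation fault. On another machine, both training and prediction work on GPU.
How should I go about fixing the environment on this machine? Thanks.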
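As a starting point for comparing the two machines, the cuDNN version that Paddle was compiled against and the version found on the machine can be extracted from the device_context.cc warning that appears in the logs above. The following is a minimal sketch: `cudnn_mismatch` is a hypothetical helper name, and the regular expression assumes the warning wording stays exactly as it is logged in this thread. Whether the version mismatch is what causes the segmentation fault is not established here.

```python
import re

# Matches the warning wording seen in the logs in this thread (an assumption:
# other Paddle versions may phrase the warning differently).
_CUDNN_WARNING = re.compile(
    r"compiled with CUDNN (\d+(?:\.\d+)*), "
    r"but CUDNN version in your machine is (\d+(?:\.\d+)*)"
)


def cudnn_mismatch(log_text):
    """Return (compiled, installed) version strings from the first
    mismatch warning in log_text, or None if no such warning is present."""
    match = _CUDNN_WARNING.search(log_text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For the training log above, the function would return the pair ("7.3", "7.0"). Running it over the logs from the working and failing machines gives a quick side-by-side view of the two environments.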
The path is correct. I re-downloaded the model and the problem persists.
Prediction works normally with a model I trained myself.
In a Python 2.7 environment, running python -u infer.py --src_vocab_fpath wmt16_ende_data_bpe_clean/vocab_all.bpe.32000 --trg_vocab_fpath wmt16_ende_data_bpe_clean/vocab_all.bpe.32000 --test_file_pattern wmt16_ende_data_bpe_clean/newstest2014.tok.bpe.32000.en-de --batch_size 32 model_path base_model/iter_100000.infer.model produces the following error:
PaddleNLP/neural_machine_translation/transformer/infer.py reports an error. What is the cause? Thanks.