PaddlePaddle / models

Officially maintained, supported by PaddlePaddle, including CV, NLP, Speech, Rec, TS, big models and so on.
Apache License 2.0

Failed to find dynamic library: libwarpctc.so ( dlopen: cannot load any more object with static TLS ) #3007

Open endy-see opened 5 years ago

endy-see commented 5 years ago

My local environment:
- CentOS: release 6.9
- NCCL: v2.4.7
- CUDA: 9.0.176
- cuDNN: 7.3.1
- Paddle: 1.5.1
- Python: 3.7.3
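For loader errors like this one, it can help to report the C library version along with the versions above, since static TLS space for shared libraries is handed out by the glibc dynamic loader. A minimal sketch using only the standard library (`collect_env` is a hypothetical helper, not part of Paddle):

```python
import os
import platform
import sys


def collect_env():
    """Gather environment details relevant to dynamic-loading errors."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # glibc's dynamic loader decides how much static TLS space is available
        "libc": f"{libc_name} {libc_version}".strip(),
        # the search path the error message below asks users to adjust
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
    }


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```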

When I start training the ocr_recognition model with the crnn_ctc model, Paddle raises the following error:

(paddle) [ocr_recognition]# env CUDA_VISIBLE_DEVICES=0 python train.py --train_images dataset/public_data_english/train_images --train_list dataset/public_data_english/train.list --test_images dataset/public_data_english/test_images --test_list dataset/public_data_english/test.list
----------- Configuration Arguments -----------
average_window: 0.15
batch_size: 32
eval_period: 15000
init_model: None
log_period: 1000
max_average_window: 12500
min_average_window: 10000
model: crnn_ctc
parallel: False
profile: False
save_model_dir: ./models
save_model_period: 15000
skip_batch_num: 0
skip_test: False
test_images: dataset/public_data_english/test_images
test_list: dataset/public_data_english/test.list
total_step: 720000
train_images: dataset/public_data_english/train_images
train_list: dataset/public_data_english/train.list
use_gpu: True
/home/work/software/anaconda2/envs/paddle/lib/python3.7/site-packages/paddle/fluid/evaluator.py:71: Warning: The EditDistance is deprecated, because maintain a modified program inside evaluator cause bug easily, please use fluid.metrics.EditDistance instead.
  % (self.__class__.__name__, self.__class__.__name__), Warning)
finish batch shuffle
W0801 21:22:58.187352 37850 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.2, Runtime API Version: 9.0
W0801 21:22:58.192481 37850 device_context.cc:267] device: 0, cuDNN Version: 7.3.
W0801 21:22:59.779482 37850 dynamic_loader.cc:140] Failed to find dynamic library: /paddle/build/third_party/install/warpctc/lib/libwarpctc.so (dlopen: cannot load any more object with static TLS)
W0801 21:22:59.779705 37850 dynamic_loader.cc:109] Can not find library: libwarpctc.so. The process maybe hang. Please try to add the lib path to LD_LIBRARY_PATH.
Traceback (most recent call last):
  File "train.py", line 222, in <module>
    main()
  File "train.py", line 218, in main
    train(args)
  File "train.py", line 151, in train
    results = train_one_batch(data)
  File "train.py", line 112, in train_one_batch
    fetch_list=fetch_vars)
  File "/home/work/software/anaconda2/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 651, in run
    use_program_cache=use_program_cache)
  File "/home/work/software/anaconda2/envs/paddle/lib/python3.7/site-packages/paddle/fluid/executor.py", line 749, in run
    exe.run(program.desc, scope, 0, True, True, fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: Invoke operator warpctc error.
Python Callstacks:
  File "/home/work/software/anaconda2/envs/paddle/lib/python3.7/site-packages/paddle/fluid/framework.py", line 1771, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/work/software/anaconda2/envs/paddle/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/home/work/software/anaconda2/envs/paddle/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 5573, in warpctc
    'use_cudnn': use_cudnn
  File "/home/zhaoyanmei/models/PaddleCV/ocr_recognition/crnn_ctc_model.py", line 189, in ctc_train_net
    input=fc_out, label=label, blank=num_classes, norm_by_times=True)
  File "train.py", line 61, in train
    args, data_shape, num_classes)
  File "train.py", line 218, in main
    train(args)
  File "train.py", line 222, in <module>
    main()
C++ Callstacks:
Failed to find dynamic library: libwarpctc.so ( dlopen: cannot load any more object with static TLS )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled.
at [/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:166]
PaddlePaddle Call Stacks:
0 0x7fe93ff05830p void paddle::platform::EnforceNotMet::Init(char const, char const, int) + 352
1 0x7fe93ff05ba9p paddle::platform::EnforceNotMet::EnforceNotMet(std::exception_ptr::exception_ptr, char const, int) + 137
2 0x7fe941f09f9bp paddle::platform::dynload::GetWarpCTCDsoHandle() + 1835
3 0x7fe940177be9p void std::once_call_impl<std::Bind_simple<paddle::platform::dynload::DynLoad__get_warpctc_version::operator()<>()::{lambda()#1} ()> >() + 9
4 0x7fe9b196fbe0p pthread_once + 80
5 0x7fe9401809b8p paddle::operators::WarpCTCFunctor<paddle::platform::CUDADeviceContext>::operator()(paddle::framework::ExecutionContext const&, float const, float, int const, int const, int const, unsigned long, unsigned long, unsigned long, float) + 136
6 0x7fe940183206p paddle::operators::WarpCTCKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const + 2390
7 0x7fe940184ab3p std::Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::WarpCTCKernel<paddle::platform::CUDADeviceContext, float> >::operator()(char const, char const, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::M_invoke(std::Anydata const&, paddle::framework::ExecutionContext const&) + 35
8 0x7fe941e6bf07p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::RuntimeContext) const + 375
9 0x7fe941e6c2e1p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 529
10 0x7fe941e698dcp paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 332
11 0x7fe94009061ep paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext, paddle::framework::Scope, bool, bool, bool) + 382
12 0x7fe9400936bfp paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool) + 143
13 0x7fe93fef6ebdp
14 0x7fe93ff38166p
15 0x7fe9b1f1b6e4p _PyMethodDef_RawFastCallKeywords + 612
16 0x7fe9b1f1b801p _PyCFunction_FastCallKeywords + 33
17 0x7fe9b1f777aep _PyEval_EvalFrameDefault + 21374
18 0x7fe9b1eb84f9p _PyEval_EvalCodeWithName + 761
19 0x7fe9b1f1aa27p _PyFunction_FastCallKeywords + 903
20 0x7fe9b1f738fep _PyEval_EvalFrameDefault + 5326
21 0x7fe9b1eb84f9p _PyEval_EvalCodeWithName + 761
22 0x7fe9b1f1aa27p _PyFunction_FastCallKeywords + 903
23 0x7fe9b1f738fep _PyEval_EvalFrameDefault + 5326
24 0x7fe9b1eb8db9p _PyEval_EvalCodeWithName + 3001
25 0x7fe9b1f1aa27p _PyFunction_FastCallKeywords + 903
26 0x7fe9b1f72846p _PyEval_EvalFrameDefault + 1046
27 0x7fe9b1eb8db9p _PyEval_EvalCodeWithName + 3001
28 0x7fe9b1f1aa27p _PyFunction_FastCallKeywords + 903
29 0x7fe9b1f72846p _PyEval_EvalFrameDefault + 1046
30 0x7fe9b1f1a79bp _PyFunction_FastCallKeywords + 251
31 0x7fe9b1f72846p _PyEval_EvalFrameDefault + 1046
32 0x7fe9b1eb84f9p _PyEval_EvalCodeWithName + 761
33 0x7fe9b1eb93c4p PyEval_EvalCodeEx + 68
34 0x7fe9b1eb93ecp PyEval_EvalCode + 28
35 0x7fe9b1fd1874p
36 0x7fe9b1fdbb81p PyRun_FileExFlags + 161
37 0x7fe9b1fdbd73p PyRun_SimpleFileExFlags + 451
38 0x7fe9b1fdce5fp
39 0x7fe9b1fdcf7cp _Py_UnixMain + 60
40 0x7fe9b15c3b45p __libc_start_main + 245
41 0x7fe9b1f82122p

(paddle) [ocr_recognition]#
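The log reports two different things under one heading: "Failed to find dynamic library", yet the underlying `dlopen` message is "cannot load any more object with static TLS", which means the loader ran out of static TLS space rather than failing to locate the file. One way to tell these cases apart outside Paddle is to load the library directly with `ctypes`, which raises `OSError` carrying the loader's error text. A minimal sketch; `try_load` and `classify_dlopen_error` are hypothetical helpers, and the substrings they match are the glibc wording in this log plus the usual "cannot open shared object file" message:

```python
import ctypes


def classify_dlopen_error(message):
    """Map a glibc dlopen() error message to a coarse cause."""
    if "static TLS" in message:
        # the file was found, but no static TLS space is left for it
        return "static_tls_exhausted"
    if "cannot open shared object file" in message:
        # the loader could not locate the file on its search path
        return "not_found"
    return "other"


def try_load(path):
    """Try to dlopen `path`; return (True, "") on success or (False, message)."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    import sys

    lib = sys.argv[1] if len(sys.argv) > 1 else "libwarpctc.so"
    ok, err = try_load(lib)
    print("loaded" if ok else f"{classify_dlopen_error(err)}: {err}")
```

Loading the library on its own in a fresh interpreter may succeed even though it fails inside Paddle, because static TLS space is only exhausted after other libraries have taken their share. That result would itself point at the TLS limit rather than a missing file.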

Can anyone help me? Thank you in advance!

wanghaoshuang commented 5 years ago

As a temporary workaround, you can refer to this: https://github.com/PaddlePaddle/models/issues/1360

In addition, we are continuing to look for a better fix.

endy-see commented 5 years ago

> As a temporary workaround, you can refer to this: #1360
>
> In addition, we are continuing to look for a better fix.

But I didn't install Paddle by compiling it from source; I installed it directly with pip. So the solution below doesn't apply to me, does it? [image]
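With a pip installation, the error message's own advice (add the library's directory to `LD_LIBRARY_PATH`) can still be tried without rebuilding. The sketch below searches an installed package directory for `libwarpctc.so` and builds an environment for relaunching training. glibc reads `LD_LIBRARY_PATH` when a process starts, so it has to be set before Python launches; changing `os.environ` inside `train.py` does not affect later `dlopen` calls. The optional `LD_PRELOAD` setting is a general workaround sometimes used for static TLS exhaustion (preloaded libraries get their TLS reserved at startup); it is an assumption here, not a confirmed fix for this issue. Both helper names are hypothetical:

```python
import os


def find_libraries(root, name="libwarpctc.so"):
    """Return the sorted paths of every file called `name` below `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def loader_env(lib_path, base_env=None, preload=False):
    """Return a copy of `base_env` (default: os.environ) for relaunching.

    The directory containing `lib_path` is prepended to LD_LIBRARY_PATH.
    With preload=True the library is also prepended to LD_PRELOAD, an
    unverified, general workaround for static TLS exhaustion.
    """
    env = dict(os.environ if base_env is None else base_env)
    lib_path = os.path.abspath(lib_path)
    old_path = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = os.path.dirname(lib_path) + (":" + old_path if old_path else "")
    if preload:
        old_preload = env.get("LD_PRELOAD", "")
        env["LD_PRELOAD"] = lib_path + (":" + old_preload if old_preload else "")
    return env
```

A relaunch could then look like `subprocess.run([sys.executable, "train.py", ...], env=loader_env(path), check=True)`, where `path` is one of the results of `find_libraries(os.path.dirname(paddle.__file__))` and `...` stands for the original training arguments.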

JiabinYang commented 5 years ago

This happens in v1.4 and v1.5; v1.3 does not seem to have this issue.

JiabinYang commented 5 years ago
cat /proc/cpuinfo

processor    : 27
vendor_id    : GenuineIntel
cpu family    : 6
model        : 85
model name    : Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz
stepping    : 4
microcode    : 0x2000043
cpu MHz        : 2000.000
cache size    : 19712 KB
physical id    : 1
siblings    : 14
core id        : 14
cpu cores    : 14
apicid        : 60
initial apicid    : 60
fpu        : yes
fpu_exception    : yes
cpuid level    : 22
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc
bogomips    : 4004.68
clflush size    : 64
cache_alignment    : 64
address sizes    : 46 bits physical, 48 bits virtual
power management:
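The traceback in the original report raises `paddle.fluid.core_avx.EnforceNotMet`, which suggests the AVX build of Paddle's core module was in use. When comparing machines, fields like the ones posted above can be pulled out of `/proc/cpuinfo` with a small parser. This sketch (`parse_cpuinfo` and `simd_flags` are hypothetical helpers) reads only the first processor block and reports whether selected AVX-family flags are present:

```python
def parse_cpuinfo(text):
    """Parse the first processor block of /proc/cpuinfo into a dict."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def simd_flags(info, wanted=("avx", "avx2", "avx512f")):
    """Report which of the `wanted` flags the CPU advertises."""
    flags = set(info.get("flags", "").split())
    return {flag: flag in flags for flag in wanted}
```

On Linux this can be used as `simd_flags(parse_cpuinfo(open("/proc/cpuinfo").read()))`.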
BeyondYourself commented 4 years ago

I have tried version 1.6 and the develop version, and both show the same problem as yours: dynamic_loader.cc:140] Failed to find dynamic library: /paddle/build/third_party/install/warpctc/lib/libwarpctc.so