PaddlePaddle / Paddle

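PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)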
http://www.paddlepaddle.org/
Apache License 2.0

V100 training: terminate called after throwing an instance of 'paddle::platform::EnforceNotMet' #14296

Closed liushanshan07 closed 5 years ago

liushanshan07 commented 6 years ago

Training a CTC OCR recognition model on V100 GPUs on paddlecloud fails with the following error.

terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
  what():  an illegal memory access was encountered at [/paddle/paddle/fluid/framework/details/op_handle_base.cc:37]
PaddlePaddle Call Stacks: 
0       0x7f441b971d06p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1       0x7f441ccb0fa2p paddle::framework::details::OpHandleBase::~OpHandleBase() + 402
2       0x7f441cc8a9e1p paddle::framework::details::FetchOpHandle::~FetchOpHandle() + 17
3       0x7f441cc4ca4ep std::vector<std::unique_ptr<paddle::framework::details::FetchOpHandle, std::default_delete<paddle::framework::details::FetchOpHandle> >, std::allocator<std::unique_ptr<paddle::framework::details::FetchOpHandle, std::default_delete<paddle::framework::details::FetchOpHandle> > > >::~vector() + 46
4       0x7f441cc4bb46p paddle::framework::details::ThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&) + 4390
5       0x7f441cc4fb37p paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&) + 391
6       0x7f441ba520b9p paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, std::string const&) + 489
7       0x7f441b966540p
8       0x7f441b988414p pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 2596
9       0x7f44a72b4ddcp PyEval_EvalFrameEx + 19596
10      0x7f44a72b621dp PyEval_EvalCodeEx + 2061
11      0x7f44a72b44f1p PyEval_EvalFrameEx + 17313
12      0x7f44a72b621dp PyEval_EvalCodeEx + 2061
13      0x7f44a72b44f1p PyEval_EvalFrameEx + 17313
14      0x7f44a72b621dp PyEval_EvalCodeEx + 2061
15      0x7f44a72b44f1p PyEval_EvalFrameEx + 17313
16      0x7f44a72b497ep PyEval_EvalFrameEx + 18478
17      0x7f44a72b621dp PyEval_EvalCodeEx + 2061
18      0x7f44a72b6352p PyEval_EvalCode + 50
19      0x7f44a72e0f22p PyRun_FileExFlags + 146
20      0x7f44a72e2459p PyRun_SimpleFileExFlags + 217
21      0x7f44a72f7e9dp Py_Main + 3149
22      0x7f44a64f9bd5p __libc_start_main + 245
23            0x4007a1p
*** Aborted at 1541572190 (unix time) try "date -d @1541572190" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x386) received by PID 902 (TID 0x7f44a79d1700) from PID 902; stack trace: ***
    @     0x7f44a6f9f160 (unknown)
    @     0x7f44a650d3f7 __GI_raise
    @     0x7f44a650e7d8 __GI_abort
    @     0x7f4435f62c65 __gnu_cxx::__verbose_terminate_handler()
    @     0x7f4435f60e06 __cxxabiv1::__terminate()
    @     0x7f4435f5fec9 __cxa_call_terminate
    @     0x7f4435f60a7a __gxx_personality_v0
    @     0x7f4436432853 _Unwind_RaiseException_Phase2
    @     0x7f4436432beb _Unwind_RaiseException
    @     0x7f4435f61045 __cxa_throw
    @     0x7f441ccb0fc0 paddle::framework::details::OpHandleBase::~OpHandleBase()
    @     0x7f441cc8a9e1 paddle::framework::details::FetchOpHandle::~FetchOpHandle()
    @     0x7f441cc4ca4e std::vector<>::~vector()
    @     0x7f441cc4bb46 paddle::framework::details::ThreadedSSAGraphExecutor::Run()
    @     0x7f441cc4fb37 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run()
    @     0x7f441ba520b9 paddle::framework::ParallelExecutor::Run()
    @     0x7f441b966540 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL13pybind11_initEvEUlRNS2_9framework16ParallelExecutorERKSt6vectorISsSaISsEERKSsE102_vIS6_SB_SD_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESV_
    @     0x7f441b988414 pybind11::cpp_function::dispatcher()
    @     0x7f44a72b4ddc PyEval_EvalFrameEx
    @     0x7f44a72b621d PyEval_EvalCodeEx
    @     0x7f44a72b44f1 PyEval_EvalFrameEx
    @     0x7f44a72b621d PyEval_EvalCodeEx
    @     0x7f44a72b44f1 PyEval_EvalFrameEx
    @     0x7f44a72b621d PyEval_EvalCodeEx
    @     0x7f44a72b44f1 PyEval_EvalFrameEx
    @     0x7f44a72b497e PyEval_EvalFrameEx
    @     0x7f44a72b621d PyEval_EvalCodeEx
    @     0x7f44a72b6352 PyEval_EvalCode
    @     0x7f44a72e0f22 PyRun_FileExFlags
    @     0x7f44a72e2459 PyRun_SimpleFileExFlags
    @     0x7f44a72f7e9d Py_Main
    @     0x7f44a64f9bd5 __libc_start_main
/root/paddlejob/run.sh: line 307:   902 Aborted                 (core dumped) python train.py
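
In the trace above the exception is thrown from the `OpHandleBase` destructor, which is likely only where the CUDA error was noticed, not where it originated: kernel launches are asynchronous, so an illegal memory access is usually reported by a later CUDA call. A minimal sketch of one way to narrow this down is to rerun the script with the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous. The helper names `blocking_env` and `rerun` are hypothetical, and the script name `train.py` is taken from the last line of the log.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch block until it finishes,
    # so a faulting kernel is reported close to the op that launched it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(script="train.py"):
    """Rerun the training script (hypothetical helper) and return its exit code."""
    return subprocess.call([sys.executable, script], env=blocking_env())
```

Calling `rerun()` from the job's working directory would be expected to run more slowly than a normal job, so this setting is best used only while debugging.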

typhoonzero commented 6 years ago

I found a similar issue: https://github.com/PaddlePaddle/Paddle/issues/11755. Could you try restarting the job?

liushanshan07 commented 6 years ago

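I found a similar issue: #11755. Could you try restarting the job?

Sadly, I have already restarted it many times, and I get the same error every time.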

typhoonzero commented 6 years ago

I'm not sure what the cause is. Could you share the job so I can take a look? Can you reproduce it locally?

liushanshan07 commented 6 years ago

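I'm not sure what the cause is. Could you share the job so I can take a look? Can you reproduce it locally?

http://10.90.251.16:8001/webssh/ai00/9cc06588-7aba-5871-bc06-1ea5aa1c7ceb/10.255.124.20/1022649795.job-e6c5be26213d1db0-trainer_20181107_115939/9cc93a16-0c63-5615-a9ae-a9fc487c4988/stopped/

This is the web terminal, where you can see the job's configuration. I cannot reproduce the problem locally because we have no V100 cards locally. The script path is: /root/paddlejob/workspace/env_run/train.py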

typhoonzero commented 5 years ago

Hi, could you first try removing the ParallelExecutor fetch list and check whether the job runs?
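
One way to try this suggestion is sketched below. It assumes `pe` is an already constructed `fluid.ParallelExecutor` whose `run` method accepts `fetch_list` and `feed` keyword arguments, as in the Fluid 1.x API. The names `run_step` and `fetch_names` are hypothetical and do not come from the job's train.py.

```python
def run_step(pe, feed, fetch_names=None):
    """Run one training step on a ParallelExecutor-like object.

    With fetch_names left as None, an empty fetch list is passed, so the
    step runs without copying any variables back to the host.
    """
    fetch_list = list(fetch_names) if fetch_names else []
    # Assumption: pe.run(fetch_list=..., feed=...) matches the executor's API.
    return pe.run(fetch_list=fetch_list, feed=feed)
```

If the job runs with `run_step(pe, feed)` but still crashes with `run_step(pe, feed, ["loss"])` (where `"loss"` stands in for whatever variable the script fetches), that would point toward the fetch path, which matches the `FetchOpHandle` frames in the stack trace.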

lucywsq commented 5 years ago

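Hello, this issue has had no updates in the past three weeks, so we will close it today. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience the closure causes and appreciate your understanding. Thank you for supporting PaddlePaddle.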