I also encountered this bug when running language_model. I set trainer_count=4, and it hangs when infer is called for the second time.
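A minimal sketch of the reproduction, assuming the paddle.v2 Python API (paddle.init / paddle.infer); the toy network below is only a hypothetical stand-in for the real language model:

```python
import paddle.v2 as paddle

# With trainer_count > 1 the second paddle.infer call below never returns;
# the main thread blocks in MultiGradientMachine::getOutArgs (see gdb log).
paddle.init(use_gpu=True, trainer_count=4)

# Hypothetical toy network standing in for the real language model.
x = paddle.layer.data(name="x", type=paddle.data_type.dense_vector(10))
prob = paddle.layer.fc(input=x, size=10, act=paddle.activation.Softmax())
parameters = paddle.parameters.create(prob)

batch = [([0.0] * 10,)]

print(paddle.infer(output_layer=prob, parameters=parameters, input=batch))  # first call returns
print(paddle.infer(output_layer=prob, parameters=parameters, input=batch))  # second call hangs
```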
gdb log:
(gdb) bt
#0 0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
#2 0x00007fffef932ad3 in _wrap_GradientMachine_forward () at /home/lizhao/Paddle/build/paddle/api/PaddlePYTHON_wrap.cxx:22906
#3 0x00007ffff7d1e3a3 in ext_do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4331
#4 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2705
#5 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff1c2cab0, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=4, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
at Python/ceval.c:3253
#6 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#7 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#8 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#9 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b38d30, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=2, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
at Python/ceval.c:3253
#10 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#11 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#12 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#13 0x00007ffff7ca16b7 in gen_send_ex (gen=0x37966e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#14 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#15 0x00007ffff7ca16b7 in gen_send_ex (gen=0x379e1e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#16 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#17 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x37a9bb0, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=1, kws=<value optimized out>, kwcount=1, defs=0x37a6468, defcount=1, closure=0x0)
at Python/ceval.c:3253
#18 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#19 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#20 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#21 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b26830, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=3, defs=0x0, defcount=0, closure=0x0)
at Python/ceval.c:3253
#22 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#23 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#24 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#25 0x00007ffff7d1ec56 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#26 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#27 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#28 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b28ab0, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
at Python/ceval.c:3253
#29 0x00007ffff7d20242 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>)
at Python/ceval.c:667
#30 0x00007ffff7d3a62c in run_mod (mod=<value optimized out>, filename=<value optimized out>, globals=0x640160, locals=0x640160,
flags=<value optimized out>, arena=<value optimized out>) at Python/pythonrun.c:1353
#31 0x00007ffff7d3a700 in PyRun_FileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", start=<value optimized out>, globals=
0x640160, locals=0x640160, closeit=1, flags=0x7fffffffdd10) at Python/pythonrun.c:1339
#32 0x00007ffff7d3bc0c in PyRun_SimpleFileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", closeit=1, flags=0x7fffffffdd10)
at Python/pythonrun.c:943
#33 0x00007ffff7d4d4cc in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:639
#34 0x000000318ae1ecdd in __libc_start_main () from /lib64/libc.so.6
#35 0x0000000000400659 in _start ()
(gdb) f 1
#1 0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
354 void waitOutArgsReady() { outArgsReadySem_.wait(); }
(gdb) l
349
350 void start();
351
352 void onPassEnd() { gradientMachine_->onPassEnd(); }
353
354 void waitOutArgsReady() { outArgsReadySem_.wait(); }
355
356 void notifyTaskReady() { taskReadySem_.post(); }
357
358 int getDeviceId() const { return deviceId_; }
(gdb) i threads
29 Thread 0x7fffa29cc700 (LWP 15404) 0x000000318aeddfc3 in poll () from /lib64/libc.so.6
28 Thread 0x7fffa33cd700 (LWP 15403) 0x000000318aee99af in accept4 () from /lib64/libc.so.6
* 1 Thread 0x7ffff7c3b700 (LWP 14790) 0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
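Frame 1 shows the main thread blocked in waitOutArgsReady() on outArgsReadySem_, so presumably no worker thread posts that semaphore after the first pass. A minimal sketch of the same wait/post pattern in plain Python threading (not Paddle code), just to illustrate how the hang manifests:

```python
import threading

out_args_ready = threading.Semaphore(0)  # stands in for outArgsReadySem_
task_ready = threading.Semaphore(0)      # stands in for taskReadySem_

def worker():
    # Worker serves exactly one task, then exits (hypothetical failure mode).
    task_ready.acquire()
    out_args_ready.release()

threading.Thread(target=worker).start()

task_ready.release()      # first infer: hand a task to the worker ...
out_args_ready.acquire()  # ... and collect its output; returns fine

task_ready.release()      # second infer: no worker is listening any more
out_args_ready.acquire()  # blocks forever, like sem_wait() in frame #0
```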
And the same as @lcy-seso: when I set trainer_count=4, the first infer result is [[ 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]], which is obviously wrong.
I found this issue is a duplicate of https://github.com/PaddlePaddle/Paddle/issues/2534, so I am closing it. We can track the problem-solving process in https://github.com/PaddlePaddle/Paddle/issues/2534.
I am running text generation using the encoder-decoder model; here is my code: https://github.com/lcy-seso/models/blob/refine_seq2seq/nmt_without_attention/generate.py.
I found that:

1. If trainer_count is set larger than 1, the generation process hangs when infer is called the second time.
2. The generation results are different between trainer_count=1 and trainer_count > 1.

The outputs when setting trainer_count=1 and use_gpu=True go like this:

but when setting trainer_count=4 and use_gpu=True, the outputs are different: