PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Inference using multiple threads hangs and the results are wrong #2565

Closed lcy-seso closed 7 years ago

lcy-seso commented 7 years ago

I am running text generation with an encoder-decoder model; here is my code: https://github.com/lcy-seso/models/blob/refine_seq2seq/nmt_without_attention/generate.py.

I found that:

  1. If trainer_count is set to a value larger than 1, the generation process hangs the second time infer is called (see the sketch below for the call pattern).
  2. The prediction results differ between trainer_count=1 and trainer_count > 1.
  3. This bug occurs in both CPU and GPU mode.
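
For reference, the failing call pattern is roughly the following. This is only a minimal sketch assuming the paddle.v2 API of that time; beam_gen, parameters and test_batches are placeholders for what generate.py actually builds, not real names from this repo.

```python
# Minimal sketch of the call pattern that triggers the hang (placeholders,
# not the actual generate.py code).
import paddle.v2 as paddle

paddle.init(use_gpu=True, trainer_count=4)  # no hang is observed with trainer_count=1

for batch in test_batches:  # test_batches: the prepared source-sentence batches
    # With trainer_count > 1 the first call returns (with wrong results);
    # the second call never returns -- it blocks inside GradientMachine::forward.
    beams = paddle.infer(output_layer=beam_gen,   # the beam-search generation layer
                         parameters=parameters,   # the trained parameters
                         input=batch,
                         field=['prob', 'id'])
```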

The outputs with trainer_count=1 and use_gpu=True look like this:

Les <unk> se <unk> au sujet de la <unk> des <unk> alors que de <unk> <unk> sont en jeu
-119.7212       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> . <e>
-170.2804       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> , <unk> <unk> <unk>
-170.3101       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the
-170.5066       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> <unk>
-170.5434       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of <unk>

But with trainer_count=4 and use_gpu=True, the outputs are different:

Les <unk> se <unk> au sujet de la <unk> des <unk> alors que de <unk> <unk> sont en jeu
-8.0064 <e>
-16.0127        <s> <e>
-16.0127        the <e>
-16.0127        , <e>
-16.0127        <unk> <e>
livc commented 7 years ago

I also encountered this bug when running language_model.

When I set trainer_count=4, it hangs the second time infer is called.

gdb log:

(gdb) bt
#0  0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
 paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
#2  0x00007fffef932ad3 in _wrap_GradientMachine_forward () at /home/lizhao/Paddle/build/paddle/api/PaddlePYTHON_wrap.cxx:22906
#3  0x00007ffff7d1e3a3 in ext_do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4331
#4  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2705
#5  0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff1c2cab0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=4, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#6  0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#7  call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#8  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#9  0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b38d30, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=2, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#10 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#11 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#12 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#13 0x00007ffff7ca16b7 in gen_send_ex (gen=0x37966e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#14 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#15 0x00007ffff7ca16b7 in gen_send_ex (gen=0x379e1e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#16 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#17 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x37a9bb0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=1, kws=<value optimized out>, kwcount=1, defs=0x37a6468, defcount=1, closure=0x0)
    at Python/ceval.c:3253
#18 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#19 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#20 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#21 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b26830, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=3, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#22 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#23 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#24 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#25 0x00007ffff7d1ec56 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#26 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#27 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#28 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b28ab0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#29 0x00007ffff7d20242 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>)
    at Python/ceval.c:667
#30 0x00007ffff7d3a62c in run_mod (mod=<value optimized out>, filename=<value optimized out>, globals=0x640160, locals=0x640160,
    flags=<value optimized out>, arena=<value optimized out>) at Python/pythonrun.c:1353
#31 0x00007ffff7d3a700 in PyRun_FileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", start=<value optimized out>, globals=
    0x640160, locals=0x640160, closeit=1, flags=0x7fffffffdd10) at Python/pythonrun.c:1339
#32 0x00007ffff7d3bc0c in PyRun_SimpleFileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", closeit=1, flags=0x7fffffffdd10)
    at Python/pythonrun.c:943
#33 0x00007ffff7d4d4cc in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:639
#34 0x000000318ae1ecdd in __libc_start_main () from /lib64/libc.so.6
#35 0x0000000000400659 in _start ()
(gdb) f 1
#1  0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
 paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
354       void waitOutArgsReady() { outArgsReadySem_.wait(); }
(gdb) l
349
350       void start();
351
352       void onPassEnd() { gradientMachine_->onPassEnd(); }
353
354       void waitOutArgsReady() { outArgsReadySem_.wait(); }
355
356       void notifyTaskReady() { taskReadySem_.post(); }
357
358       int getDeviceId() const { return deviceId_; }
(gdb) i threads
  29 Thread 0x7fffa29cc700 (LWP 15404)  0x000000318aeddfc3 in poll () from /lib64/libc.so.6
  28 Thread 0x7fffa33cd700 (LWP 15403)  0x000000318aee99af in accept4 () from /lib64/libc.so.6
* 1 Thread 0x7ffff7c3b700 (LWP 14790)  0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
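
What the backtrace shows is the main thread stuck in waitOutArgsReady(), i.e. waiting on a semaphore (outArgsReadySem_) that a worker thread is expected to post after finishing its forward pass, while notifyTaskReady() posts the companion taskReadySem_. The following is not PaddlePaddle code, just a minimal Python analogy of that handshake and of one way such a wait can block forever (for example, if the worker is never handed a second task):

```python
# Not PaddlePaddle code: a rough Python analogy of the main-thread/worker
# semaphore handshake visible in the backtrace (taskReadySem_/outArgsReadySem_).
import threading

task_ready = threading.Semaphore(0)      # analogue of taskReadySem_
out_args_ready = threading.Semaphore(0)  # analogue of outArgsReadySem_

def worker():
    while True:
        task_ready.acquire()      # wait for the main thread's notifyTaskReady()
        # ... run the forward pass ...
        out_args_ready.release()  # lets the main thread's waitOutArgsReady() return

threading.Thread(target=worker, daemon=True).start()

# First inference: hand a task to the worker and wait for its output -- works.
task_ready.release()
out_args_ready.acquire()

# Second inference: if, for whatever reason, no task reaches the worker, the
# wait never completes. A timeout is used here only for demonstration; the
# real code waits unconditionally, which is exactly the hang seen in sem_wait().
hung = not out_args_ready.acquire(timeout=2.0)
print("would hang:", hung)  # True
```
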
livc commented 7 years ago

Like @lcy-seso, when I set trainer_count=4, the first infer result is [[ 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]], which is obviously wrong.
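
For what it's worth, that printed vector is exactly the uniform distribution (ten entries of 0.1 summing to 1). If these are softmax probabilities, a perfectly uniform output is what a softmax produces for a constant input; that is only a possible hint, not a confirmed diagnosis. A quick check in plain numpy (independent of Paddle):

```python
import numpy as np

probs = np.array([[0.1] * 10])                   # the first infer result reported above
print(np.allclose(probs, 1.0 / probs.shape[1]))  # True: exactly uniform

# Softmax of any constant input vector is uniform, e.g. an all-zero input:
x = np.zeros(10)
print(np.allclose(np.exp(x) / np.exp(x).sum(), 0.1))  # True
```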

lcy-seso commented 7 years ago

This issue is a duplicate of https://github.com/PaddlePaddle/Paddle/issues/2534, so I am closing it. The problem-solving process can be tracked in https://github.com/PaddlePaddle/Paddle/issues/2534.