Fault tolerant job init params error

wanghaoshuang commented 7 years ago

Submitting a PaddleCloud fault-tolerant job produces the following error:

==========================train-trainer-zm4qs==========================
label selector: paddle-job-master=train, desired: 1
running pod list:  [('Running', '***')]
label selector: paddle-job=train, desired: 1
running pod list:  [('Running', '***')]
Starting training job:  /pfs/***/home/***/jobs/train, num_gradient_servers: 1, trainer_id:  0, version:
I0823 09:08:14.278625    21 Util.cpp:166] commandline:  --num_gradient_servers=1 --ports_num_for_sparse=1 --use_gpu=1 --trainer_id=0 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
[INFO 2017-08-23 09:08:17,608 layers.py:2479] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-08-23 09:08:17,613 layers.py:2604] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-08-23 09:08:17,617 layers.py:2479] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-08-23 09:08:17,620 layers.py:2604] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I0823 09:08:17.634304    21 GradientMachine.cpp:85] Initing parameters..
I0823 09:08:17.644143    21 GradientMachine.cpp:92] Init parameters done.
time="2017-08-23T09:08:17Z" level=info msg="Connected to etcd: http://****
"
time="2017-08-23T09:08:17Z" level=info msg="Trying to acquire lock at /init_ps/lock."
time="2017-08-23T09:08:17Z" level=info msg="Successfully acquired lock at /init_ps/lock."
time="2017-08-23T09:08:17Z" level=info msg="Trainer selected."
I0823 09:08:17.654239    21 NewRemoteParameterUpdater.cpp:68] paddle_begin_init_params start
I0823 09:08:17.654561    21 NewRemoteParameterUpdater.cpp:71] old param config: name: "___conv_pool_0___conv.w0"
size: 500
initial_mean: 0
initial_std: 0.282842712474619
initial_strategy: 0
initial_smart: false
para_id: 0
*** Aborted at 1503479297 (unix time) try "date -d @1503479297" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x1020ea00750) received by PID 21 (TID 0x7f12cb950700) from PID 245368656; stack trace: ***
    @     0x7f124834e86d runtime.sigfwd

submit.sh:

paddlecloud submit \
-jobname train \
-cpu 1 \
-gpu 1 \
-memory 3Gi \
-parallelism 1 \
-pscpu 1 \
-pservers 1 \
-psmemory 1Gi \
-passes 1 \
-faulttolerant \
-entry "python train_ft.py train" ./recognize_digits/

After updating to the latest Paddle:

==========================train-trainer-107vp==========================
label selector: paddle-job-master=train, desired: 1
current cnt: 0 sleep for 5 seconds...
label selector: paddle-job=train, desired: 1
Starting training job:  /***/home/***/jobs/train, num_gradient_servers: 1, trainer_id:  0, version:
I0823 10:59:35.705291    34 Util.cpp:166] commandline:  --num_gradient_servers=1 --ports_num_for_sparse=1 --use_gpu=1 --trainer_id=0 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
[INFO 2017-08-23 10:59:39,575 layers.py:2479] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-08-23 10:59:39,576 layers.py:2604] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-08-23 10:59:39,577 layers.py:2479] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-08-23 10:59:39,577 layers.py:2604] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I0823 10:59:39.585695    34 GradientMachine.cpp:85] Initing parameters..
I0823 10:59:39.591514    34 GradientMachine.cpp:92] Init parameters done.
time="2017-08-23T10:59:39Z" level=info msg="Connected to etcd: http://***
"
time="2017-08-23T10:59:39Z" level=info msg="Trying to acquire lock at /init_ps/lock."
time="2017-08-23T10:59:39Z" level=info msg="Successfully acquired lock at /init_ps/lock."
time="2017-08-23T10:59:39Z" level=info msg="Trainer selected."
I0823 10:59:39.620086    34 NewRemoteParameterUpdater.cpp:68] paddle_begin_init_params start
E0823 10:59:39.620120    34 NewRemoteParameterUpdater.cpp:109] got unsupported v1 learning_rate_schedule config: poly, set to const
*** Aborted at 1503485979 (unix time) try "date -d @1503485979" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x1020ea00750) received by PID 34 (TID 0x7f59c6443700) from PID 245368656; stack trace: ***
    @     0x7f5971fa886d runtime.sigfwd

typhoonzero commented 7 years ago

This is a bug in parsing the optimization configs. Were any core files generated? Can you get the full call stack using gdb and the core file?

wanghaoshuang commented 7 years ago

This is a PaddleCloud job. How should I get the core file from PaddleCloud?

typhoonzero commented 7 years ago

One way is to download the core file and then open it locally with gdb in a Docker container.

Yancey1989 commented 7 years ago

The core file is located under /pfs/dlnel/home/<your email>/jobs/<job-name>
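
Once the core file has been downloaded, the gdb step suggested above could be sketched roughly as follows. This is a minimal sketch, not an official PaddleCloud procedure: the helper name `core_backtrace` is hypothetical, and it assumes the crashed process was the Python interpreter that ran the trainer, so the executable defaults to `python`.

```shell
#!/bin/sh
# Hypothetical helper: print backtraces from a core file with gdb in batch mode.
# Usage: core_backtrace <core-file> [executable]
core_backtrace() {
    core_file="$1"
    if [ ! -f "$core_file" ]; then
        echo "core file not found: $core_file" >&2
        return 1
    fi
    # Assumption: the trainer ran under the Python interpreter, so default to it.
    exe="${2:-python}"
    # -batch makes gdb exit after the commands; each -ex runs one gdb command.
    gdb -batch -ex "bt" -ex "thread apply all bt" "$exe" "$core_file"
}
```

For the symbols to resolve (for example those in `_swig_paddle.so`), this probably needs to run inside a container started from the same image the job used, with the directory holding the core file mounted, e.g. `docker run --rm -it -v "$PWD":/work <job-image> /bin/bash`, where `<job-image>` is a placeholder for that image.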

typhoonzero commented 7 years ago

The stack trace in the core file looks like this:

#0  0x00007f6f556427fb in runtime.sched_getaffinity () at /usr/local/go/src/runtime/sys_linux_amd64.s:519
#1  0x00007f6f55a0bc82 in encoding/gob.(*Encoder).encodeArray (enc=0x7f6f55a0c518 <encoding/gob.(*Encoder).encodeInterface+472>, b=0xc4202207e0, value=..., op=
    {void (struct encoding/gob.encInstr *, struct encoding/gob.encoderState *, reflect.Value)} 0xc4200cb808, elemIndir=1, length=140116168536416, helper=
    {void (struct encoding/gob.encoderState *, reflect.Value, bool *)} 0xc4200cb820) at /usr/local/go/src/encoding/gob/encode.go:348
#2  0x000000c4200cb878 in ?? ()
#3  0x00007f6f55a0c518 in encoding/gob.(*Encoder).encodeInterface (enc=0x7f6f567a68e0, b=0xc42021c820, iv=...) at /usr/local/go/src/encoding/gob/encode.go:406
#4  0x000000c42021a740 in ?? ()
#5  0x00007f6f567a68e0 in typerel.* () from /usr/local/lib/python2.7/dist-packages/py_paddle/_swig_paddle.so
#6  0x000000c42021c820 in ?? ()
#7  0x0000000000000099 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x00007f6f567a68e0 in typerel.* () from /usr/local/lib/python2.7/dist-packages/py_paddle/_swig_paddle.so
#10 0x000000c42021c820 in ?? ()
#11 0x0000000000000099 in ?? ()
#12 0x0000000000000000 in ?? ()

Also, this problem reproduces reliably only when using the GPU; running on the CPU works normally. The core dump lands in encoding/gob.encoderState on the cgo side.

This looks like a fairly troublesome problem. Maybe the cgo runtime conflicts with CUDA somehow?

PaddlePaddle / PaddleCloud

Fault tolerant job init params error #340