Open wanghaoshuang opened 7 years ago
This is a bug in parsing the optimization configs. Were any core files generated? Can you get the full call stack using gdb
and the core file?
This is a PaddleCloud job. How should I get the core file from PaddleCloud?
One way is to download the core file and then inspect it locally in a docker container.
The core file is located under /pfs/dlnel/home/<your email>/jobs/<job-name>
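For reference, a minimal sketch of how the downloaded core file could be inspected inside the container. It only builds the gdb command line; `-batch` and `-ex` are standard gdb options, but the executable and core file paths below are hypothetical placeholders, and the sketch assumes Python 3 (for `shlex.quote`) and that the container has the same binaries and libraries as the job.

```python
import shlex


def build_gdb_backtrace_command(executable, core_file):
    """Return the gdb argv that prints a backtrace for every thread, then exits."""
    return [
        "gdb",
        "-batch",                      # run the -ex commands non-interactively, then exit
        "-ex", "thread apply all bt",  # backtrace of all threads, not only the crashing one
        executable,                    # the program that produced the core (assumed: the python binary)
        core_file,                     # the core file downloaded from the job directory
    ]


if __name__ == "__main__":
    # Hypothetical paths: replace with the real interpreter and core file.
    cmd = build_gdb_backtrace_command("/usr/bin/python", "./core")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The printed command can be pasted into a shell in the container, or the list can be passed to `subprocess.run` directly.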
The stack trace from the core file looks like this:
#0 0x00007f6f556427fb in runtime.sched_getaffinity () at /usr/local/go/src/runtime/sys_linux_amd64.s:519
#1 0x00007f6f55a0bc82 in encoding/gob.(*Encoder).encodeArray (enc=0x7f6f55a0c518 <encoding/gob.(*Encoder).encodeInterface+472>, b=0xc4202207e0, value=..., op=
{void (struct encoding/gob.encInstr *, struct encoding/gob.encoderState *, reflect.Value)} 0xc4200cb808, elemIndir=1, length=140116168536416, helper=
{void (struct encoding/gob.encoderState *, reflect.Value, bool *)} 0xc4200cb820) at /usr/local/go/src/encoding/gob/encode.go:348
#2 0x000000c4200cb878 in ?? ()
#3 0x00007f6f55a0c518 in encoding/gob.(*Encoder).encodeInterface (enc=0x7f6f567a68e0, b=0xc42021c820, iv=...) at /usr/local/go/src/encoding/gob/encode.go:406
#4 0x000000c42021a740 in ?? ()
#5 0x00007f6f567a68e0 in typerel.* () from /usr/local/lib/python2.7/dist-packages/py_paddle/_swig_paddle.so
#6 0x000000c42021c820 in ?? ()
#7 0x0000000000000099 in ?? ()
#8 0x0000000000000001 in ?? ()
#9 0x00007f6f567a68e0 in typerel.* () from /usr/local/lib/python2.7/dist-packages/py_paddle/_swig_paddle.so
#10 0x000000c42021c820 in ?? ()
#11 0x0000000000000099 in ?? ()
#12 0x0000000000000000 in ?? ()
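Also, this problem can only be reproduced reliably when using the GPU; running on the CPU works fine. The core dump happened in cgo, in encoding/gob.encoderState.
This looks like a fairly tricky problem. Could there be some conflict between the cgo runtime and CUDA?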
Submitting a PaddleCloud fault tolerant job produces the following error:
submit.sh:
After updating to the latest paddle: