chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

Bad termination / Segmentation fault MNIST test #64

Closed. Fhrozen closed this issue 7 years ago

Fhrozen commented 7 years ago

Hi there, I was trying to test ChainerMN, and after installing it and running the MNIST example I got this:


`[nelson-lab0:04968] *** Process received signal ***
[nelson-lab0:04968] Signal: Segmentation fault (11)
[nelson-lab0:04968] Signal code: Invalid permissions (2)
[nelson-lab0:04968] Failing at address: 0x2c0d820000
[nelson-lab0:04968] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f60496ad390]
[nelson-lab0:04968] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14d566)[0x7f6049420566]
[nelson-lab0:04968] [ 2] /usr/lib/libopen-pal.so.13(+0x2fcb7)[0x7f600f328cb7]
[nelson-lab0:04968] [ 3] /usr/lib/libmpi.so.12(ompi_datatype_sndrcv+0x54c)[0x7f600fa6d6bc]
[nelson-lab0:04968] [ 4] /usr/lib/libmpi.so.12(MPI_Alltoall+0x16c)[0x7f600fa6f67c]
[nelson-lab0:04968] [ 5] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x8dd99)[0x7f600fd82d99]
[nelson-lab0:04968] [ 6] python(PyEval_EvalFrameEx+0x68a)[0x4c468a]
[nelson-lab0:04968] [ 7] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [ 8] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [ 9] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [10] python[0x4de6fe]
[nelson-lab0:04968] [11] python(PyObject_Call+0x43)[0x4b0cb3]
[nelson-lab0:04968] [12] python(PyEval_EvalFrameEx+0x2ad1)[0x4c6ad1]
[nelson-lab0:04968] [13] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [14] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04968] [15] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [16] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04968] [17] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [18] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04968] [19] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04968] [20] python(PyEval_EvalCode+0x19)[0x4c2509]
[nelson-lab0:04968] [21] python[0x4f1def]
[nelson-lab0:04968] [22] python(PyRun_FileExFlags+0x82)[0x4ec652]
[nelson-lab0:04968] [23] python(PyRun_SimpleFileExFlags+0x191)[0x4eae31]
[nelson-lab0:04968] [24] python(Py_Main+0x68a)[0x49e14a]
[nelson-lab0:04968] [25] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f60492f3830]
[nelson-lab0:04968] [26] python(_start+0x29)[0x49d9d9]
[nelson-lab0:04968] *** End of error message ***
[nelson-lab0:04969] *** Process received signal ***
[nelson-lab0:04969] Signal: Segmentation fault (11)
[nelson-lab0:04969] Signal code: Invalid permissions (2)
[nelson-lab0:04969] Failing at address: 0x2c0d820000
[nelson-lab0:04969] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f4a9d3a1390]
[nelson-lab0:04969] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14d566)[0x7f4a9d114566]
[nelson-lab0:04969] [ 2] /usr/lib/libopen-pal.so.13(+0x2fcb7)[0x7f4a6301ccb7]
[nelson-lab0:04969] [ 3] /usr/lib/libmpi.so.12(ompi_datatype_sndrcv+0x54c)[0x7f4a637616bc]
[nelson-lab0:04969] [ 4] /usr/lib/libmpi.so.12(MPI_Alltoall+0x16c)[0x7f4a6376367c]
[nelson-lab0:04969] [ 5] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(+0x8dd99)[0x7f4a63a76d99]
[nelson-lab0:04969] [ 6] python(PyEval_EvalFrameEx+0x68a)[0x4c468a]
[nelson-lab0:04969] [ 7] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [ 8] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [ 9] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [10] python[0x4de6fe]
[nelson-lab0:04969] [11] python(PyObject_Call+0x43)[0x4b0cb3]
[nelson-lab0:04969] [12] python(PyEval_EvalFrameEx+0x2ad1)[0x4c6ad1]
[nelson-lab0:04969] [13] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [14] python(PyEval_EvalFrameEx+0x5d8f)[0x4c9d8f]
[nelson-lab0:04969] [15] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [16] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04969] [17] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [18] python(PyEval_EvalFrameEx+0x68d1)[0x4ca8d1]
[nelson-lab0:04969] [19] python(PyEval_EvalCodeEx+0x255)[0x4c2765]
[nelson-lab0:04969] [20] python(PyEval_EvalCode+0x19)[0x4c2509]
[nelson-lab0:04969] [21] python[0x4f1def]
[nelson-lab0:04969] [22] python(PyRun_FileExFlags+0x82)[0x4ec652]
[nelson-lab0:04969] [23] python(PyRun_SimpleFileExFlags+0x191)[0x4eae31]
[nelson-lab0:04969] [24] python(Py_Main+0x68a)[0x49e14a]
[nelson-lab0:04969] [25] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f4a9cfe7830]
[nelson-lab0:04969] [26] python(_start+0x29)[0x49d9d9]
[nelson-lab0:04969] *** End of error message ***

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 4968 RUNNING AT nelson-lab0
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
`

The first line reports a segmentation fault, and the second reports "Invalid permissions" (?). I am wondering whether I made a mistake during installation, or whether the problem is with the GPUs. My computer has 3 GPUs (2 Titan X and 1 960).

Btw, I do not have any problem when testing without GPUs.

Regards
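As a side note on the log above, the exit code 139 is consistent with the reported signal: by shell convention, a process killed by a signal is reported as 128 plus the signal number, and SIGSEGV is signal 11 on Linux. A minimal sketch of that decoding follows; the helper name `decode_exit_code` is made up for illustration and is not part of ChainerMN or MPI.

```python
import signal


def decode_exit_code(code):
    """Map a shell-style exit code (128 + signal number) to a signal name.

    Returns None when the code does not correspond to a known signal.
    """
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # 128 + n where n is not a valid signal number on this platform
        return None


print(decode_exit_code(139))
```

This only confirms that the processes died from a segmentation fault; it says nothing about the cause.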

iwiwi commented 7 years ago

At a glance, this seems to be related to the CUDA-awareness of your MPI. Could you check this? https://chainermn.readthedocs.io/en/latest/installation/troubleshooting.html
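One way to check this could look like the sketch below. It assumes the MPI in use is Open MPI (the backtrace shows `libmpi.so.12` and `libopen-pal.so.13`) and that the installed release exposes the `mpi_built_with_cuda_support` parameter in the output of `ompi_info --parsable --all`. The function names are hypothetical helpers, not part of ChainerMN, and the code needs Python 3.7 or later.

```python
import subprocess


def parse_cuda_support(ompi_info_output):
    """Look for the CUDA-support parameter in `ompi_info --parsable --all` output.

    Returns True or False if the parameter line is found, None otherwise.
    """
    for line in ompi_info_output.splitlines():
        if "mpi_built_with_cuda_support:value:" in line:
            return line.strip().endswith(":true")
    return None


def open_mpi_is_cuda_aware():
    """Run ompi_info and report CUDA-awareness; None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["ompi_info", "--parsable", "--all"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # ompi_info is missing from PATH or exited with an error
        return None
    return parse_cuda_support(result.stdout)
```

Calling `open_mpi_is_cuda_aware()` returns `False` when Open MPI reports that it was built without CUDA support, and `None` when `ompi_info` is unavailable or does not print the parameter. A `False` result would fit the crash inside `MPI_Alltoall` shown above, although it does not prove that this is the cause.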

iwiwi commented 7 years ago

Please reopen this if it does not address your issue. Thanks.