Status: Closed (1292765944 closed this issue 6 years ago)
I could reproduce the error in python2. I'll fix it.
For python3, I couldn't reproduce it.
@Hakuyume so what might the problem be for python3? I don't know how to fix it. Thanks for your help.
@1292765944
so what might the problem be for python3?
Could you try the MNIST example of ChainerMN? https://github.com/chainer/chainermn/tree/master/examples/mnist
@Hakuyume I hit the same problem as with SSD when running on GPU, while it works fine on CPU.
On GPU:
$ mpiexec -n 1 python examples/mnist/train_mnist.py --gpu
==========================================
Num process (COMM_WORLD): 1
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
[cernet-Precision-Tower-7910:05048] *** Process received signal ***
[cernet-Precision-Tower-7910:05048] Signal: Segmentation fault (11)
[cernet-Precision-Tower-7910:05048] Signal code: Invalid permissions (2)
[cernet-Precision-Tower-7910:05048] Failing at address: 0x50cf00000
[cernet-Precision-Tower-7910:05048] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fd924b38330]
[cernet-Precision-Tower-7910:05048] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x9aa50) [0x7fd9240eca50]
[cernet-Precision-Tower-7910:05048] [ 2] /usr/lib/libmpi.so.1(+0x10362d) [0x7fd91edd162d]
[cernet-Precision-Tower-7910:05048] [ 3] /usr/lib/libmpi.so.1(ompi_datatype_sndrcv+0x502) [0x7fd91ed32392]
[cernet-Precision-Tower-7910:05048] [ 4] /usr/lib/libmpi.so.1(PMPI_Alltoall+0x154) [0x7fd91ed33ad4]
[cernet-Precision-Tower-7910:05048] [ 5] /home/cernet/software/anaconda2/lib/python2.7/site-packages/mpi4py/MPI.so(+0xe5815) [0x7fd91f134815]
[cernet-Precision-Tower-7910:05048] [ 6] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x7b35) [0x7fd924e431e5]
[cernet-Precision-Tower-7910:05048] [ 7] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8c95) [0x7fd924e44345]
[cernet-Precision-Tower-7910:05048] [ 8] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8c95) [0x7fd924e44345]
[cernet-Precision-Tower-7910:05048] [ 9] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7fd924e44c3e]
[cernet-Precision-Tower-7910:05048] [10] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(+0x79a61) [0x7fd924dbfa61]
[cernet-Precision-Tower-7910:05048] [11] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53) [0x7fd924d8fe93]
[cernet-Precision-Tower-7910:05048] [12] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x61d6) [0x7fd924e41886]
[cernet-Precision-Tower-7910:05048] [13] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8c95) [0x7fd924e44345]
[cernet-Precision-Tower-7910:05048] [14] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8c95) [0x7fd924e44345]
[cernet-Precision-Tower-7910:05048] [15] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7fd924e44c3e]
[cernet-Precision-Tower-7910:05048] [16] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47) [0x7fd924e441f7]
[cernet-Precision-Tower-7910:05048] [17] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7fd924e44c3e]
[cernet-Precision-Tower-7910:05048] [18] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47) [0x7fd924e441f7]
[cernet-Precision-Tower-7910:05048] [19] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7fd924e44c3e]
[cernet-Precision-Tower-7910:05048] [20] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x7fd924e44d52]
[cernet-Precision-Tower-7910:05048] [21] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyRun_FileExFlags+0xb0) [0x7fd924e65450]
[cernet-Precision-Tower-7910:05048] [22] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xef) [0x7fd924e6562f]
[cernet-Precision-Tower-7910:05048] [23] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(Py_Main+0xca4) [0x7fd924e7afd4]
[cernet-Precision-Tower-7910:05048] [24] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fd924073f45]
[cernet-Precision-Tower-7910:05048] [25] python() [0x400729]
[cernet-Precision-Tower-7910:05048] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 5048 on node cernet-Precision-Tower-7910 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
On CPU:
$ mpiexec -n 4 python examples/mnist/train_mnist.py
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
==========================================
Num process (COMM_WORLD): 4
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.296499 0.109997 0.913 0.9674 45.1247
2 0.091719 0.0788497 0.971533 0.9746 88.971
3 0.0552213 0.0735168 0.983 0.9782 136.086
4 0.0352824 0.0759333 0.988467 0.9765 181.252
5 0.0237484 0.0657624 0.993267 0.9807 225.939
6 0.0156474 0.0716794 0.995533 0.9794 270.866
7 0.0151741 0.0710823 0.995067 0.9802 319.324
8 0.0100485 0.0851148 0.996267 0.9769 369.219
9 0.0100774 0.0738104 0.996133 0.9814 417.197
10 0.0104382 0.0925425 0.9968 0.9781 462.58
11 0.0104268 0.0761318 0.996467 0.9804 508.408
12 0.00994524 0.0808251 0.996267 0.9811 552.214
13 0.00767433 0.0926397 0.9974 0.9774 597.125
14 0.011684 0.0878579 0.9964 0.981 641.157
15 0.00823742 0.0796076 0.997133 0.9831 684.464
16 0.0112519 0.0868928 0.9968 0.9808 727.345
17 0.00445278 0.0724809 0.998533 0.9836 770.573
18 0.000801414 0.080557 0.999933 0.9846 815.599
19 0.00124838 0.116107 0.9998 0.9779 859.786
20 0.00412491 0.111526 0.998467 0.978 904.204
Perhaps this document will help you: http://chainermn.readthedocs.io/en/latest/installation/troubleshooting.html
@Hakuyume Thanks! My problem is solved. Open MPI was installed incorrectly.
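For anyone landing here with a similar trace: the thread does not say what exactly was wrong with the Open MPI install, but one commonly reported cause of a segfault inside PMPI_Alltoall when GPU buffers are passed is an Open MPI build without CUDA support. The Open MPI FAQ suggests inspecting `ompi_info --parsable --all` for the `mpi_built_with_cuda_support` flag. Below is a minimal sketch of that check in Python 3. The helper names `reports_cuda_support` and `check_local_open_mpi` are made up for illustration, and the sketch assumes `ompi_info` is on PATH and belongs to the same Open MPI that mpi4py was built against.

```python
import shutil
import subprocess


def reports_cuda_support(ompi_info_output):
    """Parse `ompi_info --parsable --all` output.

    Returns True or False if the mpi_built_with_cuda_support flag is
    present, or None if the flag does not appear at all.
    """
    for line in ompi_info_output.splitlines():
        if "mpi_built_with_cuda_support:value:" in line:
            # The parsable format ends the line with ":true" or ":false".
            return line.rsplit(":", 1)[-1].strip() == "true"
    return None


def check_local_open_mpi():
    """Run ompi_info if it is on PATH; return None when it is missing."""
    if shutil.which("ompi_info") is None:
        return None
    result = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return reports_cuda_support(result.stdout)


if __name__ == "__main__":
    print("Open MPI built with CUDA support:", check_local_open_mpi())
```

A result of False or None would suggest rebuilding Open MPI with CUDA support (or fixing PATH so the intended install is picked up) and then reinstalling mpi4py against it, but that is a guess about this class of failure, not a confirmed diagnosis of this issue.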
In my experiments, the script train_multi.py is used to train SSD300 with 2 GPUs. The errors shown below occur with both python2 and python3.
chainer version: 4.1.0
chainermn version: 1.3.0
chainercv version: master
Log of python version: 2.7.13 (anaconda)
Log of python version: 3.6.5 (anaconda)