chainer / chainercv

ChainerCV: a Library for Deep Learning in Computer Vision
MIT License

SSD training with multiple-gpu #622

Closed 1292765944 closed 6 years ago

1292765944 commented 6 years ago

I am using the script train_multi.py to train SSD300 with 2 GPUs. It fails under both Python 2 and Python 3; the errors are shown below. Versions: chainer 4.1.0, chainermn 1.3.0, chainercv master.

Log with Python 2.7.13 (Anaconda):

$ mpiexec -n 2 python train_multi.py --model ssd300
Traceback (most recent call last):
  File "train_multi.py", line 143, in <module>
    main()
  File "train_multi.py", line 91, in main
    train = chainermn.scatter_dataset(train, comm, shuffle=True)
  File "/home/cernet/software/anaconda2/lib/python2.7/site-packages/chainermn/datasets/scatter_dataset.py", line 51, in scatter_dataset
    data = comm.bcast_obj(data, max_buf_len=max_buf_len, root=0)
  File "/home/cernet/software/anaconda2/lib/python2.7/site-packages/chainermn/communicators/mpi_communicator_base.py", line 505, in bcast_obj
    root=root)
  File "/home/cernet/software/anaconda2/lib/python2.7/site-packages/chainermn/communicators/_communication_utility.py", line 141, in chunked_bcast_obj
    pickled_bytes = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 1380, in dumps
    Pickler(file, protocol).dump(obj)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 425, in save_reduce
    save(state)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 425, in save_reduce
    save(state)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 692, in _batch_setitems
    save(v)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 425, in save_reduce
    save(state)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 396, in save_reduce
    save(cls)
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernet/software/anaconda2/lib/python2.7/pickle.py", line 754, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <type 'instancemethod'>: it's not found as __builtin__.instancemethod
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
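The traceback above fails inside `pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)`, which `chainermn.scatter_dataset` calls to broadcast the dataset. A minimal sketch of checking picklability before launching an MPI job follows. The helper name `check_picklable` and the dict used as a dataset stand-in are hypothetical and not part of ChainerMN or ChainerCV. The lambda is only a convenient unpicklable object on Python 3; the object that failed in this issue was an `instancemethod` on Python 2.

```python
import pickle


def check_picklable(obj):
    """Return (True, None) if obj pickles, else (False, error description)."""
    try:
        # Same call that appears in the traceback.
        pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    except Exception as exc:  # PicklingError, AttributeError, TypeError, ...
        return False, "{}: {}".format(type(exc).__name__, exc)
    return True, None


# Hypothetical stand-ins for a dataset object (assumption, for illustration).
plain = {"samples": [1, 2, 3], "transform": abs}
with_lambda = {"samples": [1, 2, 3], "transform": lambda x: 2 * x}

print(check_picklable(plain))
print(check_picklable(with_lambda))
```

Running such a check in a single process can surface the failing object without going through `mpiexec`.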

Log with Python 3.6.5 (Anaconda):

$ mpiexec -n 2 python train_multi.py --model ssd300
[cernet-Precision-Tower-7910:10814] *** Process received signal ***
[cernet-Precision-Tower-7910:10814] Signal: Segmentation fault (11)
[cernet-Precision-Tower-7910:10814] Signal code: Invalid permissions (2)
[cernet-Precision-Tower-7910:10814] Failing at address: 0x5d4500000
[cernet-Precision-Tower-7910:10814] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fea2dbd1330]
[cernet-Precision-Tower-7910:10814] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x9aa50) [0x7fea2d892a50]
[cernet-Precision-Tower-7910:10814] [ 2] /usr/lib/libmpi.so.1(+0x10362d) [0x7fe9dd84662d]
[cernet-Precision-Tower-7910:10814] [ 3] /usr/lib/libmpi.so.1(ompi_datatype_sndrcv+0x502) [0x7fe9dd7a7392]
[cernet-Precision-Tower-7910:10814] [ 4] /usr/lib/libmpi.so.1(PMPI_Alltoall+0x154) [0x7fe9dd7a8ad4]
[cernet-Precision-Tower-7910:10814] [ 5] /home/cernet/software/anaconda2/envs/Python3/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0xdf591) [0x7fe9ddba3591]
[cernet-Precision-Tower-7910:10814] [ 6] python(_PyCFunction_FastCallDict+0x154) [0x7fea2e112b94]
[cernet-Precision-Tower-7910:10814] [ 7] python(+0x19e67c) [0x7fea2e1a267c]
[cernet-Precision-Tower-7910:10814] [ 8] python(_PyEval_EvalFrameDefault+0x2fa) [0x7fea2e1c4cba]
[cernet-Precision-Tower-7910:10814] [ 9] python(+0x19870b) [0x7fea2e19c70b]
[cernet-Precision-Tower-7910:10814] [10] python(+0x19e755) [0x7fea2e1a2755]
[cernet-Precision-Tower-7910:10814] [11] python(_PyEval_EvalFrameDefault+0x2fa) [0x7fea2e1c4cba]
[cernet-Precision-Tower-7910:10814] [12] python(+0x19870b) [0x7fea2e19c70b]
[cernet-Precision-Tower-7910:10814] [13] python(+0x19e755) [0x7fea2e1a2755]
[cernet-Precision-Tower-7910:10814] [14] python(_PyEval_EvalFrameDefault+0x2fa) [0x7fea2e1c4cba]
[cernet-Precision-Tower-7910:10814] [15] python(+0x197a94) [0x7fea2e19ba94]
[cernet-Precision-Tower-7910:10814] [16] python(_PyFunction_FastCallDict+0x1bb) [0x7fea2e19ce1b]
[cernet-Precision-Tower-7910:10814] [17] python(_PyObject_FastCallDict+0x26f) [0x7fea2e112f5f]
[cernet-Precision-Tower-7910:10814] [18] python(_PyObject_Call_Prepend+0x63) [0x7fea2e117a03]
[cernet-Precision-Tower-7910:10814] [19] python(PyObject_Call+0x3e) [0x7fea2e11299e]
[cernet-Precision-Tower-7910:10814] [20] python(_PyEval_EvalFrameDefault+0x1ab0) [0x7fea2e1c6470]
[cernet-Precision-Tower-7910:10814] [21] python(+0x19870b) [0x7fea2e19c70b]
[cernet-Precision-Tower-7910:10814] [22] python(+0x19e755) [0x7fea2e1a2755]
[cernet-Precision-Tower-7910:10814] [23] python(_PyEval_EvalFrameDefault+0x2fa) [0x7fea2e1c4cba]
[cernet-Precision-Tower-7910:10814] [24] python(+0x19870b) [0x7fea2e19c70b]
[cernet-Precision-Tower-7910:10814] [25] python(+0x19e755) [0x7fea2e1a2755]
[cernet-Precision-Tower-7910:10814] [26] python(_PyEval_EvalFrameDefault+0x2fa) [0x7fea2e1c4cba]
[cernet-Precision-Tower-7910:10814] [27] python(+0x197dae) [0x7fea2e19bdae]
[cernet-Precision-Tower-7910:10814] [28] python(+0x198941) [0x7fea2e19c941]
[cernet-Precision-Tower-7910:10814] [29] python(+0x19e755) [0x7fea2e1a2755]
[cernet-Precision-Tower-7910:10814] *** End of error message ***
/home/cernet/software/anaconda2/envs/Python3/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 4 leaked semaphores to clean up at shutdown
  len(cache))
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 10814 on node cernet-Precision-Tower-7910 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
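Each backtrace frame above names the binary it came from, so listing the distinct binaries is a quick way to see which MPI library was actually loaded. In this log the frames inside MPI come from `/usr/lib/libmpi.so.1`, while Python and mpi4py live under the Anaconda environment. The sketch below only extracts that information and does not diagnose anything. The function name `libraries_in_backtrace` and the regular expression are assumptions written for the frame format shown in this log.

```python
import re

# Matches frames such as:
# "[host:10814] [ 2] /usr/lib/libmpi.so.1(+0x10362d) [0x7fe9dd84662d]"
FRAME_RE = re.compile(r"\[\s*\d+\]\s+(\S+?)\((.*?)\)\s+\[0x[0-9a-f]+\]")


def libraries_in_backtrace(log_text):
    """Return the distinct binaries named in the frames, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Three frames copied from the log above.
sample = """\
[cernet-Precision-Tower-7910:10814] [ 2] /usr/lib/libmpi.so.1(+0x10362d) [0x7fe9dd84662d]
[cernet-Precision-Tower-7910:10814] [ 4] /usr/lib/libmpi.so.1(PMPI_Alltoall+0x154) [0x7fe9dd7a8ad4]
[cernet-Precision-Tower-7910:10814] [ 6] python(_PyCFunction_FastCallDict+0x154) [0x7fea2e112b94]
"""

for path in libraries_in_backtrace(sample):
    print(path)
```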
Hakuyume commented 6 years ago

I could reproduce the error in python2. I'll fix it.

I couldn't reproduce it in Python 3.

1292765944 commented 6 years ago

@Hakuyume So what might the problem be for Python 3? I don't know how to fix it. Thanks for your help.

Hakuyume commented 6 years ago

@1292765944

> So what might the problem be for Python 3?

Could you try the MNIST example of ChainerMN? https://github.com/chainer/chainermn/tree/master/examples/mnist

1292765944 commented 6 years ago

@Hakuyume I hit the same problem as with SSD when running on GPU; on CPU it works.

On GPU:

$ mpiexec -n 1 python examples/mnist/train_mnist.py --gpu
==========================================
Num process (COMM_WORLD): 1
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
[cernet-Precision-Tower-7910:05048] *** Process received signal ***
[cernet-Precision-Tower-7910:05048] Signal: Segmentation fault (11)
[cernet-Precision-Tower-7910:05048] Signal code: Invalid permissions (2)
[cernet-Precision-Tower-7910:05048] Failing at address: 0x50cf00000
[cernet-Precision-Tower-7910:05048] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fd924b38330]
[cernet-Precision-Tower-7910:05048] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x9aa50) [0x7fd9240eca50]
[cernet-Precision-Tower-7910:05048] [ 2] /usr/lib/libmpi.so.1(+0x10362d) [0x7fd91edd162d]
[cernet-Precision-Tower-7910:05048] [ 3] /usr/lib/libmpi.so.1(ompi_datatype_sndrcv+0x502) [0x7fd91ed32392]
[cernet-Precision-Tower-7910:05048] [ 4] /usr/lib/libmpi.so.1(PMPI_Alltoall+0x154) [0x7fd91ed33ad4]
[cernet-Precision-Tower-7910:05048] [ 5] /home/cernet/software/anaconda2/lib/python2.7/site-packages/mpi4py/MPI.so(+0xe5815) [0x7fd91f134815]
[cernet-Precision-Tower-7910:05048] [ 6] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x7b35) [0x7fd924e431e5]
[cernet-Precision-Tower-7910:05048] [ 7] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8c95) [0x7fd924e44345]
[cernet-Precision-Tower-7910:05048] [ 8] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8c95) [0x7fd924e44345]
[cernet-Precision-Tower-7910:05048] [ 9] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7fd924e44c3e]
[cernet-Precision-Tower-7910:05048] [10] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(+0x79a61) [0x7fd924dbfa61]
[cernet-Precision-Tower-7910:05048] [11] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53) [0x7fd924d8fe93]
[cernet-Precision-Tower-7910:05048] [12] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x61d6) [0x7fd924e41886]
[cernet-Precision-Tower-7910:05048] [13] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8c95) [0x7fd924e44345]
[cernet-Precision-Tower-7910:05048] [14] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8c95) [0x7fd924e44345]
[cernet-Precision-Tower-7910:05048] [15] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7fd924e44c3e]
[cernet-Precision-Tower-7910:05048] [16] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47) [0x7fd924e441f7]
[cernet-Precision-Tower-7910:05048] [17] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7fd924e44c3e]
[cernet-Precision-Tower-7910:05048] [18] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47) [0x7fd924e441f7]
[cernet-Precision-Tower-7910:05048] [19] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e) [0x7fd924e44c3e]
[cernet-Precision-Tower-7910:05048] [20] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x7fd924e44d52]
[cernet-Precision-Tower-7910:05048] [21] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyRun_FileExFlags+0xb0) [0x7fd924e65450]
[cernet-Precision-Tower-7910:05048] [22] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xef) [0x7fd924e6562f]
[cernet-Precision-Tower-7910:05048] [23] /home/cernet/software/anaconda2/bin/../lib/libpython2.7.so.1.0(Py_Main+0xca4) [0x7fd924e7afd4]
[cernet-Precision-Tower-7910:05048] [24] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fd924073f45]
[cernet-Precision-Tower-7910:05048] [25] python() [0x400729]
[cernet-Precision-Tower-7910:05048] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 5048 on node cernet-Precision-Tower-7910 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

On CPU:

$ mpiexec -n 4 python examples/mnist/train_mnist.py
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
==========================================
Num process (COMM_WORLD): 4
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           0.296499    0.109997              0.913          0.9674                    45.1247       
2           0.091719    0.0788497             0.971533       0.9746                    88.971        
3           0.0552213   0.0735168             0.983          0.9782                    136.086       
4           0.0352824   0.0759333             0.988467       0.9765                    181.252       
5           0.0237484   0.0657624             0.993267       0.9807                    225.939       
6           0.0156474   0.0716794             0.995533       0.9794                    270.866       
7           0.0151741   0.0710823             0.995067       0.9802                    319.324       
8           0.0100485   0.0851148             0.996267       0.9769                    369.219       
9           0.0100774   0.0738104             0.996133       0.9814                    417.197       
10          0.0104382   0.0925425             0.9968         0.9781                    462.58        
11          0.0104268   0.0761318             0.996467       0.9804                    508.408       
12          0.00994524  0.0808251             0.996267       0.9811                    552.214       
13          0.00767433  0.0926397             0.9974         0.9774                    597.125       
14          0.011684    0.0878579             0.9964         0.981                     641.157       
15          0.00823742  0.0796076             0.997133       0.9831                    684.464       
16          0.0112519   0.0868928             0.9968         0.9808                    727.345       
17          0.00445278  0.0724809             0.998533       0.9836                    770.573       
18          0.000801414  0.080557              0.999933       0.9846                    815.599       
19          0.00124838  0.116107              0.9998         0.9779                    859.786       
20          0.00412491  0.111526              0.998467       0.978                     904.204 
Hakuyume commented 6 years ago

@1292765944 Could you report the error of python3 to ChainerMN?

Hakuyume commented 6 years ago

Perhaps this document will help you: http://chainermn.readthedocs.io/en/latest/installation/troubleshooting.html

1292765944 commented 6 years ago

@Hakuyume Thanks! My problem is solved. OpenMPI was installed incorrectly.
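
The thread does not say what exactly was wrong with the OpenMPI installation. Since the crash appeared only on GPU, one check that may be worth running is whether the installed Open MPI reports CUDA support through `ompi_info --parsable --all`. The sketch below assumes `ompi_info` is on `PATH` and that the build reports the `mpi_built_with_cuda_support` parameter; the function names are hypothetical. A build that does not report the parameter yields `None`, which is not proof of anything either way.

```python
import subprocess


def parse_cuda_support(ompi_info_output):
    """Return True/False from ompi_info's parsable output, or None if absent."""
    for line in ompi_info_output.splitlines():
        if "mpi_built_with_cuda_support:value:" in line:
            return line.rsplit(":", 1)[1].strip() == "true"
    return None


def openmpi_cuda_support():
    """Run ompi_info and report CUDA support; None if it cannot be determined."""
    try:
        out = subprocess.check_output(["ompi_info", "--parsable", "--all"])
    except (OSError, subprocess.CalledProcessError):
        return None  # ompi_info is missing or exited with an error
    return parse_cuda_support(out.decode("utf-8", "replace"))


if __name__ == "__main__":
    print("CUDA support reported by ompi_info:", openmpi_cuda_support())
```

Parsing is kept separate from the subprocess call so the parser can be exercised on saved output from another machine.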