chainer / chainercv

ChainerCV: a Library for Deep Learning in Computer Vision
MIT License

add MaskRCNN #781

Closed yuyu2172 closed 5 years ago

yuyu2172 commented 5 years ago

~Merge after #778~

yuyu2172 commented 5 years ago

Weird segfault during training. Experimented with n_gpu=1, batchsize=1. For debugging purposes, I changed MultithreadIterator to SerialIterator.

 *** Process received signal ***
 Signal: Segmentation fault (11)
 Signal code: Address not mapped (1)
 Failing at address: 0x67ee6000
 [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fbb433f5390]                                                   
 [ 1] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7971a0)[0x7fb9d82c31a0]             
 [ 2] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x61ce58)[0x7fb9d8148e58]             
 [ 3] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7cf482)[0x7fb9d82fb482]             
 [ 4] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7d0879)[0x7fb9d82fc879]             
 [ 5] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x2ef784)[0x7fb9d7e1b784]             
 [ 6] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x13c)[0x7fbb43733d6c]                              
 [ 7] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallKeywords+0x4d)[0x7fbb43733bfd]                           
 [ 8] /usr/local/lib/libpython3.6m.so.1.0(+0x18b9fb)[0x7fbb4378c9fb]                                                    
 [ 9] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fbb43785a83]                              
 [10] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fbb4378ce0a]                                                    
 [11] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]                                                    
 [12] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fbb43784a6f]                               
 [13] /usr/local/lib/libpython3.6m.so.1.0(+0x18c0f3)[0x7fbb4378d0f3]                                                    
 [14] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]                                                    
 [15] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fbb43785a83]                              
 [16] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fbb4378ce0a]                                                    
 [17] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]                                                    
 [18] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fbb43784a6f]                               
 [19] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fbb4378eacb]                               
 [20] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fbb436fa86e]                                 
 [21] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fbb436fb1f1]                                  
 [22] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]                                           
 [23] /usr/local/lib/libpython3.6m.so.1.0(+0x147506)[0x7fbb43748506]                                                    
 [24] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]                                           
 [25] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1c69)[0x7fbb43786389]                              
 [26] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fbb4378eacb]                               
 [27] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fbb436fa86e]                                 
 [28] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fbb436fb1f1]                                  
 [29] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]                                           
 *** End of error message ***                                                                                           

other configs

yuyu2172 commented 5 years ago

$ mpiexec -n 8 python3 train_multi.py --batchsize 8
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'coco1-worker-0', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
[coco1-worker-0:115997] 7 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
[coco1-worker-0:115997] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
0 (1, 3, 1344, 576)
6 (1, 3, 800, 1216)
4 (1, 3, 1088, 800)
7 (1, 3, 800, 1216)
15 (1, 3, 800, 1088)
19 (1, 3, 1088, 800)
19 (1, 3, 704, 1344)
41 (1, 3, 800, 1088)
[coco1-worker-0:116005] Failed to cuMemcpy GPU memory, rc=-1
--------------------------------------------------------------------------
The call to cuMemcpyAsync failed. This is a unrecoverable error and will
cause the program to abort.
  cuMemcpyAsync(0x7f5f2f22ac00, 0x7f5df4dfb000, 8192) returned value 1
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
Exception in main training loop: MPI_ERR_TRUNCATE: message truncated
Traceback (most recent call last):
  File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 316, in run
    update()
  File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 170, in update
    self.update_core()
  File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 182, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/home/yuyu2172/chainer/chainermn/optimizers.py", line 28, in update
    self.communicator.bcast_data(target)
  File "/home/yuyu2172/chainer/chainermn/communicators/mpi_communicator_base.py", line 615, in bcast_data
    self.mpi_comm.Bcast(buf)
  File "mpi4py/MPI/Comm.pyx", line 579, in mpi4py.MPI.Comm.Bcast
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_multi.py", line 249, in <module>
    main()
  File "train_multi.py", line 245, in main
    trainer.run()
  File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 349, in run
    six.reraise(*exc_info)
  File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 316, in run
    update()
  File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 170, in update
    self.update_core()
  File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 182, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/home/yuyu2172/chainer/chainermn/optimizers.py", line 28, in update
    self.communicator.bcast_data(target)
  File "/home/yuyu2172/chainer/chainermn/communicators/mpi_communicator_base.py", line 615, in bcast_data
    self.mpi_comm.Bcast(buf)
  File "mpi4py/MPI/Comm.pyx", line 579, in mpi4py.MPI.Comm.Bcast
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated

yuyu2172 commented 5 years ago

This seems to work with the current master of both Chainer and CuPy.
Chainer: https://github.com/chainer/chainer/commit/afe903389d822583a5355e9d46e6766d048ebeb5
CuPy: https://github.com/cupy/cupy/commit/155228fdf6bb148a4c5537dfe08fbfef5329e416

EDIT:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x6af2d000
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fc50f859390]
[ 1] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7971a0)[0x7fc422b991a0]
[ 2] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x61ce58)[0x7fc422a1ee58]
[ 3] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7cf482)[0x7fc422bd1482]
[ 4] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7d0879)[0x7fc422bd2879]
[ 5] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x2ef784)[0x7fc4226f1784]
[ 6] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x13c)[0x7fc50fb97d6c]
[ 7] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallKeywords+0x4d)[0x7fc50fb97bfd]
[ 8] /usr/local/lib/libpython3.6m.so.1.0(+0x18b9fb)[0x7fc50fbf09fb]
[ 9] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fc50fbe9a83]
[10] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fc50fbf0e0a]
[11] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[12] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fc50fbe8a6f]
[13] /usr/local/lib/libpython3.6m.so.1.0(+0x18c0f3)[0x7fc50fbf10f3]
[14] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[15] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fc50fbe9a83]
[16] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fc50fbf0e0a]
[17] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[18] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fc50fbe8a6f]
[19] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fc50fbf2acb]
[20] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fc50fb5e86e]
[21] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fc50fb5f1f1]
[22] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
[23] /usr/local/lib/libpython3.6m.so.1.0(+0x147506)[0x7fc50fbac506]
[24] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
[25] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1c69)[0x7fc50fbea389]
[26] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fc50fbf2acb]
[27] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fc50fb5e86e]
[28] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fc50fb5f1f1]
[29] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
*** End of error message ***

EDIT: Segfault problem seems to have stopped after #798

yuyu2172 commented 5 years ago

To speed up the training process

What I did

What I did not do

yuyu2172 commented 5 years ago

Performance of the trained model at this point (it should be 33.9 mmAP):

mmAP (all): 0.31496558    
mmAP (large): 0.46482706  
mmAP (medium): 0.33771247 
mmAP (small): 0.14785959  

~With the network trained previously~ I forgot how I got this weight...

mmAP (all): 0.34748608
mmAP (large): 0.5130962
mmAP (medium): 0.37304157
mmAP (small): 0.16328236

EDIT: I suspected that gt_bbox should be calculated by mask_to_bbox. It turns out that this is irrelevant to the performance drop. https://github.com/chainer/chainercv/blob/7e707d8ab247a03de433135c59ef2bf3fd9de35b/chainercv/links/model/mask_rcnn/mask_head.py#L258

After changing the script to use mask_to_bbox:

mmAP (all): 0.31432453
mmAP (large): 0.46316248
mmAP (medium): 0.33836472
mmAP (small): 0.14966531
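
For reference, deriving a tight box from an instance mask can be sketched as below. This is a NumPy-only illustration, not the ChainerCV code: the function name `tight_bbox_from_mask` is hypothetical, and the `(y_min, x_min, y_max, x_max)` order with exclusive max edges is an assumption.

```python
import numpy as np


def tight_bbox_from_mask(mask):
    """Return (R, 4) float32 boxes as (y_min, x_min, y_max, x_max).

    mask: (R, H, W) boolean array. Assumption: max edges are exclusive,
    and a mask with no foreground pixels yields an all-zero box.
    """
    bbox = np.zeros((len(mask), 4), dtype=np.float32)
    for i, m in enumerate(mask):
        ys, xs = np.nonzero(m)
        if len(ys) == 0:
            continue  # no foreground pixels: leave the box at zero
        bbox[i] = (ys.min(), xs.min(), ys.max() + 1, xs.max() + 1)
    return bbox


mask = np.zeros((1, 6, 8), dtype=bool)
mask[0, 2:4, 3:7] = True  # foreground covers rows 2-3 and columns 3-6
print(tight_bbox_from_mask(mask))
```

A box computed this way can be tighter than an annotated box, which is why it seemed worth ruling out as the cause of the drop.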

EDIT: I tried normalizing mask_loss over all RoIs used by one GPU.

mmAP (all): 0.31567773
mmAP (large): 0.4621393
mmAP (medium): 0.33851567
mmAP (small): 0.1468797
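
A rough sketch of what "normalizing over all RoIs used by one GPU" could mean, contrasted with averaging per image first. The thread does not say what the previous normalization was, so the per-image variant is only an assumed point of comparison, and all function names here are hypothetical.

```python
import numpy as np


def bce_per_roi(logit, target):
    """Sigmoid cross-entropy averaged over the pixels of each RoI.

    logit and target have shape (R, M, M); the result has shape (R,).
    """
    # Numerically stable form: max(x, 0) - x * t + log(1 + exp(-|x|))
    loss = (np.maximum(logit, 0) - logit * target
            + np.log1p(np.exp(-np.abs(logit))))
    return loss.reshape(len(loss), -1).mean(axis=1)


def mask_loss_per_device(logits, targets):
    """Average over every RoI on the device, regardless of image."""
    per_roi = np.concatenate(
        [bce_per_roi(x, t) for x, t in zip(logits, targets)])
    return per_roi.mean()


def mask_loss_per_image(logits, targets):
    """Average within each image first, then average over images."""
    return np.mean(
        [bce_per_roi(x, t).mean() for x, t in zip(logits, targets)])


# Two images on one device with 1 and 3 sampled RoIs (hypothetical sizes).
logits = [np.zeros((1, 4, 4)), np.full((3, 4, 4), -50.0)]
targets = [np.zeros((1, 4, 4)), np.zeros((3, 4, 4))]
print(mask_loss_per_device(logits, targets))
print(mask_loss_per_image(logits, targets))
```

The two values differ whenever images contribute different numbers of RoIs: per-device averaging weights every RoI equally, while per-image averaging weights every image equally.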

yuyu2172 commented 5 years ago

Finally reproduced. It is critical to implement mask_to_segm correctly. 2d44d66

mmAP (all): 0.34442434     
mmAP (large): 0.50526273   
mmAP (medium): 0.36985907  
mmAP (small): 0.16083415   
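
To illustrate why this step is easy to get wrong, here is a minimal sketch of pasting one fixed-size RoI mask into an image-sized canvas. It is not the code from the commit above: `paste_roi_mask` is a hypothetical name, the box is assumed to be `(y_min, x_min, y_max, x_max)` with exclusive max edges, and nearest-neighbor sampling stands in for whatever interpolation the real implementation uses.

```python
import numpy as np


def paste_roi_mask(roi_mask, bbox, size, thresh=0.5):
    """Paste one (M, M) mask probability map into an (H, W) boolean canvas."""
    H, W = size
    segm = np.zeros((H, W), dtype=bool)
    y_min, x_min, y_max, x_max = [int(round(float(v))) for v in bbox]
    h, w = y_max - y_min, x_max - x_min
    if h <= 0 or w <= 0:
        return segm  # degenerate box: nothing to paste
    m_y, m_x = roi_mask.shape
    # Map each output pixel center back to a cell of the RoI mask.
    ys = np.minimum(((np.arange(h) + 0.5) * m_y / h).astype(int), m_y - 1)
    xs = np.minimum(((np.arange(w) + 0.5) * m_x / w).astype(int), m_x - 1)
    resized = roi_mask[ys[:, None], xs[None, :]] >= thresh
    # Clip the box to the image before writing.
    y0, x0 = max(y_min, 0), max(x_min, 0)
    y1, x1 = min(y_max, H), min(x_max, W)
    if y1 <= y0 or x1 <= x0:
        return segm  # box lies entirely outside the image
    segm[y0:y1, x0:x1] = resized[y0 - y_min:y1 - y_min, x0 - x_min:x1 - x_min]
    return segm


roi_mask = np.array([[1.0, 0.0], [0.0, 1.0]])
segm = paste_roi_mask(roi_mask, (1, 1, 5, 5), (6, 6))
print(segm.astype(int))
```

Off-by-one errors in the box size or in the sampling offset shift or shrink every pasted mask, which can plausibly cost a few points of mask mmAP.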

Note

Performance of other implementations:
Detectron: 0.339
maskrcnn-benchmark: 0.342

yuyu2172 commented 5 years ago

Performance after removing the +1 from bbox (https://github.com/yuyu2172/chainercv/commit/6513e2480e98ce3773ec5566c9923be51f75bb23). The weights were trained with the +1 convention.

mmAP (all): 0.33432382    
mmAP (large): 0.5024262   
mmAP (medium): 0.36352885 
mmAP (small): 0.14599578  

Performance after removing the +1 from bbox and training without the +1 convention.

mmAP (all): 0.348018     
mmAP (large): 0.5091206  
mmAP (medium): 0.37551004
mmAP (small): 0.16778706 
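
A small sketch of how the two conventions differ, assuming "+1 convention" means that max coordinates are treated as inclusive pixel indices, so a side length is `max - min + 1`. The `iou` helper is hypothetical and only illustrates the arithmetic.

```python
def iou(a, b, plus_one):
    """IoU of two (y_min, x_min, y_max, x_max) boxes under either convention."""
    offset = 1.0 if plus_one else 0.0
    # Overlap along each axis, clamped at zero for disjoint boxes.
    ih = max(min(a[2], b[2]) - max(a[0], b[0]) + offset, 0.0)
    iw = max(min(a[3], b[3]) - max(a[1], b[1]) + offset, 0.0)
    inter = ih * iw
    area_a = (a[2] - a[0] + offset) * (a[3] - a[1] + offset)
    area_b = (b[2] - b[0] + offset) * (b[3] - b[1] + offset)
    return inter / (area_a + area_b - inter)


a = (0.0, 0.0, 9.0, 9.0)
b = (5.0, 5.0, 14.0, 14.0)
print(iou(a, b, plus_one=False))  # 16 / 146
print(iou(a, b, plus_one=True))   # 25 / 175
```

The same pair of boxes gets a different overlap under each convention, so weights trained with one and evaluated with the other see slightly shifted targets. That is consistent with the drop reported above for the mismatched case.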
yuyu2172 commented 5 years ago

After https://github.com/chainer/chainercv/pull/781/commits/7e1e3ecee5930c527c7f1c89cdd4719d826095e5 and https://github.com/chainer/chainercv/pull/798

mmAP (all): 0.33900946  
mmAP (large): 0.5053451 
mmAP (medium): 0.3659876
mmAP (small): 0.1571092 

mmAP (all): 0.34121135
mmAP (large): 0.50168735
mmAP (medium): 0.37085357
mmAP (small): 0.15417978

mmAP (all): 0.34130338
mmAP (large): 0.50934994
mmAP (medium): 0.36974284
mmAP (small): 0.152321

mmAP (all): 0.3393448
mmAP (large): 0.50526506
mmAP (medium): 0.36583152
mmAP (small): 0.15407379

mmAP (all): 0.3411054
mmAP (large): 0.5059621
mmAP (medium): 0.3677923
mmAP (small): 0.15057199

https://github.com/yuyu2172/chainercv/commit/743b555e834e228d4875655639f207cfed229043

mmAP (all): 0.3417589    
mmAP (large): 0.5081606  
mmAP (medium): 0.36930135
mmAP (small): 0.15626407

yuyu2172 commented 5 years ago

mmAP (all): 0.35994896
mmAP (large): 0.5378797
mmAP (medium): 0.39182574
mmAP (small): 0.16467538

yuyu2172 commented 5 years ago

Uploaded Weight scores

ResNet50 Mask

mmAP (all): 0.34175715    
mmAP (large): 0.50817233  
mmAP (medium): 0.36931357 
mmAP (small): 0.15625913  

ResNet50 Bbox

mmAP (all): 0.37964782     
mmAP (large): 0.4980176    
mmAP (medium): 0.4137187   
mmAP (small): 0.22214551   

ResNet101 Mask

mmAP (all): 0.35996467     
mmAP (large): 0.53794646   
mmAP (medium): 0.39182332  
mmAP (small): 0.16468713   

ResNet101 Bbox

mmAP (all): 0.40394193    
mmAP (large): 0.5254983   
mmAP (medium): 0.4450884  
mmAP (small): 0.23563214

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 2e3fa01:

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 900cdf4:

yuyu2172 commented 5 years ago

@Hakuyume Please review

yuyu2172 commented 5 years ago

TODO:
mask_loss -> mask_head_loss
bbox_loss -> bbox_head_loss

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 5b01be4:

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 3f265b8:

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 9be25c7:

Hakuyume commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 171d745:

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit a2d5e6d (3ab0c69):

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 0820f5d (6ab0a1d):

yuyu2172 commented 5 years ago

  1. Define min_size and max_size in train_multi.py.
  2. Use only the cv2 backend in mask_utils.py.
    • Change the behavior of resize: make it fail when chainer.config.cv_resize_backend == 'cv2' and cv2 is not installed.
    • Force the cv2 backend in mask_utils.py.
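
The resize behavior in item 2 could look roughly like the sketch below. It is standalone and does not touch Chainer: `choose_resize_backend` is a hypothetical helper, its `requested` argument stands in for a config value such as chainer.config.cv_resize_backend, and the automatic fallback to PIL when nothing is requested is an assumption.

```python
import importlib.util


def choose_resize_backend(requested=None, cv2_available=None):
    """Pick 'cv2' or 'PIL' for resizing.

    requested: None (auto), 'cv2', or 'PIL'.
    cv2_available: override for testing; detected when left as None.
    """
    if cv2_available is None:
        cv2_available = importlib.util.find_spec('cv2') is not None
    if requested is None:
        # Assumed auto mode: prefer cv2, fall back to PIL.
        return 'cv2' if cv2_available else 'PIL'
    if requested == 'cv2':
        if not cv2_available:
            # Fail loudly instead of silently using another backend.
            raise ValueError(
                "backend 'cv2' was requested but cv2 is not installed")
        return 'cv2'
    if requested == 'PIL':
        return 'PIL'
    raise ValueError('unknown backend: {!r}'.format(requested))


print(choose_resize_backend(None, cv2_available=False))
```

Failing when cv2 is explicitly requested but missing avoids masks being resized by a different backend than the one the code was tuned with.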
yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 6acc48f (2975879):

Hakuyume commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 6acc48f (2975879):

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 58ef6c7 (15c5cdc):

yuyu2172 commented 5 years ago

pfnCI, test this please

pfn-ci-bot commented 5 years ago

Successfully created a job for commit 6c4dc80 (9fad50d):