Weird segfault during training...
Experimented with n_gpu=1, batchsize=1.
For debugging purposes, I changed MultithreadIterator to SerialIterator.
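Roughly, the swap amounts to the snippet below (a minimal sketch; `train_data` is a placeholder for the actual COCO dataset built in train_multi.py):

```python
from chainer import iterators

# Placeholder dataset; in train_multi.py this would be the COCO dataset object.
train_data = list(range(10))

# Original multi-threaded iterator:
# train_iter = iterators.MultithreadIterator(train_data, batch_size=1)

# Single-threaded iterator used while debugging, so a crash inside the
# dataset / cv2 code is easier to localize:
train_iter = iterators.SerialIterator(train_data, batch_size=1)
batch = train_iter.next()
```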
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x67ee6000
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fbb433f5390]
[ 1] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7971a0)[0x7fb9d82c31a0]
[ 2] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x61ce58)[0x7fb9d8148e58]
[ 3] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7cf482)[0x7fb9d82fb482]
[ 4] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7d0879)[0x7fb9d82fc879]
[ 5] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x2ef784)[0x7fb9d7e1b784]
[ 6] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x13c)[0x7fbb43733d6c]
[ 7] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallKeywords+0x4d)[0x7fbb43733bfd]
[ 8] /usr/local/lib/libpython3.6m.so.1.0(+0x18b9fb)[0x7fbb4378c9fb]
[ 9] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fbb43785a83]
[10] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fbb4378ce0a]
[11] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]
[12] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fbb43784a6f]
[13] /usr/local/lib/libpython3.6m.so.1.0(+0x18c0f3)[0x7fbb4378d0f3]
[14] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]
[15] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fbb43785a83]
[16] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fbb4378ce0a]
[17] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fbb4378cadb]
[18] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fbb43784a6f]
[19] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fbb4378eacb]
[20] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fbb436fa86e]
[21] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fbb436fb1f1]
[22] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]
[23] /usr/local/lib/libpython3.6m.so.1.0(+0x147506)[0x7fbb43748506]
[24] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]
[25] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1c69)[0x7fbb43786389]
[26] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fbb4378eacb]
[27] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fbb436fa86e]
[28] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fbb436fb1f1]
[29] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fbb436fac17]
*** End of error message ***
$ mpiexec -n 8 python3 train_multi.py --batchsize 8
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'coco1-worker-0', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
[coco1-worker-0:115997] 7 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
[coco1-worker-0:115997] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
0 (1, 3, 1344, 576)
6 (1, 3, 800, 1216)
4 (1, 3, 1088, 800)
7 (1, 3, 800, 1216)
15 (1, 3, 800, 1088)
19 (1, 3, 1088, 800)
19 (1, 3, 704, 1344)
41 (1, 3, 800, 1088)
[coco1-worker-0:116005] Failed to cuMemcpy GPU memory, rc=-1
--------------------------------------------------------------------------
The call to cuMemcpyAsync failed. This is a unrecoverable error and will
cause the program to abort.
cuMemcpyAsync(0x7f5f2f22ac00, 0x7f5df4dfb000, 8192) returned value 1
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
Exception in main training loop: MPI_ERR_TRUNCATE: message truncated
Traceback (most recent call last):
File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 316, in run
update()
File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 170, in update
self.update_core()
File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 182, in update_core
optimizer.update(loss_func, *in_arrays)
File "/home/yuyu2172/chainer/chainermn/optimizers.py", line 28, in update
self.communicator.bcast_data(target)
File "/home/yuyu2172/chainer/chainermn/communicators/mpi_communicator_base.py", line 615, in bcast_data
self.mpi_comm.Bcast(buf)
File "mpi4py/MPI/Comm.pyx", line 579, in mpi4py.MPI.Comm.Bcast
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "train_multi.py", line 249, in <module>
main()
File "train_multi.py", line 245, in main
trainer.run()
File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 349, in run
six.reraise(*exc_info)
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/yuyu2172/chainer/chainer/training/trainer.py", line 316, in run
update()
File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 170, in update
self.update_core()
File "/home/yuyu2172/chainer/chainer/training/updaters/standard_updater.py", line 182, in update_core
optimizer.update(loss_func, *in_arrays)
File "/home/yuyu2172/chainer/chainermn/optimizers.py", line 28, in update
self.communicator.bcast_data(target)
File "/home/yuyu2172/chainer/chainermn/communicators/mpi_communicator_base.py", line 615, in bcast_data
self.mpi_comm.Bcast(buf)
File "mpi4py/MPI/Comm.pyx", line 579, in mpi4py.MPI.Comm.Bcast
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
Seemingly working with the current master:
Chainer: https://github.com/chainer/chainer/commit/afe903389d822583a5355e9d46e6766d048ebeb5
CuPy: https://github.com/cupy/cupy/commit/155228fdf6bb148a4c5537dfe08fbfef5329e416
EDIT:
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x6af2d000
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fc50f859390]
[ 1] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7971a0)[0x7fc422b991a0]
[ 2] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x61ce58)[0x7fc422a1ee58]
[ 3] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7cf482)[0x7fc422bd1482]
[ 4] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x7d0879)[0x7fc422bd2879]
[ 5] /usr/local/lib/python3.6/site-packages/cv2.cpython-36m-x86_64-linux-gnu.so(+0x2ef784)[0x7fc4226f1784]
[ 6] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x13c)[0x7fc50fb97d6c]
[ 7] /usr/local/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallKeywords+0x4d)[0x7fc50fb97bfd]
[ 8] /usr/local/lib/libpython3.6m.so.1.0(+0x18b9fb)[0x7fc50fbf09fb]
[ 9] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fc50fbe9a83]
[10] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fc50fbf0e0a]
[11] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[12] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fc50fbe8a6f]
[13] /usr/local/lib/libpython3.6m.so.1.0(+0x18c0f3)[0x7fc50fbf10f3]
[14] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[15] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1363)[0x7fc50fbe9a83]
[16] /usr/local/lib/libpython3.6m.so.1.0(+0x18be0a)[0x7fc50fbf0e0a]
[17] /usr/local/lib/libpython3.6m.so.1.0(+0x18badb)[0x7fc50fbf0adb]
[18] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x34f)[0x7fc50fbe8a6f]
[19] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fc50fbf2acb]
[20] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fc50fb5e86e]
[21] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fc50fb5f1f1]
[22] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
[23] /usr/local/lib/libpython3.6m.so.1.0(+0x147506)[0x7fc50fbac506]
[24] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
[25] /usr/local/lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x1c69)[0x7fc50fbea389]
[26] /usr/local/lib/libpython3.6m.so.1.0(_PyFunction_FastCallDict+0x41b)[0x7fc50fbf2acb]
[27] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x10e)[0x7fc50fb5e86e]
[28] /usr/local/lib/libpython3.6m.so.1.0(_PyObject_Call_Prepend+0x61)[0x7fc50fb5f1f1]
[29] /usr/local/lib/libpython3.6m.so.1.0(PyObject_Call+0x47)[0x7fc50fb5ec17]
*** End of error message ***
EDIT: Segfault problem seems to have stopped after #798
For speeding up the training process.
The performance of the trained model at this moment (should be 33.9 mmAP):
mmAP (all): 0.31496558
mmAP (large): 0.46482706
mmAP (medium): 0.33771247
mmAP (small): 0.14785959
~With the network trained previously~ I forgot how I got this weight...
mmAP (all): 0.34748608
mmAP (large): 0.5130962
mmAP (medium): 0.37304157
mmAP (small): 0.16328236
EDIT:
I suspected that gt_bbox should be calculated by mask_to_bbox.
It turns out that this is irrelevant to the performance drop.
https://github.com/chainer/chainercv/blob/7e707d8ab247a03de433135c59ef2bf3fd9de35b/chainercv/links/model/mask_rcnn/mask_head.py#L258
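For reference, a minimal example of what that change amounts to, using chainercv.utils.mask_to_bbox to derive boxes from instance masks (the toy mask here is made up for illustration):

```python
import numpy as np
from chainercv.utils import mask_to_bbox

# Toy instance mask: one instance on a 32x32 canvas.
mask = np.zeros((1, 32, 32), dtype=bool)
mask[0, 8:20, 4:16] = True

# Tight boxes around each instance in (y_min, x_min, y_max, x_max) order.
bbox = mask_to_bbox(mask)
print(bbox.shape)  # (1, 4)
```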
After changing the script to use mask_to_bbox:
mmAP (all): 0.31432453
mmAP (large): 0.46316248
mmAP (medium): 0.33836472
mmAP (small): 0.14966531
EDIT:
I tried normalizing mask_loss over all RoIs used by one GPU.
mmAP (all): 0.31567773
mmAP (large): 0.4621393
mmAP (medium): 0.33851567
mmAP (small): 0.1468797
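For clarity, a hedged sketch of what "normalizing mask_loss over all RoIs used by one GPU" means here; the shapes and names are illustrative, not the actual mask_head.py code:

```python
import chainer.functions as F

def mask_head_loss_sketch(segm_pred, gt_segm):
    # segm_pred: (R, S, S) float32 logits for the ground-truth class of each RoI
    # gt_segm:   (R, S, S) int32 targets in {0, 1}
    n_roi = segm_pred.shape[0]
    # Sum the per-pixel sigmoid cross entropy, then divide by the number of
    # RoIs sampled on this device rather than by a fixed constant.
    loss = F.sum(F.sigmoid_cross_entropy(segm_pred, gt_segm, reduce='no'))
    return loss / n_roi
```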
Finally reproduced. It is critical to implement mask_to_segm correctly. 2d44d66
mmAP (all): 0.34442434
mmAP (large): 0.50526273
mmAP (medium): 0.36985907
mmAP (small): 0.16083415
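The crop-and-resize step that has to be right is roughly the following (a hedged sketch, not the actual mask_to_segm in mask_utils.py; how the box is rounded to pixels is exactly where it is easy to get this wrong):

```python
import numpy as np
from chainercv.transforms import resize

def mask_to_segm_sketch(mask, bbox, segm_size):
    # mask: (R, H, W) bool instance masks
    # bbox: (R, 4) float32 RoIs in (y_min, x_min, y_max, x_max) order
    # Returns (R, segm_size, segm_size) float32 mask targets.
    segms = []
    for m, bb in zip(mask, bbox):
        # Quantize the box to pixels; the rounding convention matters.
        y_min, x_min, y_max, x_max = np.round(bb).astype(np.int32)
        cropped = m[y_min:y_max, x_min:x_max].astype(np.float32)
        # chainercv.transforms.resize expects a CHW image, hence the channel axis.
        segm = resize(cropped[None], (segm_size, segm_size))[0]
        segms.append(segm)
    return np.array(segms, dtype=np.float32)
```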
Performance of other implementations:
Detectron: 0.339
maskrcnn-benchmark: 0.342
Performance after deleting +1 to bbox: https://github.com/yuyu2172/chainercv/commit/6513e2480e98ce3773ec5566c9923be51f75bb23
The weight is trained with the +1 convention.
mmAP (all): 0.33432382
mmAP (large): 0.5024262
mmAP (medium): 0.36352885
mmAP (small): 0.14599578
Performance after deleting +1 to bbox and training without the +1 convention.
mmAP (all): 0.348018
mmAP (large): 0.5091206
mmAP (medium): 0.37551004
mmAP (small): 0.16778706
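For reference, the two conventions differ only in whether box extents are treated as pixel-inclusive (an illustration with made-up coordinates; the "+1" style is the Detectron legacy convention):

```python
# A box in (y_min, x_min, y_max, x_max) form, coordinates made up.
y_min, x_min, y_max, x_max = 10.0, 20.0, 50.0, 80.0

# "+1" convention (pixel-inclusive extents, Detectron style):
h_plus1 = y_max - y_min + 1  # 41.0
w_plus1 = x_max - x_min + 1  # 61.0

# Without "+1":
h = y_max - y_min  # 40.0
w = x_max - x_min  # 60.0
```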
After https://github.com/chainer/chainercv/pull/781/commits/7e1e3ecee5930c527c7f1c89cdd4719d826095e5 and https://github.com/chainer/chainercv/pull/798
mmAP (all): 0.33900946
mmAP (large): 0.5053451
mmAP (medium): 0.3659876
mmAP (small): 0.1571092
mmAP (all): 0.34121135
mmAP (large): 0.50168735
mmAP (medium): 0.37085357
mmAP (small): 0.15417978
mmAP (all): 0.34130338
mmAP (large): 0.50934994
mmAP (medium): 0.36974284
mmAP (small): 0.152321
mmAP (all): 0.3393448
mmAP (large): 0.50526506
mmAP (medium): 0.36583152
mmAP (small): 0.15407379
mmAP (all): 0.3411054
mmAP (large): 0.5059621
mmAP (medium): 0.3677923
mmAP (small): 0.15057199
https://github.com/yuyu2172/chainercv/commit/743b555e834e228d4875655639f207cfed229043
mmAP (all): 0.3417589
mmAP (large): 0.5081606
mmAP (medium): 0.36930135
mmAP (small): 0.15626407
mmAP (all): 0.35994896
mmAP (large): 0.5378797
mmAP (medium): 0.39182574
mmAP (small): 0.16467538
Uploaded weight scores:
ResNet50 Mask
mmAP (all): 0.34175715
mmAP (large): 0.50817233
mmAP (medium): 0.36931357
mmAP (small): 0.15625913
ResNet50 Bbox
mmAP (all): 0.37964782
mmAP (large): 0.4980176
mmAP (medium): 0.4137187
mmAP (small): 0.22214551
ResNet101 Mask
mmAP (all): 0.35996467
mmAP (large): 0.53794646
mmAP (medium): 0.39182332
mmAP (small): 0.16468713
ResNet101 Bbox
mmAP (all): 0.40394193
mmAP (large): 0.5254983
mmAP (medium): 0.4450884
mmAP (small): 0.23563214
pfnCI, test this please
Successfully created a job for commit 2e3fa01:
pfnCI, test this please
Successfully created a job for commit 900cdf4:
@Hakuyume Please review
TODO:
- mask_loss -> mask_head_loss
- bbox_loss -> bbox_head_loss
pfnCI, test this please
Successfully created a job for commit 5b01be4:
pfnCI, test this please
Successfully created a job for commit 3f265b8:
pfnCI, test this please
Successfully created a job for commit 9be25c7:
pfnCI, test this please
Successfully created a job for commit 171d745:
pfnCI, test this please
Successfully created a job for commit a2d5e6d (3ab0c69):
pfnCI, test this please
Successfully created a job for commit 0820f5d (6ab0a1d):
- min_size and max_size in train_multi.py.
- mask_utils.py resize. Make it break when chainer.config.cv_resize_backend == 'cv2' and cv2 is not installed (see the sketch after this list).
- cv2 backend in mask_utils.py
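A hedged sketch of the "make it break" item above; the helper name is made up, and getattr is used so the snippet runs even when the config entry is absent:

```python
import chainer

def check_cv2_backend():
    # Fail loudly instead of silently falling back to another backend when
    # the user asked for cv2 but OpenCV is not importable.
    if getattr(chainer.config, 'cv_resize_backend', None) == 'cv2':
        try:
            import cv2  # noqa: F401
        except ImportError:
            raise ImportError(
                "chainer.config.cv_resize_backend is 'cv2', "
                "but cv2 is not installed")
```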
pfnCI, test this please
Successfully created a job for commit 6acc48f (2975879):
pfnCI, test this please
Successfully created a job for commit 6acc48f (2975879):
pfnCI, test this please
Successfully created a job for commit 58ef6c7 (15c5cdc):
pfnCI, test this please
Successfully created a job for commit 6c4dc80 (9fad50d):
~Merge after #778~