facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

"python" received signal SIGSEGV, Segmentation fault in _int_malloc at malloc.c #352

Open blateyang opened 6 years ago

blateyang commented 6 years ago

I am trying to train the R-50-FPN model from Detectron on my own dataset, but training always crashes with a SIGSEGV. I have searched Google but could not find a useful solution. Can anyone experienced in analysing segmentation faults like this help me? Thanks in advance. Below is some debug information collected with python-dbg; I don't know how to analyse it.

json_stats: {"accuracy_cls": 0.974609, "eta": "1:16:04", "iter": 2500, "loss": 0.114559, "loss_bbox": 0.053842, "loss_cls": 0.051955, "loss_rpn_bbox_fpn2": 0.000000, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.011600, "loss_rpn_bbox_fpn6": 0.000000, "loss_rpn_cls_fpn2": 0.000000, "loss_rpn_cls_fpn3": 0.000011, "loss_rpn_cls_fpn4": 0.000102, "loss_rpn_cls_fpn5": 0.003746, "loss_rpn_cls_fpn6": 0.000000, "lr": 0.002500, "mb_qsize": 64, "mem": 3274, "time": 0.115558}

Thread 29 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffedb7fe700 (LWP 25362)]
0x00007ffff7872545 in _int_malloc (av=av@entry=0x7ffebc000020, 
    bytes=bytes@entry=768) at malloc.c:3727
3727    malloc.c: No such file or directory.
(gdb) where
#0  0x00007ffff7872545 in _int_malloc (av=av@entry=0x7ffebc000020, 
    bytes=bytes@entry=768) at malloc.c:3727
#1  0x00007ffff7874184 in __GI___libc_malloc (bytes=768) at malloc.c:2913
#2  0x00007fffeee70c51 in PyDataMem_NEW (size=768)
    at numpy/core/src/multiarray/alloc.c:201
#3  _npy_alloc_cache (alloc=0x7fffeee70f70 <PyDataMem_NEW>, 
    cache=0x7fffef242800 <datacache>, msz=1024, esz=1, nelem=nelem@entry=768)
    at numpy/core/src/multiarray/alloc.c:66
#4  npy_alloc_cache (sz=sz@entry=768) at numpy/core/src/multiarray/alloc.c:94
#5  0x00007fffeeed39aa in PyArray_NewFromDescr_int (
    subtype=0x7fffef22bf40 <PyArray_Type>, descr=0x7fffef22ce60 <FLOAT_Descr>, 
    nd=1, dims=0x7ffedb7fbdf0, strides=<optimized out>, data=0x0, flags=0, 
    obj=0x0, zeroed=0, allow_emptystring=0)
    at numpy/core/src/multiarray/ctors.c:1062
#6  0x00007fffeeed3a6a in PyArray_NewFromDescr (
    subtype=subtype@entry=0x7fffef22bf40 <PyArray_Type>, 
    descr=<optimized out>, nd=nd@entry=1, dims=dims@entry=0x7ffedb7fbdf0, 
    strides=<optimized out>, data=data@entry=0x0, flags=flags@entry=0, 
    obj=obj@entry=0x0) at numpy/core/src/multiarray/ctors.c:1146
#7  0x00007fffeef8195d in npyiter_new_temp_array (
    iter=iter@entry=0x7ffebc0a91b0, subtype=0x7fffef22bf40 <PyArray_Type>, 
    flags=flags@entry=11880, op_itflags=op_itflags@entry=0x7ffebc0a926c, 
    op_ndim=op_ndim@entry=1, op_dtype=<optimized out>, op_axes=0x0, 
    shape=0x7ffedb7fbdf0) at numpy/core/src/multiarray/nditer_constr.c:2669
#8  0x00007fffeef822b5 in npyiter_allocate_arrays (
    iter=iter@entry=0x7ffebc0a91b0, flags=flags@entry=11880, 
    op_dtype=op_dtype@entry=0x7ffebc0a91f8, 
    subtype=subtype@entry=0x7fffef22bf40 <PyArray_Type>, 
    op_flags=op_flags@entry=0x7ffedb7fc330, 
    op_itflags=op_itflags@entry=0x7ffebc0a9268, op_axes=op_axes@entry=0x0)
    at numpy/core/src/multiarray/nditer_constr.c:2823
#9  0x00007fffeef83008 in NpyIter_AdvancedNew (nop=<optimized out>, 
    op_in=<optimized out>, flags=11880, order=<optimized out>, 
    casting=NPY_UNSAFE_CASTING, op_flags=0x7ffedb7fc330, 
    op_request_dtypes=0x7ffedb7fc3b0, oa_ndim=-1, op_axes=0x0, itershape=0x0, 
    buffersize=<optimized out>)
    at numpy/core/src/multiarray/nditer_constr.c:404
#10 0x00007fffdaa67816 in iterator_loop (innerloopdata=0x0, 
    innerloop=0x7fffdaa4e420 <FLOAT_add>, arr_prep_args=0x0, 
    arr_prep=0x7ffedb7fc4b0, buffersize=8192, order=NPY_KEEPORDER, 
    dtype=0x7ffedb7fc3b0, op=<optimized out>, ufunc=0xa55290)
    at numpy/core/src/umath/ufunc_object.c:1247
#11 execute_legacy_ufunc_loop (arr_prep_args=0x0, arr_prep=0x7ffedb7fc4b0, 
    buffersize=8192, order=NPY_KEEPORDER, dtypes=0x7ffedb7fc3b0, 
    op=<optimized out>, trivial_loop_ok=<optimized out>, ufunc=0xa55290)
    at numpy/core/src/umath/ufunc_object.c:1485
#12 PyUFunc_GenericFunction (ufunc=ufunc@entry=0xa55290, args=args@entry=
    (<numpy.ndarray at remote 0x7fff8db77030>, <numpy.ndarray at remote 0x7fff8db77bc0>), kwds=kwds@entry=0x0, op=op@entry=0x7ffedb7fc830)
    at numpy/core/src/umath/ufunc_object.c:2495
#13 0x00007fffdaa68757 in ufunc_generic_call (ufunc=ufunc@entry=0xa55290, 
    args=args@entry=(<numpy.ndarray at remote 0x7fff8db77030>, <numpy.ndarray at remote 0x7fff8db77bc0>), kwds=kwds@entry=0x0)
    at numpy/core/src/umath/ufunc_object.c:4137
#14 0x00000000004ae223 in PyObject_Call (kw=0x0, 
    arg=(<numpy.ndarray at remote 0x7fff8db77030>, <numpy.ndarray at remote 0x7fff8db77bc0>), func=<numpy.ufunc at remote 0xa55290>)
    at ../Objects/abstract.c:2546
#15 PyObject_CallFunctionObjArgs () at ../Objects/abstract.c:2773
#16 0x00007fffeef8b1cb in PyArray_GenericBinaryFunction (op=<optimized out>, 
    m2=<numpy.ndarray at remote 0x7fff8db77bc0>, m1=0x7fff8db77030)
    at numpy/core/src/multiarray/number.c:269
#17 array_add (m1=0x7fff8db77030, m2=<numpy.ndarray at remote 0x7fff8db77bc0>)
    at numpy/core/src/multiarray/number.c:312
#18 0x0000000000509d5c in binary_op1.lto_priv.1984 (op_slot=0, 
    w=<numpy.ndarray at remote 0x7fff8db77bc0>, 
    v=<numpy.ndarray at remote 0x7fff8db77030>) at ../Objects/abstract.c:945
#19 PyNumber_Add () at ../Objects/abstract.c:1185
#20 0x00000000004c1c39 in PyEval_EvalFrameEx () at ../Python/ceval.c:1484
#21 0x00000000004b9ab6 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#22 0x00000000004c16e7 in fast_function (nk=<optimized out>, 
    na=<optimized out>, n=<optimized out>, pp_stack=0x7ffedb7fd090, 
    func=<function at remote 0x7fffaa7f0e60>) at ../Python/ceval.c:4445
#23 call_function (oparg=<optimized out>, pp_stack=0x7ffedb7fd090)
    at ../Python/ceval.c:4370
#24 PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#25 0x00000000004c136f in fast_function (nk=<optimized out>, 
    na=<optimized out>, n=5, pp_stack=0x7ffedb7fd1b0, 
    func=<function at remote 0x7fffa609e8c0>) at ../Python/ceval.c:4435
#26 call_function (oparg=<optimized out>, pp_stack=0x7ffedb7fd1b0)
    at ../Python/ceval.c:4370
#27 PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#28 0x00000000004b9ab6 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#29 0x00000000004d54b9 in function_call.lto_priv ()
    at ../Objects/funcobject.c:523
#30 0x00000000004eebee in PyObject_Call (kw=0x0, arg=
    (<GenerateProposalsOp(_feat_stride=<float at remote 0x260ff90>, _num_anchors=3, _train=True, _anchors=<numpy.ndarray at remote 0x7fff9fae4b20>) at remote 0x7fff9fb80f10>, [<caffe2.python.caffe2_pybind11_state_gpu.TensorCPU at remote 0x7fff844bace0>, <caffe2.python.caffe2_pybind11_state_gpu.TensorCPU at remote 0x7fff844ba998>, <caffe2.python.caffe2_pybind11_state_gpu.TensorCPU at remote 0x7fff844ba880>], [<caffe2.python.caffe2_pybind11_state_gpu.TensorCPU at remote 0x7fff844ba260>, <caffe2.python.caffe2_pybind11_state_gpu.TensorCPU at remote 0x7fff844baed8>]), func=<function at remote 0x7fffa609e848>)
    at ../Objects/abstract.c:2546
#31 instancemethod_call.lto_priv () at ../Objects/classobject.c:2602
#32 0x00000000004a577e in PyObject_Call () at ../Objects/abstract.c:2546
#33 0x00000000004c5e10 in PyEval_CallObjectWithKeywords ()
    at ../Python/ceval.c:4219
#34 0x00007fffec2d2fb7 in pybind11::object pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, std::vector<pybind11::object, std::allocator<pybind11::object> >&, std::vector<pybind11::object, std::allocator<pybind11::object> >&>(std::vector<pybind11::object, std::allocator<pybind11::object> >&, std::vector<pybind11::object, std::allocator<pybind11::object> >&) const () from /usr/local/caffe2/python/caffe2_pybind11_state_gpu.so
#35 0x00007fffec2d4e01 in caffe2::python::PythonOpBase<caffe2::CPUContext, false>::RunOnDevice() () from /usr/local/caffe2/python/caffe2_pybind11_state_gpu.so
#36 0x00007fffec29a55b in caffe2::Operator<caffe2::CPUContext>::Run(int) ()
   from /usr/local/caffe2/python/caffe2_pybind11_state_gpu.so
#37 0x00007fffec2e70b8 in caffe2::GPUFallbackOp<caffe2::python::PythonOp<caffe2::CPUContext, false>, caffe2::SkipIndices<> >::RunOnDevice() ()
   from /usr/local/caffe2/python/caffe2_pybind11_state_gpu.so
#38 0x00007fffec2e3675 in caffe2::Operator<caffe2::CUDAContext>::Run(int) ()
   from /usr/local/caffe2/python/caffe2_pybind11_state_gpu.so
#39 0x00007fffcdf0d98a in caffe2::DAGNet::RunAt(int, std::vector<int, std::allocator<int> > const&) () from /usr/local/lib/libcaffe2.so
#40 0x00007fffcdf0c87c in caffe2::DAGNetBase::WorkerFunction() ()
   from /usr/local/lib/libcaffe2.so
#41 0x00007ffff133ac80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#42 0x00007ffff7bc16ba in start_thread (arg=0x7ffedb7fe700)
    at pthread_create.c:333
#43 0x00007ffff78f741d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
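
The backtrace above shows only C-level frames: the crash happens in a Caffe2 worker thread (Thread 29) that is running the Python `GenerateProposalsOp`, inside a NumPy addition. To see which Python line each thread was executing at the time of the fault, one option is the `faulthandler` module. The following is a minimal sketch, not part of Detectron. It assumes Python 3, where `faulthandler` is in the standard library; on Python 2.7, as used in this report, it would have to be installed separately as the `faulthandler` backport package. The helper name `enable_crash_tracebacks` is hypothetical.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump Python tracebacks for all threads when a fatal signal arrives.

    `stream` must be a real file with a file descriptor; it defaults to
    the current sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    # all_threads=True matters here: the reported crash is in a worker
    # thread, not in the main thread.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # Start training after this point, for example by calling this helper
    # at the top of train_net.py (the placement is an assumption).
```

With this enabled, a SIGSEGV should print one Python traceback per thread to stderr before the process dies. A crash inside `_int_malloc` often means the heap was corrupted earlier, so the Python line reported may be where the damage was detected rather than where it was caused.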

Detailed steps to reproduce

python train_net.py

System information

Operating system: Ubuntu 16.04
Compiler version: gcc 5.4.0
CUDA version: 8.0
cuDNN version: 5.0
NVIDIA driver version: 384.111
GPU models (for all devices if they are not all the same):
PYTHONPATH environment variable: /usr/local /home/ygj/caffe2/build
python --version output: Python 2.7.12

nathlacroix commented 6 years ago

Any update on this? I have the same problem.