facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

Aborting error at 7000 iterations on my own dataset. #314

Open zxDeepDiver opened 6 years ago

zxDeepDiver commented 6 years ago

Hi, I am using Detectron to train the end-to-end FPN-based Mask R-CNN on my own dataset, but I get the following error after several thousand iterations: [screenshot of the stack trace]

I reran the training script several times, but the error still occurs at some step after 5000 iterations.

According to the sixth line from the bottom, _ZN6caffe213GPUFallbackOpINS_6python8PythonOpINS_10CPUContextELb0EEENS_11SkipIndicesIJEEEE11RunOnDeviceEv, it looks like the error comes from the GPUFallbackOp operator.

Besides, I can run the example script on COCO and on other datasets of my own without any error.

Could you tell me how to debug and locate the mistakes? Thanks.
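For reference, the mangled frame can be decoded with a demangler; a rough sketch, assuming c++filt from binutils is on the PATH:

```python
# Rough sketch: decode the mangled C++ frame with binutils' c++filt.
# Assumes c++filt is installed; any Itanium-ABI demangler works.
import subprocess

symbol = ("_ZN6caffe213GPUFallbackOpINS_6python8PythonOpINS_10CPUContextELb0EEE"
          "NS_11SkipIndicesIJEEEE11RunOnDeviceEv")
print(subprocess.check_output(["c++filt", symbol]).strip())
# Prints roughly:
# caffe2::GPUFallbackOp<caffe2::python::PythonOp<caffe2::CPUContext, false>,
#                       caffe2::SkipIndices<> >::RunOnDevice()
```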

AutomanHan commented 6 years ago

The version of some package may not satisfy the requirements. I met the same error, and it was caused by the version of pyyaml: mine was 3.10. I upgraded it and the problem was solved.

pip install numpy>=1.13 pyyaml>=3.12 matplotlib opencv-python>=3.2 setuptools Cython mock scipy
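For what it's worth, a quick sketch to print the installed versions and compare them against those minimums (the distribution names are my guesses based on the pip command above; pkg_resources ships with setuptools):

```python
# Quick sketch: print installed versions of the packages listed above.
# Distribution names are assumptions based on the pip install command.
import pkg_resources

for name, minimum in [("numpy", "1.13"), ("PyYAML", "3.12"),
                      ("opencv-python", "3.2"), ("matplotlib", None),
                      ("Cython", None), ("mock", None), ("scipy", None)]:
    try:
        version = pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        version = "NOT INSTALLED"
    print("%-15s %s (want >= %s)" % (name, version, minimum or "any"))
```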

zxDeepDiver commented 6 years ago

@AutomanHan Thank you for your reply. However, I have checked the version of my pyyaml, and it is already 3.12. I think it might be some error in the dataset, but I don't know how to locate it.
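One way to start would be scanning the annotation JSON for entries that could upset the training code, such as degenerate boxes or near-empty polygons; a rough sketch (the path and the checks are placeholders, not Detectron requirements):

```python
# Rough sketch: scan a COCO-format annotation file for suspicious entries.
# The file path is a placeholder.
import json

with open("annotations/instances_train.json") as f:  # hypothetical path
    coco = json.load(f)

image_ids = set(img["id"] for img in coco["images"])
category_ids = set(cat["id"] for cat in coco["categories"])

for ann in coco["annotations"]:
    x, y, w, h = ann["bbox"]
    if w <= 0 or h <= 0:
        print("degenerate bbox:", ann["id"], ann["bbox"])
    if ann["image_id"] not in image_ids:
        print("annotation references missing image:", ann["id"])
    if ann["category_id"] not in category_ids:
        print("annotation references missing category:", ann["id"])
    segm = ann.get("segmentation", [])
    if isinstance(segm, list) and any(len(poly) < 6 for poly in segm):
        # a polygon needs at least 3 (x, y) points to describe a region
        print("suspicious polygon:", ann["id"])
```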

blateyang commented 6 years ago

@zxDeepDiver I have almost the same problem as yours (see the bottom). Training an R-50-FPN based Faster R-CNN on my own dataset always aborts randomly. I have also checked the correctness of the COCO JSON file of my own dataset. I guess there may be something wrong in accessing or manipulating memory, but I don't know where it is. Hope someone who knows this error can help us. Many thanks. [screenshot: 2018-04-10 12-53-28]
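Since the crash is a SIGSEGV in native code, one idea would be to let the process write a core dump and open it with gdb afterwards to see the full native stack; a minimal sketch (putting it at the top of tools/train_net.py is only a suggestion):

```python
# Minimal sketch: raise the core-dump limit so a SIGSEGV leaves a core file.
# Inspect the resulting file afterwards, e.g. with `gdb python core`.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print("core dump size limit set to", hard)
```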

shiyongde commented 6 years ago

The same problem for me; it only happens with the Faster R-CNN model.

Aborted at 1529752775 (unix time) try "date -d @1529752775" if you are using GNU date
PC: @ 0x7fde67a8aacd __libc_malloc
SIGSEGV (@0x0) received by PID 28203 (TID 0x7fdd2b72e700) from PID 0; stack trace:
@ 0x7fde67864390 (unknown)
@ 0x7fde67a8aacd __libc_malloc
@ 0x7fde5e4d6c51 (unknown)
@ 0x7fde5e5392d2 (unknown)
@ 0x7fde5e53938a (unknown)
@ 0x7fde300fe086 (unknown)
@ 0x7fde300fe5ef (unknown)
@ 0x4ae223 PyObject_CallFunctionObjArgs
@ 0x7fde5e5f0c7b (unknown)
@ 0x509d5c PyNumber_Add
@ 0x4c1c39 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4c136f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4d54b9 (unknown)
@ 0x4eebee (unknown)
@ 0x4a577e PyObject_Call
@ 0x4c5e10 PyEval_CallObjectWithKeywords
@ 0x7fdddf37cb70 pybind11::detail::object_api<>::operator()<>()
@ 0x7fdddf37e2f1 caffe2::python::PythonOpBase<>::RunOnDevice()
@ 0x7fdddf34658b caffe2::Operator<>::Run()
@ 0x7fdddf39187a _ZN6caffe213GPUFallbackOpINS_6python8PythonOpINS_10CPUContextELb0EEENS_11SkipIndicesIJEEEE11RunOnDeviceEv
@ 0x7fdddf38d0b5 caffe2::Operator<>::Run()
@ 0x7fdddd7edfda caffe2::DAGNet::RunAt()
@ 0x7fdddd7ecc91 caffe2::DAGNetBase::WorkerFunction()
@ 0x7fde66895c80 (unknown)
@ 0x7fde6785a6ba start_thread
@ 0x7fde6759041d clone
@ 0x0 (unknown)