Open zxDeepDiver opened 6 years ago
Some package versions may not satisfy the requirements. I hit the same errors, and they were caused by my pyyaml version, which was 3.10. Upgrading it solved the problem.
pip install 'numpy>=1.13' 'pyyaml>=3.12' matplotlib 'opencv-python>=3.2' setuptools Cython mock scipy
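To see which installed versions fall short before upgrading, a check along these lines may help. It is only a sketch: it assumes Python 3.8+ (for `importlib.metadata`; on Python 2, `pkg_resources.get_distribution(name).version` plays a similar role), the helper names are made up for illustration, and the minimums are just the ones quoted in the pip command above. The comparison is deliberately simple and ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the pip command quoted in this thread.
MINIMUMS = {"numpy": "1.13", "pyyaml": "3.12", "opencv-python": "3.2"}


def version_tuple(text):
    """Turn '3.12' or '1.13.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements(minimums=MINIMUMS):
    """Return {package: (installed version or None, requirement satisfied)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in sorted(check_requirements().items()):
        print(name, installed, "OK" if ok else "needs upgrade")
```

With pyyaml 3.10 installed, the pyyaml entry would be reported as needing an upgrade, which matches the situation described above.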
@AutomanHan Thank you for your reply. However, I have checked my pyyaml version, and it is already 3.12. I think the error might be related to the dataset, but I don't know how to locate it.
@zxDeepDiver I have almost the same problem as you (see the bottom). Training aborts at random points when I train an R-50-FPN based Faster R-CNN on my own dataset. I have also checked that the COCO JSON file for my dataset is correct. I suspect something is going wrong when memory is accessed or manipulated, but I don't know where. I hope someone who knows this error can help us. Many thanks.
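For anyone who wants to re-check a custom COCO-style JSON file, here is a minimal sketch of the kind of sanity checks that can be run. Nothing in this thread confirms that bad annotations cause the crash, so treat it as one way to rule the dataset out. The function names are hypothetical, and the sketch assumes the standard COCO layout with `images`, `categories`, and `annotations` lists and `bbox` given as `[x, y, width, height]`.

```python
import json


def find_suspect_annotations(coco):
    """Return a list of (annotation id, reason) for annotations that look wrong."""
    images = {img["id"]: img for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    problems = []
    for ann in coco.get("annotations", []):
        ann_id = ann.get("id")
        image = images.get(ann.get("image_id"))
        if image is None:
            problems.append((ann_id, "unknown image_id"))
            continue
        if ann.get("category_id") not in category_ids:
            problems.append((ann_id, "unknown category_id"))
        bbox = ann.get("bbox")
        if not bbox or len(bbox) != 4:
            problems.append((ann_id, "bbox is not [x, y, width, height]"))
            continue
        x, y, w, h = bbox
        if w <= 0 or h <= 0:
            # Zero-area or negative-size boxes are easy to miss by eye.
            problems.append((ann_id, "non-positive bbox size"))
        elif x < 0 or y < 0 or x + w > image["width"] or y + h > image["height"]:
            problems.append((ann_id, "bbox outside image"))
    return problems


def check_file(path):
    """Load a COCO-style JSON file and report suspect annotations."""
    with open(path) as handle:
        return find_suspect_annotations(json.load(handle))
```

Calling `check_file` on the annotation file returns an empty list when none of these checks fire. That does not prove the dataset is fine, only that these particular problems are absent.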
Same problem for me. It only happens with the Faster R-CNN model.
Aborted at 1529752775 (unix time) try "date -d @1529752775" if you are using GNU date
PC: @ 0x7fde67a8aacd __libc_malloc
SIGSEGV (@0x0) received by PID 28203 (TID 0x7fdd2b72e700) from PID 0; stack trace:
@ 0x7fde67864390 (unknown)
@ 0x7fde67a8aacd __libc_malloc
@ 0x7fde5e4d6c51 (unknown)
@ 0x7fde5e5392d2 (unknown)
@ 0x7fde5e53938a (unknown)
@ 0x7fde300fe086 (unknown)
@ 0x7fde300fe5ef (unknown)
@ 0x4ae223 PyObject_CallFunctionObjArgs
@ 0x7fde5e5f0c7b (unknown)
@ 0x509d5c PyNumber_Add
@ 0x4c1c39 PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4c16e7 PyEval_EvalFrameEx
@ 0x4c136f PyEval_EvalFrameEx
@ 0x4b9ab6 PyEval_EvalCodeEx
@ 0x4d54b9 (unknown)
@ 0x4eebee (unknown)
@ 0x4a577e PyObject_Call
@ 0x4c5e10 PyEval_CallObjectWithKeywords
@ 0x7fdddf37cb70 pybind11::detail::object_api<>::operator()<>()
@ 0x7fdddf37e2f1 caffe2::python::PythonOpBase<>::RunOnDevice()
@ 0x7fdddf34658b caffe2::Operator<>::Run()
@ 0x7fdddf39187a _ZN6caffe213GPUFallbackOpINS_6python8PythonOpINS_10CPUContextELb0EEENS_11SkipIndicesIJEEEE11RunOnDeviceEv
@ 0x7fdddf38d0b5 caffe2::Operator<>::Run()
@ 0x7fdddd7edfda caffe2::DAGNet::RunAt()
@ 0x7fdddd7ecc91 caffe2::DAGNetBase::WorkerFunction()
@ 0x7fde66895c80 (unknown)
@ 0x7fde6785a6ba start_thread
@ 0x7fde6759041d clone
@ 0x0 (unknown)
Hi, I am using Detectron to train the end-to-end FPN-based Mask R-CNN on my own dataset, but I hit the following error after several thousand iterations:
I reran the training script several times, but the error still occurs at some step after 5000 iterations.
According to the sixth line from the bottom, _ZN6caffe213GPUFallbackOpINS_6python8PythonOpINS_10CPUContextELb0EEENS_11SkipIndicesIJEEEE11RunOnDeviceEv, it looks like the error comes from GPUFallbackOp.
Besides, I can run the example script on COCO and on my other datasets successfully, without any errors.
Could you tell me how to debug and locate the mistake? Thanks.
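One possible way to narrow this down: the trace in this thread passes through caffe2::python::PythonOpBase<>::RunOnDevice() and several PyEval_* frames, so the segfault happens while a Python operator is running. The standard-library `faulthandler` module can dump the Python-level traceback when SIGSEGV arrives, which would show which Python function was executing. The sketch below assumes Python 3.3+ (on Python 2 a `faulthandler` backport on PyPI is the usual substitute, an assumption not verified here). The helper name and log file name are made up for illustration.

```python
import faulthandler


def enable_crash_tracebacks(log_path="crash_traceback.log"):
    """Dump Python tracebacks of all threads to log_path on a fatal signal.

    Returns the open file object; the caller must keep it open for as long
    as the handler is enabled.
    """
    log_file = open(log_path, "w")
    # all_threads=True matters here because the crash in the trace occurs
    # on a worker thread, not the main thread.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_tracebacks()` once near the top of the training script, before the training loop starts, should be enough. After the next crash, the log file should list the Python frames that were active, which can then be matched against the data-loading and Python-op code.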