NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Object Detection on KITTI dataset Error Code -11 #1833

Open arthurlobo opened 7 years ago

arthurlobo commented 7 years ago

I am getting an error code -11 (invalid memory reference) when starting to train a DetectNet on the KITTI dataset.
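A side note on the code itself, assuming the usual Linux convention: a negative return code is the number of the signal that killed the process, and signal 11 is SIGSEGV (segmentation fault), which matches the "invalid memory reference" description. The mapping can be checked from a shell:

```
kill -l 11   # prints "SEGV": signal 11 is SIGSEGV, a segmentation fault
```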

My environment is:

DIGITS 6.0.0-rc.2
NV-Caffe 0.15.14
OpenCV 3.0.0
cuDNN 7.0.2
CUDA 8.0.64
Protobuf 3.2
NVIDIA graphics driver 384.9
Ubuntu 16.04 LTS
GTX 1070

and I am using Batch size = 4 and Batch accumulation = 6 (an effective batch size of 4 × 6 = 24).

Also, while building NV-Caffe, I commented out "USE_OPENCV := 0" and uncommented "OPENCV_VERSION := 3" and "WITH_PYTHON_LAYER := 1" in Makefile.config.
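For reference, the relevant Makefile.config lines after those edits would look roughly like this (exact line positions vary between NV-Caffe releases):

```
# USE_OPENCV := 0        <- left commented out, so OpenCV support stays enabled
OPENCV_VERSION := 3      # build against OpenCV 3.x instead of the default 2.4
WITH_PYTHON_LAYER := 1   # enable Python layers (DetectNet's cluster layer is a Python layer)
```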

The following is the output of the last few lines of caffe_output.log:

I0925 15:51:06.247201 25944 net.cpp:94] Creating Layer bbox_loss
I0925 15:51:06.247203 25944 net.cpp:435] bbox_loss <- bboxes-obj-masked-norm
I0925 15:51:06.247205 25944 net.cpp:435] bbox_loss <- bbox-obj-label-norm
I0925 15:51:06.247207 25944 net.cpp:409] bbox_loss -> loss_bbox
I0925 15:51:06.247241 25944 net.cpp:144] Setting up bbox_loss
I0925 15:51:06.247243 25944 net.cpp:151] Top shape: (1)
I0925 15:51:06.247246 25944 net.cpp:154] with loss weight 2
I0925 15:51:06.247251 25944 net.cpp:159] Memory required for data: 2549998340
I0925 15:51:06.247253 25944 layer_factory.hpp:77] Creating layer coverage_loss
I0925 15:51:06.247256 25944 net.cpp:94] Creating Layer coverage_loss
I0925 15:51:06.247257 25944 net.cpp:435] coverage_loss <- coverage_coverage/sig_0_split_0
I0925 15:51:06.247259 25944 net.cpp:435] coverage_loss <- coverage-label_slice-label_4_split_0
I0925 15:51:06.247262 25944 net.cpp:409] coverage_loss -> loss_coverage
I0925 15:51:06.247285 25944 net.cpp:144] Setting up coverage_loss
I0925 15:51:06.247287 25944 net.cpp:151] Top shape: (1)
I0925 15:51:06.247289 25944 net.cpp:154] with loss weight 1
I0925 15:51:06.247292 25944 net.cpp:159] Memory required for data: 2549998344
I0925 15:51:06.247293 25944 layer_factory.hpp:77] Creating layer cluster
*** Aborted at 1506369066 (unix time) try "date -d @1506369066" if you are using GNU date ***
PC: @ 0x7fc4b085f873 std::_Hashtable<>::clear()
*** SIGSEGV (@0x9) received by PID 25944 (TID 0x7fc6e9c20740) from PID 9; stack trace: ***
    @ 0x7fc6e78624b0 (unknown)
    @ 0x7fc4b085f873 std::_Hashtable<>::clear()
    @ 0x7fc4b0851346 google::protobuf::DescriptorPool::FindFileByName()
    @ 0x7fc4b082fac8 google::protobuf::python::cdescriptor_pool::AddSerializedFile()
    @ 0x7fc6e84997d0 PyEval_EvalFrameEx
    @ 0x7fc6e85c201c PyEval_EvalCodeEx
    @ 0x7fc6e85183dd (unknown)
    @ 0x7fc6e84eb1e3 PyObject_Call
    @ 0x7fc6e850bae5 (unknown)
    @ 0x7fc6e84a2123 (unknown)
    @ 0x7fc6e84eb1e3 PyObject_Call
    @ 0x7fc6e849613c PyEval_EvalFrameEx
    @ 0x7fc6e85c201c PyEval_EvalCodeEx
    @ 0x7fc6e8490b89 PyEval_EvalCode
    @ 0x7fc6e85251b4 PyImport_ExecCodeModuleEx
    @ 0x7fc6e8525b8f (unknown)
    @ 0x7fc6e8527300 (unknown)
    @ 0x7fc6e85275c8 (unknown)
    @ 0x7fc6e85286db PyImport_ImportModuleLevel
    @ 0x7fc6e849f698 (unknown)
    @ 0x7fc6e84eb1e3 PyObject_Call
    @ 0x7fc6e85c1447 PyEval_CallObjectWithKeywords
    @ 0x7fc6e84945c6 PyEval_EvalFrameEx
    @ 0x7fc6e85c201c PyEval_EvalCodeEx
    @ 0x7fc6e8490b89 PyEval_EvalCode
    @ 0x7fc6e85251b4 PyImport_ExecCodeModuleEx
    @ 0x7fc6e8525b8f (unknown)
    @ 0x7fc6e8527300 (unknown)
    @ 0x7fc6e85275c8 (unknown)
    @ 0x7fc6e85286db PyImport_ImportModuleLevel
    @ 0x7fc6e849f698 (unknown)
    @ 0x7fc6e84eb1e3 PyObject_Call
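As the log itself suggests, the Unix timestamp on the "Aborted at" line can be decoded with GNU date:

```
date -d @1506369066   # prints the local time of the crash, here 2017-09-25
```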

I had earlier trained the same DetectNet model with DIGITS 5 (with OpenCV 2.4.9) without any error codes.

Can anyone help?

arthurlobo commented 7 years ago

The issue has been solved. I reinstalled the DIGITS environment on a fresh Ubuntu 16.04.3 LTS installation. The NV-Caffe version now shows as 0.16.4, which is the only difference from the versions I reported earlier.
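In case it helps anyone comparing setups: one way to confirm which NV-Caffe build DIGITS is actually importing (assuming the caffe Python module is on your PYTHONPATH and exposes __version__, as NV-Caffe releases in this range do) is:

```
# Print the version of the caffe Python module that DIGITS will import
python -c "import caffe; print(caffe.__version__)"
```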

szm-R commented 7 years ago

Sorry, does anyone know the reason for this problem? I'm facing the exact same error, but with a custom dataset. I've been using DIGITS for a while and have trained object detection models with it without a problem. I recently reinstalled Ubuntu (16.04.3 LTS), so I had to reinstall DIGITS and NVcaffe as well. I'm using the following versions for these two:

DIGITS: 6.0.0-rc.2
NVcaffe: 0.15.14

The error occurs when the network reaches the cluster layer:

I1008 09:16:17.589089 3765 layer_factory.hpp:77] Creating layer cluster
*** Aborted at 1507441578 (unix time) try "date -d @1507441578" if you are using GNU date ***
PC: @ 0x7fabd0234873 std::_Hashtable<>::clear()
*** SIGSEGV (@0x9) received by PID 3765 (TID 0x7fac8c174740) from PID 9; stack trace: ***
    @ 0x7fac89de14b0 (unknown)
    @ 0x7fabd0234873 std::_Hashtable<>::clear()
    @ 0x7fabd0226346 google::protobuf::DescriptorPool::FindFileByName()
    @ 0x7fabd0204ac8 google::protobuf::python::cdescriptor_pool::AddSerializedFile()
    @ 0x7fac8aa197d0 PyEval_EvalFrameEx
    @ 0x7fac8ab4201c PyEval_EvalCodeEx
    @ 0x7fac8aa983dd (unknown)
    @ 0x7fac8aa6b1e3 PyObject_Call
    @ 0x7fac8aa8bae5 (unknown)
    @ 0x7fac8aa22123 (unknown)
    @ 0x7fac8aa6b1e3 PyObject_Call
    @ 0x7fac8aa1613c PyEval_EvalFrameEx
    @ 0x7fac8ab4201c PyEval_EvalCodeEx
    @ 0x7fac8aa10b89 PyEval_EvalCode
    @ 0x7fac8aaa51b4 PyImport_ExecCodeModuleEx
    @ 0x7fac8aaa5b8f (unknown)
    @ 0x7fac8aaa7300 (unknown)
    @ 0x7fac8aaa75c8 (unknown)
    @ 0x7fac8aaa86db PyImport_ImportModuleLevel
    @ 0x7fac8aa1f698 (unknown)
    @ 0x7fac8aa6b1e3 PyObject_Call
    @ 0x7fac8ab41447 PyEval_CallObjectWithKeywords
    @ 0x7fac8aa145c6 PyEval_EvalFrameEx
    @ 0x7fac8ab4201c PyEval_EvalCodeEx
    @ 0x7fac8aa10b89 PyEval_EvalCode
    @ 0x7fac8aaa51b4 PyImport_ExecCodeModuleEx
    @ 0x7fac8aaa5b8f (unknown)
    @ 0x7fac8aaa7300 (unknown)
    @ 0x7fac8aaa75c8 (unknown)
    @ 0x7fac8aaa86db PyImport_ImportModuleLevel
    @ 0x7fac8aa1f698 (unknown)
    @ 0x7fac8aa6b1e3 PyObject_Call

This suggests that the problem may be in the cluster layer, but that layer works without a problem at test time (using the Test option with a pre-trained network). I have also been using NVcaffe with other programs (such as C++ inference code, and training classification models outside DIGITS), so I think NVcaffe itself is working, at least in those areas. I'm really at a loss, as the error isn't very informative...
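One observation, not something established in this thread: the crash happens inside google::protobuf::python::cdescriptor_pool::AddSerializedFile() while a Python module is being imported, so a mismatch between the protobuf library Caffe was built against and the protobuf Python package loaded at runtime is worth ruling out. A quick way to compare the two:

```
# Version of the protobuf compiler / C++ library on the system
protoc --version

# Version of the protobuf Python package that Caffe's Python layers import
python -c "import google.protobuf; print(google.protobuf.__version__)"

# Which implementation the Python package uses: the 'cpp' bindings are the
# ones that can crash on a version mismatch; 'python' is the pure-Python path
python -c "from google.protobuf.internal import api_implementation; print(api_implementation.Type())"
```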

szm-R commented 7 years ago

I also tried to build NVcaffe 0.16.4 (as suggested by @arthurlobo), even though the instructions in BuildCaffe.md suggest otherwise ("DIGITS is currently compatible with Caffe 0.15"). Nevertheless, building fails with this error:

/usr/include/c++/5/bits/hashtable.h(1526): error: no instance of overloaded function "std::forward" matches the argument list
            argument types are: (int)
          detected during:
            instantiation of "std::pair<std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::iterator, __nv_bool> std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::_M_emplace(std::true_type, _Args &&) [with _Key=int, _Value=std::pair<const int, boost::shared_ptr>, _Alloc=std::allocator<std::pair<const int, boost::shared_ptr>>, _ExtractKey=std::__detail::_Select1st, _Equal=std::equal_to, _H1=std::hash, _H2=std::__detail::_Mod_range_hashing, _Hash=std::__detail::_Default_ranged_hash, _RehashPolicy=std::__detail::_Prime_rehash_policy, _Traits=std::__umap_traits, _Args=<int &, boost::shared_ptr>]"
            (726): here
            instantiation of "std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__ireturn_type std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::emplace(_Args &&...) [with _Key=int, _Value=std::pair<const int, boost::shared_ptr>, _Alloc=std::allocator<std::pair<const int, boost::shared_ptr>>, _ExtractKey=std::__detail::_Select1st, _Equal=std::equal_to, _H1=std::hash, _H2=std::__detail::_Mod_range_hashing, _Hash=std::__detail::_Default_ranged_hash, _RehashPolicy=std::__detail::_Prime_rehash_policy, _Traits=std::__umap_traits, _Args=<int &, boost::shared_ptr>]"
            /usr/include/c++/5/bits/unordered_map.h(380): here
            instantiation of "std::pair<std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::iterator, __nv_bool> std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::emplace(_Args &&...) [with _Key=int, _Tp=boost::shared_ptr, _Hash=std::hash, _Pred=std::equal_to, _Alloc=std::allocator<std::pair<const int, boost::shared_ptr>>, _Args=<int &, boost::shared_ptr>]"
            /home/szm/Work/Caffe/nv-caffe/include/caffe/layers/cudnn_conv_layer.hpp(35): here
            instantiation of "T &caffe::map_ptr(int, caffe::PtrMap &, caffe::MutexVec &) [with T=caffe::GPUMemory::Workspace]"
            /home/szm/Work/Caffe/nv-caffe/src/caffe/layers/cudnn_conv_layer.cu(15): here
            instantiation of "void caffe::CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const std::vector<caffe::Blob *, std::allocator<caffe::Blob *>> &, const std::vector<caffe::Blob *, std::allocator<caffe::Blob *>> &) [with Ftype=float, Btype=float]"
            /home/szm/Work/Caffe/nv-caffe/src/caffe/layers/cudnn_conv_layer.cu(207): here

1 error detected in the compilation of "/tmp/tmpxft_0000400d_00000000-7_cudnn_conv_layer.cpp1.ii".

CMake Error at cuda_compile_generated_cudnn_conv_layer.cu.o.cmake:262 (message):
  Error generating file /home/szm/Work/Caffe/nv-caffe/build/src/caffe/CMakeFiles/cuda_compile.dir/layers/./cuda_compile_generated_cudnn_conv_layer.cu.o

src/caffe/CMakeFiles/caffe.dir/build.make:147: recipe for target 'src/caffe/CMakeFiles/cuda_compile.dir/layers/cuda_compile_generated_cudnn_conv_layer.cu.o' failed
make[2]: *** [src/caffe/CMakeFiles/cuda_compile.dir/layers/cuda_compile_generated_cudnn_conv_layer.cu.o] Error 1
CMakeFiles/Makefile2:272: recipe for target 'src/caffe/CMakeFiles/caffe.dir/all' failed
make[1]: *** [src/caffe/CMakeFiles/caffe.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
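This is speculation rather than anything confirmed in the thread, but template-instantiation failures inside libstdc++ headers under nvcc are often host-compiler/CUDA-toolkit compatibility problems, so it may be worth recording the exact toolchain versions when reporting the build failure:

```
# Record the toolchain involved in the failing nvcc compilation
gcc --version | head -n 1     # host compiler that nvcc hands host code to
nvcc --version | tail -n 1    # CUDA toolkit release string
cmake --version | head -n 1   # CMake version used to configure the build
```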

Please, can anybody guide me to the source of this problem?