TimoSaemann / ENet

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR #64


dheerajmk commented 6 years ago

I have implemented this ENet project on an NVIDIA Jetson TX2 with JetPack 3.0 (CUDA 8, cuDNN 5.1, Ubuntu 16.04). During training of the encoder, the error shown below arises. Some forums suggested running with `sudo`, which I did, but the error remains. Please suggest a solution.

My CMake summary is as follows:

Caffe Configuration Summary
-- General:
--   Version           : 1.0.0-rc3
--   Git               : 22d356c
--   System            : Linux
--   C++ compiler      : /usr/bin/c++
--   Release CXX flags : -O3 -DNDEBUG -fPIC -Wall -Wno-sign-compare -Wno-uninitialized
--   Debug CXX flags   : -g -fPIC -Wall -Wno-sign-compare -Wno-uninitialized
--   Build type        : Release
--   BUILD_SHARED_LIBS : ON
--   BUILD_python      : ON
--   BUILD_matlab      : OFF
--   BUILD_docs        : ON
--   CPU_ONLY          : OFF
--   USE_OPENCV        : ON
--   USE_LEVELDB       : ON
--   USE_LMDB          : ON
--   ALLOW_LMDB_NOLOCK : OFF
-- Dependencies:
--   BLAS              : Yes (Atlas)
--   Boost             : Yes (ver. 1.61)
--   glog              : Yes
--   gflags            : Yes
--   protobuf          : Yes (ver. 3.1.0)
--   lmdb              : Yes (ver. 0.9.17)
--   LevelDB           : Yes (ver. 1.18)
--   Snappy            : Yes (ver. 1.1.3)
--   OpenCV            : Yes (ver. 2.4.13)
--   CUDA              : Yes (ver. 8.0)
-- NVIDIA CUDA:
--   Target GPU(s)     : Auto
--   GPU arch(s)       : sm_62
--   cuDNN             : Yes (ver. 5.1.10)
-- Python:
--   Interpreter       : /usr/bin/python2.7 (ver. 2.7.12)
--   Libraries         : /usr/lib/aarch64-linux-gnu/libpython2.7.so (ver 2.7.12)
--   NumPy             : /usr/local/lib/python2.7/dist-packages/numpy/core/include (ver 1.14.2)
-- Documentaion:
--   Doxygen           : /usr/bin/doxygen (1.8.11)
--   config_file       : /home/nvidia/ENet/caffe-enet/.Doxyfile
-- Install:
--   Install path      : /home/nvidia/ENet/caffe-enet/build/install
-- Configuring done
-- Generating done
-- Build files have been written to: /home/nvidia/ENet/caffe-enet/build

The error at the encoder training stage is as follows:

I0411 10:37:39.844830  4349 layer_factory.hpp:77] Creating layer conv3_3_1
I0411 10:37:39.844874  4349 net.cpp:100] Creating Layer conv3_3_1
I0411 10:37:39.844895  4349 net.cpp:434] conv3_3_1 <- conv3_3_1_a
I0411 10:37:39.844918  4349 net.cpp:408] conv3_3_1 -> conv3_3_1
F0411 10:37:39.854919  4349 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
    @       0x7f98134718  google::LogMessage::Fail()
    @       0x7f98136614  google::LogMessage::SendToLog()
    @       0x7f98134290  google::LogMessage::Flush()
    @       0x7f98136eb4  google::LogMessageFatal::~LogMessageFatal()
    @       0x7f9840c988  caffe::CuDNNConvolutionLayer<>::LayerSetUp()
    @       0x7f98438774  caffe::Net<>::Init()
    @       0x7f98439ff0  caffe::Net<>::Net()
    @       0x7f9841b510  caffe::Solver<>::InitTestNets()
    @       0x7f9841bd84  caffe::Solver<>::Init()
    @       0x7f9841c034  caffe::Solver<>::Solver()
    @       0x7f98455c7c  caffe::Creator_AdamSolver<>()
    @           0x40c3cc  train()
    @           0x4093e0  main
    @       0x7f9777e8a0  __libc_start_main
Aborted (core dumped)

vsuryamurthy commented 6 years ago

If you have not solved the problem yet, I suggest you check the following:

i) Check whether the GPU has enough free memory. The image resolution or the batch size might be too large (see the sketch after this list).

ii) You might have to rename the last layer if you are using a different number of classes (this only applies if you are fine-tuning a pretrained network).
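A minimal sketch of check i); the file path `prototxts/enet_train_encoder.prototxt` and the parameter names are assumptions based on the ENet tutorial, so adjust them to your setup:

```sh
# Hypothetical path -- point this at your actual training prototxt.
# List the lines that drive GPU memory use (batch size, input resolution):
grep -n -E "batch_size|new_height|new_width" prototxts/enet_train_encoder.prototxt

# Then lower batch_size (e.g. 4 -> 2) in the file and restart training.
```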

WellXiong commented 6 years ago

If you have tried everything and still can't fix it, try `sudo rm -rf ~/.nv/` (this clears NVIDIA's per-user compute cache, which can become corrupted).

ASONG0506 commented 4 years ago

> If you have tried everything and still can't fix it, try `sudo rm -rf ~/.nv/`

That really worked for me, THX!

yuxwind commented 4 years ago

Actually, I was out of GPU memory. After killing some applications, the error was fixed.
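For anyone wanting to confirm this case, a sketch of checking GPU memory before training; note that `nvidia-smi` is a desktop tool and the Jetson TX2 does not ship it:

```sh
# Desktop GPU: show memory use and the processes holding it.
nvidia-smi

# Jetson TX2: no nvidia-smi; the CPU and GPU share RAM, so watch
# overall memory with tegrastats and close other applications first.
sudo tegrastats
```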

Xpangz commented 3 years ago

When I was troubled by this problem, my final fix was simply to change the GPU ID. Caffe defaults to GPU 0, so if your GPU 0 is occupied you should change the ID; otherwise the 'Aborted (core dumped)' error can occur.
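For reference, a sketch of picking a different device with the stock caffe binary; the solver file name here is just a placeholder:

```sh
# Caffe trains on GPU 0 by default; --gpu selects another device index.
caffe train --solver=enet_solver_encoder.prototxt --gpu=1
```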