bryanyzhu / Hidden-Two-Stream

Caffe implementation for "Hidden Two-Stream Convolutional Networks for Action Recognition"

Segmentation fault: while training #10

Closed EileenSchreiber closed 6 years ago

EileenSchreiber commented 6 years ago

The error occurs as soon as training starts.

Training data: UCF 101

System: CUDA 8.0, cuDNN 5.1, Ubuntu 16.04, OpenCV 3.3, GPU: 2 x NVIDIA 1060

layer {
  name: "FlowDeltasUClean6"
  type: "Concat"
  bottom: "FlowDeltasUClean6_0"
  bottom: "FlowDeltasUClean6_1"
  bottom: "FlowDeltasUClean6_2"
  bottom: "FlowDeltasUClean6_3"
  bottom: "FlowDeltasUClean6_4"
  bottom:
I0314 16:56:57.780658 12026 layer_factory.hpp:77] Creating layer data
I0314 16:56:57.780689 12026 net.cpp:91] Creating Layer data
I0314 16:56:57.780700 12026 net.cpp:400] data -> data
I0314 16:56:57.780725 12026 net.cpp:400] data -> label
I0314 16:56:57.780798 12026 multi_frame_data_layer.cpp:33] Opening file: ./train_rgb_split1.txt
I0314 16:56:57.785921 12026 multi_frame_data_layer.cpp:49] A total of 9537 videos.
Aborted at 1521043017 (unix time) try "date -d @1521043017" if you are using GNU date
PC: @ 0x7fc548430e67 cv::findDecoder()
SIGSEGV (@0x49) received by PID 12026 (TID 0x7fc558e71b00) from PID 73; stack trace:
@ 0x7fc55274e4b0 (unknown)
@ 0x7fc548430e67 cv::findDecoder()
@ 0x7fc548431a01 cv::imread()
@ 0x7fc548433e03 cv::imread()
@ 0x7fc557d4ef31 caffe::ReadSegmentMultiRGBToDatum()
@ 0x7fc557bcb49e caffe::MultiFrameDataLayer<>::DataLayerSetUp()
@ 0x7fc557b42753 caffe::BasePrefetchingDataLayer<>::LayerSetUp()
@ 0x7fc557afaac2 caffe::Net<>::Init()
@ 0x7fc557afc2e1 caffe::Net<>::Net()
@ 0x7fc557adbc3a caffe::Solver<>::InitTrainNet()
@ 0x7fc557adcf77 caffe::Solver<>::Init()
@ 0x7fc557add31a caffe::Solver<>::Solver()
@ 0x7fc557d06183 caffe::Creator_SGDSolver<>()
@ 0x40a728 train()
@ 0x4075e8 main
@ 0x7fc552739830 __libc_start_main
@ 0x407d59 _start
@ 0x0 (unknown)
Segmentation fault

Thanks in advance

bryanyzhu commented 6 years ago

This seems to be an OpenCV GPU usage issue, not a Caffe issue. I searched online and found this thread. Try setting the following environment variable to an empty value:

OPENCV_OPENCL_RUNTIME=
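For anyone launching training from a script rather than an interactive shell, the suggestion above could be applied roughly as follows. This is only a sketch: the helper names are made up for illustration, and the command passed in is whatever you normally use to start training.

```python
import os
import subprocess


def env_with_opencl_runtime_cleared(base_env=None):
    """Return a copy of the environment with OPENCV_OPENCL_RUNTIME set to an empty string."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENCV_OPENCL_RUNTIME"] = ""  # the empty value suggested in the reply above
    return env


def run_with_opencl_runtime_cleared(command):
    """Run `command` (a list of arguments) with the variable cleared and return its exit code."""
    return subprocess.run(command, env=env_with_opencl_runtime_cleared()).returncode
```

The variable is only changed for the child process, so the rest of the session keeps its original environment.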

If that doesn't work, try running make runtest after make all. This checks whether OpenCV 3 was compiled correctly with Caffe.

Overall, this is an OpenCV issue. Incompatibilities between OpenCV and Caffe can get very complicated. Try different versions of OpenCV and cuDNN, and run make clean before you run make again.

EileenSchreiber commented 6 years ago

Thank you for your fast reply. I tried running make runtest, but now I get a lot of different errors.

No.1:
[----------] 5 tests from ImageDataLayerTest/1, where TypeParam = caffe::CPUDevice
[ RUN ] ImageDataLayerTest/1.TestSpace
[ OK ] ImageDataLayerTest/1.TestSpace (93 ms)
[ RUN ] ImageDataLayerTest/1.TestShuffle
[ OK ] ImageDataLayerTest/1.TestShuffle (128 ms)
[ RUN ] ImageDataLayerTest/1.TestRead
[ OK ] ImageDataLayerTest/1.TestRead (144 ms)
[ RUN ] ImageDataLayerTest/1.TestReshape
[ OK ] ImageDataLayerTest/1.TestReshape (58 ms)
[ RUN ] ImageDataLayerTest/1.TestResize
Aborted at 1521107516 (unix time) try "date -d @1521107516" if you are using GNU date
PC: @ 0x7fa59e048acf cv::resize()
SIGSEGV (@0x1010000) received by PID 9571 (TID 0x7fa59fedfb00) from PID 16842752; stack trace:
@ 0x7fa5965e34b0 (unknown)
@ 0x7fa59e048acf cv::resize()
@ 0x7fa5975354c7 caffe::ReadImageToCVMat()
@ 0x7fa597498dde caffe::ImageDataLayer<>::DataLayerSetUp()
@ 0x7fa59732f963 caffe::BasePrefetchingDataLayer<>::LayerSetUp()
@ 0x482f2f caffe::Layer<>::SetUp()
@ 0x84ea87 caffe::ImageDataLayerTest_TestResize_Test<>::TestBody()
@ 0x91bd73 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x91538a testing::Test::Run()
@ 0x9154d8 testing::TestInfo::Run()
@ 0x9155b5 testing::TestCase::Run()
@ 0x91688f testing::internal::UnitTestImpl::RunAllTests()
@ 0x916bb3 testing::UnitTest::Run()
@ 0x46f35d main
@ 0x7fa5965ce830 __libc_start_main
@ 0x476e29 _start
@ 0x0 (unknown)
Makefile:527: recipe for target 'runtest' failed
make: *** [runtest] Segmentation fault

No.2:
[----------] 1 test from LayerFactoryTest/1, where TypeParam = caffe::CPUDevice
[ RUN ] LayerFactoryTest/1.TestCreateLayer
F0315 10:53:45.649158 14945 custom_data_layer.cpp:661] Check failed: !pthread_join(thread_, NULL) Pthread joining failed.
Check failure stack trace:
Aborted at 1521107625 (unix time) try "date -d @1521107625" if you are using GNU date
@ 0x7f77baf785cd google::LogMessage::Fail()
PC: @ 0x7f77af053a18 leveldb::MemTableIterator::value()
SIGSEGV (@0x0) received by PID 14945 (TID 0x7f77719d3700) from PID 0; stack trace:
@ 0x7f77baf7a433 google::LogMessage::SendToLog()
@ 0x7f77b79714b0 (unknown)
@ 0x7f77af053a18 leveldb::MemTableIterator::value()
@ 0x7f77baf7815b google::LogMessage::Flush()
@ 0x7f77af04d452 (unknown)
@ 0x7f77baf7ae1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f77b889bf0e _ZN5caffe2db13LevelDBCursor5valueB5cxx11Ev
@ 0x7f77b8803622 caffe::CustomDataLayer<>::JoinPrefetchThread()
@ 0x7f77b885f1fe caffe::DataReader::Body::read_one()
@ 0x7f77b8805a20 caffe::CustomDataLayer<>::~CustomDataLayer()
@ 0x7f77b885f6ad caffe::DataReader::Body::InternalThreadEntry()
@ 0x7f77b8805ce9 caffe::CustomDataLayer<>::~CustomDataLayer()
@ 0x7f77b88e8615 caffe::InternalThread::entry()
@ 0x8c75be caffe::LayerFactoryTest_TestCreateLayer_Test<>::TestBody()
@ 0x91bd73 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x91538a testing::Test::Run()
@ 0x9154d8 testing::TestInfo::Run()
@ 0x7f77b9c185d5 (unknown)
@ 0x9155b5 testing::TestCase::Run()
@ 0x7f77b7d0d6ba start_thread
@ 0x91688f testing::internal::UnitTestImpl::RunAllTests()
@ 0x7f77b7a4341d clone
@ 0x916bb3 testing::UnitTest::Run()
@ 0x46f35d main
@ 0x0 (unknown)
Makefile:527: recipe for target 'runtest' failed
make: *** [runtest] Segmentation fault

No.3:
[----------] 5 tests from MemoryDataLayerTest/0, where TypeParam = caffe::CPUDevice
[ RUN ] MemoryDataLayerTest/0.AddDatumVectorDefaultTransform
[ OK ] MemoryDataLayerTest/0.AddDatumVectorDefaultTransform (0 ms)
[ RUN ] MemoryDataLayerTest/0.AddMatVectorDefaultTransform
Aborted at 1521107689 (unix time) try "date -d @1521107689" if you are using GNU date
PC: @ 0x7f5d41c2d746 strlen
SIGSEGV (@0x527) received by PID 15678 (TID 0x7f5d4b4d2b00) from PID 1319; stack trace:
@ 0x7f5d41bd74b0 (unknown)
@ 0x7f5d41c2d746 strlen
@ 0x7f5d41bee980 (unknown)
@ 0x7f5d41bef4a6 _IO_vfprintf
@ 0x7f5d41cb8754 __vsprintf_chk
@ 0x7f5d4aff11ec _ZN2cv6formatB5cxx11EPKcz
@ 0x7f5d4aff12ab cv::Exception::formatMessage()
@ 0x7f5d49c9ac51 cv::Exception::Exception()
@ 0x7f5d49c9a67c cv::error()
@ 0x7f5d49c490fa cv::InputArray::getMat()
@ 0x7f5d49bc8bb8 cv::RNG::fill()
@ 0x7f5d49bcc410 cv::randu()
@ 0x579ccb caffe::MemoryDataLayerTest_AddMatVectorDefaultTransform_Test<>::TestBody()
@ 0x91bd73 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x91538a testing::Test::Run()
@ 0x9154d8 testing::TestInfo::Run()
@ 0x9155b5 testing::TestCase::Run()
@ 0x91688f testing::internal::UnitTestImpl::RunAllTests()
@ 0x916bb3 testing::UnitTest::Run()
@ 0x46f35d main
@ 0x7f5d41bc2830 __libc_start_main
@ 0x476e29 _start
@ 0x0 (unknown)
Makefile:527: recipe for target 'runtest' failed
make: *** [runtest] Segmentation fault

No.4:
[----------] 1 test from HardSpatialTransformerLayerTest/1, where TypeParam = caffe::CPUDevice
[ RUN ] HardSpatialTransformerLayerTest/1.TestGradient
Spatial Transformer Layer:: LayerSetUp: Getting outputH and outputW
Spatial Transformer Layer:: LayerSetUp: outputH = 5, outputW = 5
Spatial Transformer Layer:: LayerSetUp: Getting pre-defined parameters
F0315 10:56:17.828809 16632 st_layer.cpp:102] Check failed: bottom[1]->count(1) + pre_defined_count == 6 The dimension of theta is not six! Only 50 + 0
Check failure stack trace:
@ 0x7f82d771c5cd google::LogMessage::Fail()
@ 0x7f82d771e433 google::LogMessage::SendToLog()
@ 0x7f82d771c15b google::LogMessage::Flush()
@ 0x7f82d771ee1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f82d4fb89d4 caffe::SpatialTransformerLayer<>::LayerSetUp()
@ 0x482f2f caffe::Layer<>::SetUp()
@ 0x4917ff caffe::GradientChecker<>::CheckGradientExhaustive()
@ 0x7b7612 caffe::HardSpatialTransformerLayerTest_TestGradient_Test<>::TestBody()
@ 0x91bd73 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x91538a testing::Test::Run()
@ 0x9154d8 testing::TestInfo::Run()
@ 0x9155b5 testing::TestCase::Run()
@ 0x91688f testing::internal::UnitTestImpl::RunAllTests()
@ 0x916bb3 testing::UnitTest::Run()
@ 0x46f35d main
@ 0x7f82d4100830 __libc_start_main
@ 0x476e29 _start
@ (nil) (unknown)
Makefile:527: recipe for target 'runtest' failed
make: *** [runtest] Aborted

Do you think all of them are caused by OpenCV 3 not being compiled correctly?

Thank you for your help.

bryanyzhu commented 6 years ago

I have encountered error 4 before; it is OK. The code still runs without any problem. Error 2 is about the LevelDB (ldb) package, so you could try reinstalling it.

The first and third errors are indeed OpenCV 3 issues. I think something went wrong when you compiled OpenCV 3 with GPU support. Try make clean, then compile OpenCV again.
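The way the errors are attributed above (OpenCV for 1 and 3, LevelDB for 2) can be roughly reproduced by looking for the first stack frame that belongs to a known library. The sketch below assumes the glog trace format seen in this thread (`@ 0x<address> <symbol>`). The function name and the prefix table are illustrative only.

```python
import re

# One stack frame in the traces above: "@ 0x<address> <symbol>".
# "SIGSEGV (@0x49)" and "date -d @152..." do not match because no space follows the "@".
FRAME = re.compile(r"@\s+0x[0-9a-f]+\s+(\S+)")

# Illustrative mapping from C++ namespace prefix to library name.
KNOWN_PREFIXES = {"cv::": "OpenCV", "leveldb::": "LevelDB"}


def first_library_frame(trace):
    """Return (library, symbol) for the first frame with a known prefix, or None."""
    for symbol in FRAME.findall(trace):
        for prefix, library in KNOWN_PREFIXES.items():
            if symbol.startswith(prefix):
                return library, symbol
    return None
```

Scanning past the very first frame matters for error 3, where the crash address is inside `strlen` and the OpenCV frames only appear a few entries later. Error 4 has no OpenCV or LevelDB frames at all, so the function returns None for it.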

EileenSchreiber commented 6 years ago

Thanks again for all your help.

It is finally working, up to the point where an out-of-memory error occurs. I tried to reduce the batch_size in Hidden-Two-Stream/models/ucf101_split1_unsup_end/end_train_val.prototxt, but it tells me: "Mask batch size not the same as input batch size". Where else do I have to change the batch size?

Thank you in advance.

bryanyzhu commented 6 years ago

Hi, great to know that the code finally works.

You also need to change the batch size here, from line 241 to line 444. For each mask I defined, I hard-coded it to be 8. Since you have the out-of-memory issue, you need to change these params as well. Sorry for the hard-coding; I didn't know how to generate a .prototxt using Python at that time. Hope this helps.
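Rather than editing every mask by hand, the hard-coded value could be rewritten with a small script like the sketch below. The thread does not say which field inside the mask layers holds the 8, so the `key` argument (and the `batch_size` name used in the usage note) is an assumption to adapt after looking at the file. The function name is hypothetical too.

```python
import re


def replace_value_in_line_range(text, first_line, last_line, key, old, new):
    """Replace `key: old` with `key: new` on lines first_line..last_line (1-indexed, inclusive)."""
    # Match the key, a colon, and exactly the old integer (so 8 does not match 80 or 8.5).
    pattern = re.compile(r"(\b%s\s*:\s*)%d(?![\w.])" % (re.escape(key), old))
    lines = text.split("\n")
    for i in range(first_line - 1, min(last_line, len(lines))):
        lines[i] = pattern.sub(lambda m: m.group(1) + str(new), lines[i])
    return "\n".join(lines)
```

For example, after reading the prototxt into a string, calling `replace_value_in_line_range(text, 241, 444, "batch_size", 8, 4)` would change only matching entries in that line range and leave the rest of the file untouched. The new value has to equal the batch size set in the data layer, otherwise the "Mask batch size not the same as input batch size" check fails again.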

EileenSchreiber commented 6 years ago

Thank you so much for all your help.

So far, it's training. :+1: