TimoSaemann / caffe-segnet-cudnn5

This repository was a fork of BVLC/caffe and includes the upsample, bn, dense_image_data and softmax_with_loss (with class weighting) layers of caffe-segnet (https://github.com/alexgkendall/caffe-segnet) to run SegNet with cuDNN version 5.

Segmentation fault #3

Open bosmart opened 7 years ago

bosmart commented 7 years ago

I'm getting the following segmentation fault when running "make runtest". It works fine in the case of the original caffe-segnet (with cuDNN 3.0.8).

[ RUN ] LayerFactoryTest/2.TestCreateLayer
Aborted at 1483730734 (unix time) try "date -d @1483730734" if you are using GNU date
PC: @ 0x7fe5c0d9cf25 caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
SIGSEGV (@0x208) received by PID 8650 (TID 0x7fe5c15d5ac0) from PID 520; stack trace:
@ 0x7fe5c033a390 (unknown)
@ 0x7fe5c0d9cf25 caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
@ 0x7fe5c0e55099 caffe::DataLayer<>::~DataLayer()
@ 0xb49c08 caffe::LayerFactoryTest_TestCreateLayer_Test<>::TestBody()
@ 0xde7453 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0xde038a testing::Test::Run()
@ 0xde04d8 testing::TestInfo::Run()
@ 0xde05e5 testing::TestCase::Run()
@ 0xde217f testing::internal::UnitTestImpl::RunAllTests()
@ 0xde24a3 testing::UnitTest::Run()
@ 0x8905cd main
@ 0x7fe5ba028830 __libc_start_main
@ 0x8973a9 _start
@ 0x0 (unknown)
Segmentation fault (core dumped)
src/caffe/test/CMakeFiles/runtest.dir/build.make:57: recipe for target 'src/caffe/test/CMakeFiles/runtest' failed
make[3]: [src/caffe/test/CMakeFiles/runtest] Error 139
CMakeFiles/Makefile2:328: recipe for target 'src/caffe/test/CMakeFiles/runtest.dir/all' failed
make[2]: [src/caffe/test/CMakeFiles/runtest.dir/all] Error 2
CMakeFiles/Makefile2:335: recipe for target 'src/caffe/test/CMakeFiles/runtest.dir/rule' failed
make[1]: [src/caffe/test/CMakeFiles/runtest.dir/rule] Error 2
Makefile:240: recipe for target 'runtest' failed
make: [runtest] Error 2
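
Editor's note: the "Error 139" reported by make is consistent with the SIGSEGV in the trace, because POSIX shells commonly report a process killed by signal N as exit status 128 + N. Below is a minimal sketch for decoding such a status. The function name describe_exit_status is illustrative and is not part of Caffe or make.

```python
import signal


def describe_exit_status(status):
    """Map a shell-style exit status to a short description.

    Assumption: the status follows the common shell convention where
    "killed by signal N" is reported as 128 + N.
    """
    if status > 128:
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a signal on this platform.
            return "killed by unknown signal %d" % signum
    if status == 0:
        return "exited normally"
    return "exited with error code %d" % status


# On Linux, 128 + SIGSEGV is the 139 seen in the make output.
print(describe_exit_status(128 + signal.SIGSEGV))
```

The remaining "Error 2" lines are most likely just the parent make invocations reporting that a sub-make failed, so only the innermost status carries information about the crash itself.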

bosmart commented 7 years ago

I have just noticed this related issue: https://github.com/TimoSaemann/caffe-segnet-cudnn5/issues/2. My setup is Ubuntu 16.04.1 LTS, CUDA 8.0, GeForce 980Ti.

Interestingly enough, on my second machine (Ubuntu 16.04.1 LTS, CUDA 8.0, Tesla K40) it works without any issues.

TimoSaemann commented 7 years ago

I cannot reproduce that error. I tried it on 3 different machines and no error occurred:

  1. Ubuntu 14.04, CUDA 8.0, Titan X (Pascal), cuDNN v.4 /v.5 /v.5.1, compiled with cmake and make
  2. Ubuntu 14.04, CUDA 7.5, Titan X (Maxwell), cuDNN v.4 /v.5 /v.5.1, compiled with cmake and make
  3. Ubuntu 16, CUDA 8.0, GTX 980, cuDNN v.5.1, compiled with cmake

Did you compile it with cmake or make? Did you change anything in your Makefile.config other than uncommenting the cuDNN flag? Can you test and train SegNet anyway, or which errors do you encounter?

bosmart commented 7 years ago

I have compiled with cmake in both cases, i.e.:

  1. Ubuntu 16.04.1 - CUDA 8.0 - Tesla K40 (works fine)
  2. Ubuntu 16.04.1 - CUDA 8.0 - GeForce 980Ti or Titan X (produces the fault)

Interestingly enough, the fault only happens when the caffe process is terminating. It is able to complete the given number of iterations, save the snapshot, etc., and then throws the fault when exiting.

jgorgenucsd commented 7 years ago

I also get this segfault, with cudnn 5.05. As @bosmart mentioned, SegNet trains, saves the solver state, and then apparently caffe's BasePrefetchingDataLayer dies when destructing the model:

I0213 09:10:14.745064 29461 solver.cpp:322] Optimization Done.
I0213 09:10:14.745074 29461 caffe.cpp:254] Optimization Done.
Aborted at 1487005814 (unix time) try "date -d @1487005814" if you are using GNU date
PC: @ 0x7f6497727d1c (unknown)
SIGSEGV (@0xfffffff7) received by PID 29461 (TID 0x7f6499c259c0) from PID 18446744073709551607; stack trace:
@ 0x7f64976dbcb0 (unknown)
@ 0x7f6497727d1c (unknown)
@ 0x7f649951c68b caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
@ 0x7f64995eeb5b caffe::DenseImageDataLayer<>::~DenseImageDataLayer()
@ 0x7f64995eedb2 boost::detail::sp_counted_impl_p<>::dispose()
@ 0x40fcd1 caffe::Net<>::~Net()
@ 0x7f64994459e2 boost::detail::sp_counted_impl_p<>::dispose()
@ 0x7f64994ad4b1 caffe::SGDSolver<>::~SGDSolver()
@ 0x40dd59 boost::detail::shared_count::~shared_count()
@ 0x40b5d1 train()
@ 0x408363 main
@ 0x7f64976c6f45 (unknown)
@ 0x408ce1 (unknown)
@ 0x0 (unknown)
Segmentation fault (core dumped)
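
Editor's note: both traces in this thread pass through caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer(). To compare traces from different machines, the frames can be pulled out mechanically. The following is a rough sketch that assumes each frame is printed as "@ <hex address> <symbol>" after a "stack trace:" marker, as in the logs above. The helper names extract_frames and first_caffe_frame are hypothetical.

```python
import re

# A frame looks like "@ 0x<hex address> <symbol>". The symbol runs until
# the next "@ 0x" marker or the end of the text.
FRAME_RE = re.compile(r"@\s+(0x[0-9a-fA-F]+)\s+(.*?)(?=\s+@\s+0x|$)", re.DOTALL)


def extract_frames(trace):
    """Return (address, symbol) pairs found after the "stack trace:" marker.

    If the marker is absent, the whole text is scanned.
    """
    body = trace.split("stack trace:", 1)[-1]
    return [(addr, sym.strip()) for addr, sym in FRAME_RE.findall(body)]


def first_caffe_frame(trace):
    """Return the first symbol in the caffe namespace, or None."""
    for _, symbol in extract_frames(trace):
        if symbol.startswith("caffe::"):
            return symbol
    return None


# Abbreviated excerpt of the trace above, for illustration only.
sample = (
    "stack trace: @ 0x7f64976dbcb0 (unknown) @ 0x7f6497727d1c (unknown) "
    "@ 0x7f649951c68b caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer() "
    "@ 0x7f64995eeb5b caffe::DenseImageDataLayer<>::~DenseImageDataLayer() "
    "@ 0x408363 main"
)
print(first_caffe_frame(sample))
```

Any text that follows the last frame in a raw log (for example "Segmentation fault (core dumped)") ends up attached to the last symbol, so trim the log to the frame list first if that matters.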

ilia-nikiforov commented 7 years ago

Very similar issue here. CUDNN 5.1, CUDA 8.0, GeForce GTX 860M, Ubuntu 16.04. Various tests fail on runtest with both cmake and make, but SegNet runs and trains fine. However, if I'm using an LMDB data layer, I get a segmentation fault at the end of all runs, after everything is calculated and saved. If I put a "del net" statement in any Python script after initializing net, I get a segmentation fault. DenseImageData works fine, however. @bosmart @jgorgenucsd are you using DenseImageData input or some other type of input layer?
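
Editor's note: the "del net" observation suggests a way to reproduce the crash without a full training run, by destroying the net in a child interpreter and checking how the child died. This is a minimal sketch under stated assumptions: run_isolated is a hypothetical helper, the prototxt path is a placeholder, and the two-argument caffe.Net(prototxt, phase) constructor is assumed to be available in this fork's pycaffe.

```python
import signal
import subprocess
import sys


def run_isolated(snippet):
    """Run a Python snippet in a child interpreter.

    Returns the name of the signal that killed the child, or None if the
    child exited on its own (with any exit code).
    """
    result = subprocess.run([sys.executable, "-c", snippet])
    if result.returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as returncode -N.
        return signal.Signals(-result.returncode).name
    return None


# Hypothetical pycaffe snippet: 'train.prototxt' is a placeholder path and
# the constructor form is an assumption, not confirmed by this thread.
CAFFE_SNIPPET = """
import caffe
net = caffe.Net('train.prototxt', caffe.TRAIN)
del net  # drops the last reference, so the net and its layers are destroyed
print('net destroyed cleanly')
"""
```

Calling run_isolated(CAFFE_SNIPPET) once with a prototxt that uses an LMDB data layer and once with one that uses DenseImageData would show whether the crash follows the layer type, as described in the comment above. A return value of "SIGSEGV" means the child crashed; None means it exited by itself.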

xiaozai commented 7 years ago

Hi, I have exactly the same error. How did you solve it? Thanks.

drewbo commented 6 years ago

Having this same error (trains, saves solver state, fails); were you able to reproduce, @TimoSaemann? I can send along my full workflow shortly if that helps.

vsuryamurthy commented 6 years ago

I am having the same error when I use lmdb. Does anyone know the reason for the segmentation fault?

ilia-nikiforov commented 6 years ago

As with others here, my problem disappeared when I switched machines. My particular switch was from a laptop with a GTX860M to a desktop with a GTX1070.