TimoSaemann / caffe-segnet-cudnn5

This repository was a fork of BVLC/caffe and includes the upsample, bn, dense_image_data and softmax_with_loss (with class weighting) layers of caffe-segnet (https://github.com/alexgkendall/caffe-segnet) to run SegNet with cuDNN version 5.

A problem when make runtest #4

Closed jamiesoung closed 5 years ago

jamiesoung commented 7 years ago

Issue summary

[----------] 1 test from LayerFactoryTest/2, where TypeParam = caffe::GPUDevice
[ RUN      ] LayerFactoryTest/2.TestCreateLayer
Aborted at 1483801360 (unix time) try "date -d @1483801360" if you are using GNU date
PC: @ 0x7f3458fca962 (unknown)
SIGSEGV (@0x118) received by PID 1777 (TID 0x7f346b689800) from PID 280; stack trace:
    @     0x7f3459321390  (unknown)
    @     0x7f3458fca962  (unknown)
    @     0x7f3459cd67a5  caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
    @     0x7f3459d99e09  caffe::DataLayer<>::~DataLayer()
    @           0x4ec5e8  caffe::LayerFactoryTest_TestCreateLayer_Test<>::TestBody()
    @           0x8f63d3  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @           0x8f01ea  testing::Test::Run()
    @           0x8f0338  testing::TestInfo::Run()
    @           0x8f0415  testing::TestCase::Run()
    @           0x8f162f  testing::internal::UnitTestImpl::RunAllTests()
    @           0x8f1943  testing::UnitTest::Run()
    @           0x46dacd  main
    @     0x7f3458f67830  (unknown)
    @           0x475509  _start
Makefile:526: recipe for target 'runtest' failed
make: *** [runtest] Segmentation fault (core dumped)

Steps to reproduce

make runtest -j16

Your system configuration

Operating system: Ubuntu 16.04
Compiler: GCC 5.3
CUDA version (if applicable): 8.0
CUDNN version (if applicable): v5.1
BLAS: atlas
Python or MATLAB version (for pycaffe and matcaffe respectively): Python 2.7

TimoSaemann commented 7 years ago

I cannot reproduce that error. I tried it on 3 different machines and no error occurred:

Ubuntu 14.04, CUDA 8.0, Titan X (Pascal), cuDNN v.4 /v.5 /v.5.1, compiled with cmake and make
Ubuntu 14.04, CUDA 7.5, Titan X (Maxwell), cuDNN v.4 /v.5 /v.5.1, compiled with cmake and make
Ubuntu 16, CUDA 8.0, GTX 980, cuDNN v.5.1, compiled with cmake

Did you compile it with cmake or make? Did you change anything in your Makefile.config other than uncommenting the cuDNN flag? Can you test and train SegNet anyway, and if not, which errors do you encounter?

nathanin commented 7 years ago

Hello, I have now run into this error while building on an AWS g2.2xlarge instance (Ubuntu 16.04, CUDA 8, cuDNN 5.1). I was able to make and pass all tests on another Ubuntu 16.04 machine, which has a K4000 GPU. My OS X El Capitan laptop with the measly GeForce GT 650M was also able to make and pass all tests.

I have used make all the time.

In Makefile.config, I changed nothing except uncommenting the cuDNN flag and adding /usr/include/hdf5/serial to INCLUDE_DIRS.

Running the SegNet-Tutorial basic version training gives me the following output:

~/SegNet-Tutorial/Models$ ~/caffe-segnet-cudnn5/build/tools/caffe train --solver ./segnet_basic_solver.prototxt
I0122 20:35:36.243996  3013 caffe.cpp:217] Using GPUs 0
I0122 20:35:36.531260  3013 caffe.cpp:222] GPU 0: GRID K520
F0122 20:35:36.669652  3013 solver_factory.hpp:76] Check failed: registry.count(type) == 1 (0 vs. 1) Unknown solver type: SGD (known types: )
*** Check failure stack trace: ***
    @     0x7f75cac2b5cd  google::LogMessage::Fail()
    @     0x7f75cac2d433  google::LogMessage::SendToLog()
    @     0x7f75cac2b15b  google::LogMessage::Flush()
    @     0x7f75cac2de1e  google::LogMessageFatal::~LogMessageFatal()
    @           0x41cd2a  train()
    @           0x417678  main
    @     0x7f75c7899830  __libc_start_main
    @           0x418dc9  _start
    @              (nil)  (unknown)
Aborted (core dumped)

Any ideas?

TimoSaemann commented 7 years ago

Does the same error occur when you compile it with cmake? Have you compiled caffe (master branch) on this machine, and can you train with it successfully?

nathanin commented 7 years ago

@TimoSaemann Thank you for the reply. I have just tried with cmake and eventually got the exact same error in runtest.

Master branch compiles and seems to train no problem. I used make to build it, and ran the MNIST example with no issues.

This time I've got some new output:

[----------] 12 tests from DataLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] DataLayerTest/3.TestReadCropTrainLevelDB
*** Error in `/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin': free(): invalid pointer: 0x00007f86eae3a7a0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f86ea5527e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f86ea55ae0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f86ea55e98c]
/home/ubuntu/caffe-segnet-cudnn5/build/lib/libcaffe.so.1.0.0-rc3(_ZN5caffe24BasePrefetchingDataLayerIdED1Ev+0x37)[0x7f86f1492757]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN5caffe13DataLayerTestINS_9GPUDeviceIdEEE12TestReadCropENS_5PhaseE+0x8f6)[0xb23ab6]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x43)[0xde5923]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing4Test3RunEv+0xba)[0xdde85a]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8TestInfo3RunEv+0x118)[0xdde9a8]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8TestCase3RunEv+0xe5)[0xddeab5]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x22f)[0xde064f]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8UnitTest3RunEv+0x43)[0xde0973]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(main+0x17d)[0x891abd]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f86ea4fb830]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_start+0x29)[0x8973b9]
======= Memory map: ========
00400000-00fcd000 r-xp 00000000 ca:01 813068                             /home/ubuntu/caffe-segnet-cudnn5/.build_release/test/test.testbin
011cc000-0122f000 r--p 00bcc000 ca:01 813068                             /home/ubuntu/caffe-segnet-cudnn5/.build_release/test/test.testbin
0122f000-01231000 rw-p 00c2f000 ca:01 813068                             /home/ubuntu/caffe-segnet-cudnn5/.build_release/test/test.testbin
01231000-01232000 rw-p 00000000 00:00 0
02212000-07380000 rw-p 00000000 00:00 0                                  [heap]
200000000-200100000 rw-s 36092000 00:06 395                              /dev/nvidiactl

This is followed by a long output (I'm not sure what it is) and finally the original error.

srinivasnisha commented 7 years ago

It might be failing because of the presence of multiple GPUs in your system. Try:

export CUDA_VISIBLE_DEVICES=0
echo $CUDA_VISIBLE_DEVICES

The echo should show 0. Then run:

make runtest -j8
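
To check whether the multiple-GPU hypothesis even applies, it can help to list the GPUs the driver sees before restricting CUDA to one of them. The following is a minimal sketch, assuming nvidia-smi is installed with the driver; the nvidia-smi check is an addition for illustration, not part of the suggestion above, and the make command is left commented out so the snippet can be pasted anywhere.

```shell
#!/bin/sh
# List the GPUs visible to the driver, if nvidia-smi is available.
# (Assumption: nvidia-smi ships with the installed NVIDIA driver.)
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi is installed but could not query the driver"
else
    echo "nvidia-smi not found; skipping GPU listing"
fi

# Expose only the first GPU (index 0) to CUDA programs started from this shell.
export CUDA_VISIBLE_DEVICES=0
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# Then re-run the tests from the repository root:
# make runtest -j8
```

If the listing shows only one GPU, the variable should make no difference and the cause is probably elsewhere.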

hopkinskong commented 6 years ago

sudo apt-get install libtcmalloc-minimal4

Then add it to the LD_PRELOAD variable and run make runtest again:

export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

You may need to add it to ~/.bashrc too.
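
The library path can differ between systems, so a slightly more defensive variant checks that the file exists before preloading it. This is only a sketch: the second candidate path (the multiarch directory) and the variable name tcmalloc_lib are assumptions for illustration, not something confirmed in this thread.

```shell
#!/bin/sh
# Look for libtcmalloc_minimal in a couple of candidate locations.
# The first path is the one given above; the second is an assumed
# multiarch location that may or may not exist on your system.
tcmalloc_lib=""
for candidate in \
    /usr/lib/libtcmalloc_minimal.so.4 \
    /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
do
    if [ -f "$candidate" ]; then
        tcmalloc_lib="$candidate"
        break
    fi
done

if [ -n "$tcmalloc_lib" ]; then
    # Preload tcmalloc for every program started from this shell.
    export LD_PRELOAD="$tcmalloc_lib"
    echo "Preloading $tcmalloc_lib"
else
    echo "libtcmalloc_minimal.so.4 not found; is libtcmalloc-minimal4 installed?"
fi

# Then re-run the tests from the repository root:
# make runtest
```

Checking for the file first avoids the loader warning that appears on every command when LD_PRELOAD points at a missing library.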