Crash on SegNet tutorial

bosmart commented 7 years ago

Issue summary

SegNet training following the tutorial fails with caffe-segnet-cudnn5; it works fine with the original version of caffe-segnet.

Steps to reproduce

~/SegNet$ ./caffe-segnet-cudnn5/build/tools/caffe train -gpu 0 -solver ./Models/segnet_solver.prototxt -weights ~/SegNet/Models/VGG_ILSVRC_16_layers.caffemodel

I1216 09:13:55.123234 3520 caffe.cpp:217] Using GPUs 0
I1216 09:13:55.129607 3520 caffe.cpp:222] GPU 0: Tesla K40c
E1216 09:13:55.404479 3520 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
E1216 09:13:55.609668 3520 common.cpp:120] Cannot create Curand generator. Curand won't be available.
I1216 09:13:55.609848 3520 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 10000000
base_lr: 0.001
display: 20
max_iter: 40000
lr_policy: "step"
gamma: 1
momentum: 0.9
weight_decay: 0.0005
stepsize: 10000000
snapshot: 1000
snapshot_prefix: "/home/XXX/SegNet/Models/Training/segnet"
solver_mode: GPU
device_id: 0
net: "/home/XXX/SegNet/Models/segnet_train.prototxt"
train_state {
  level: 0
  stage: ""
}
test_initialization: false
I1216 09:13:55.616778 3520 solver.cpp:91] Creating training net from net file: /home/XXX/SegNet/Models/segnet_train.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:274] Error parsing text-format caffe.NetParameter: 7:26: Message type "caffe.LayerParameter" has no field named "dense_image_data_param".
F1216 09:13:55.616905 3520 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /home/XXX/SegNet/Models/segnet_train.prototxt
Check failure stack trace:
    @ 0x7f6aed6ab5cd google::LogMessage::Fail()
    @ 0x7f6aed6ad433 google::LogMessage::SendToLog()
    @ 0x7f6aed6ab15b google::LogMessage::Flush()
    @ 0x7f6aed6ade1e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f6aeddd3d61 caffe::ReadNetParamsFromTextFileOrDie()
    @ 0x7f6aedc2468b caffe::Solver<>::InitTrainNet()
    @ 0x7f6aedc25a77 caffe::Solver<>::Init()
    @ 0x7f6aedc25e1a caffe::Solver<>::Solver()
    @ 0x7f6aedd901f3 caffe::Creator_SGDSolver<>()
    @ 0x40c20a train()
    @ 0x4088d8 main
    @ 0x7f6aebf39830 __libc_start_main
    @ 0x4091a9 _start
    @ (nil) (unknown)
Aborted (core dumped)

Your system configuration

Operating system: Ubuntu x64 16.04.1 LTS
Compiler:
CUDA version (if applicable): 8.0
CUDNN version (if applicable): 5.1
BLAS:
Python or MATLAB version (for pycaffe and matcaffe respectively):

TimoSaemann commented 7 years ago

@bosmart Thank you for the report. I fixed that.

bosmart commented 7 years ago

Brilliant, it's working now!

I have noticed one other slight issue: the cudnn5 version seems to use more memory. With a batch size of 7 on the CamVid dataset I'm getting 10737MiB (cudnn3) vs 10795MiB (cudnn5); with batch size 8 I'm getting 12184MiB (cudnn3) vs an out-of-memory error (cudnn5). 12200MiB is what's available. So now I'm wondering whether cudnn5 just uses more memory, or whether there are other differences between caffe-segnet and caffe-segnet-cudnn5.

Thanks!

ronalddas commented 7 years ago

@bosmart Hi, what version of CUDA are you using? I am using CUDA 7.5 and cudnn5, and with batch size 6 I am getting 9331MiB. Also, it's taking around 4 seconds per iteration on the CamVid dataset.

bosmart commented 7 years ago

@ronalddas I'm using CUDA 8 (with both cudnn3 and cudnn5). With batch size 6 I'm getting 9287MiB (cudnn3) vs 9341MiB (cudnn5). Not sure how to easily measure iteration time though (my first time with caffe ;))
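The numbers reported so far are enough for a rough sanity check of the batch-size-8 failure. The sketch below assumes memory grows roughly linearly with batch size (an assumption, not something measured in this thread) and extrapolates the cudnn5 figures one step further; the helper names are made up for illustration.

```python
# MiB figures reported in this thread (batch size -> GPU memory in use).
cudnn3 = {6: 9287, 7: 10737, 8: 12184}
cudnn5 = {6: 9341, 7: 10795}
AVAILABLE_MIB = 12200  # what the reporter says is available on the K40c


def per_sample_cost(usage, small, large):
    """Average extra MiB per additional sample between two batch sizes."""
    return (usage[large] - usage[small]) / (large - small)


def extrapolate(usage, target):
    """Linear extrapolation from the two largest measured batch sizes.

    Assumes linear growth, which is only an approximation.
    """
    second, largest = sorted(usage)[-2:]
    step = per_sample_cost(usage, second, largest)
    return usage[largest] + step * (target - largest)


predicted = extrapolate(cudnn5, 8)
print("cudnn3 step 7->8:", per_sample_cost(cudnn3, 7, 8), "MiB")
print("cudnn5 predicted at batch 8:", predicted, "MiB")
print("exceeds available memory:", predicted > AVAILABLE_MIB)
```

Under that assumption, the roughly 55MiB of extra usage seen with cudnn5 would be enough to push batch size 8 past the limit, since cudnn3 already sits only 16MiB below it.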

TimoSaemann commented 7 years ago

@bosmart I think it is not enough to just move the include and lib64 folders to the right location; you also have to rebuild caffe-segnet-cudnn5. I am not able to build it with cudnn3, because cudnn3 is not supported anymore; you need at least cudnn4. Furthermore, I do not know whether memory usage differs between cudnn3 and cudnn5. But cudnn5 should be a lot faster.

By the way, you can measure the forward and backward pass times by using the "time" command of the caffe tool: http://caffe.berkeleyvision.org/tutorial/interfaces.html
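Another rough way to get seconds per training iteration is to read it off the training log itself. The sketch below assumes the log contains glog-prefixed lines of the form "... solver.cpp:NNN] Iteration N, ..." in the same prefix format as the log above; the sample lines and the function name are hypothetical, not taken from a real run.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
LINE_RE = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+\] Iteration (\d+),"
)


def seconds_per_iteration(log_lines, year=2016):
    """Average wall-clock seconds per iteration, or None if not computable.

    glog timestamps carry no year, so one has to be supplied.
    """
    first_seen = {}
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        mmdd, clock, iteration = match.group(1), match.group(2), int(match.group(3))
        stamp = datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
        # Several lines can mention the same iteration; keep the earliest.
        first_seen.setdefault(iteration, stamp)
    if len(first_seen) < 2:
        return None
    lo, hi = min(first_seen), max(first_seen)
    return (first_seen[hi] - first_seen[lo]).total_seconds() / (hi - lo)


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    "I1216 09:14:00.000000  3520 solver.cpp:228] Iteration 0, loss = 2.4",
    "I1216 09:14:00.000100  3520 sgd_solver.cpp:106] Iteration 0, lr = 0.001",
    "I1216 09:15:20.000000  3520 solver.cpp:228] Iteration 20, loss = 1.9",
]
print(seconds_per_iteration(SAMPLE_LOG))
```

This measures wall-clock time between logged iterations, so it includes data loading and snapshotting, unlike a pure forward/backward benchmark.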

TimoSaemann commented 7 years ago

FYI: I have added a note to the README.md:

If you would like to speed up SegNet even further, you can run the BN-absorber.py script. It merges each batch normalization layer into the convolutional layer by modifying its weights and biases. In doing so, it is possible to accelerate SegNet by around 30%. Please find BN-absorber.py in the script folder.
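To illustrate the idea behind merging batch normalization into a convolution, here is a minimal sketch of the generic textbook identity, not the actual code of BN-absorber.py. It assumes a BN layer of the form gamma * (y - mean) / sqrt(var + eps) + beta applied per output channel; the function names, the epsilon value, and the toy numbers are all assumptions made for this example.

```python
import math

EPS = 1e-5  # assumed epsilon; the real layer may use a different value


def fold_bn(weights, biases, mean, var, gamma, beta, eps=EPS):
    """Return conv weights/biases that reproduce conv followed by BN.

    weights[o] is the flattened kernel of output channel o.
    """
    folded_w, folded_b = [], []
    for kernel, b, m, v, g, bt in zip(weights, biases, mean, var, gamma, beta):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in kernel])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


def conv(weights, biases, patch):
    """Convolution output at one spatial position (one flattened input patch)."""
    return [sum(w * x for w, x in zip(kernel, patch)) + b
            for kernel, b in zip(weights, biases)]


def batch_norm(values, mean, var, gamma, beta, eps=EPS):
    """Per-channel batch normalization with fixed (inference-time) statistics."""
    return [g * (y - m) / math.sqrt(v + eps) + bt
            for y, m, v, g, bt in zip(values, mean, var, gamma, beta)]


# Toy numbers: two output channels, three inputs per kernel.
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
biases = [0.1, -0.2]
mean, var = [0.3, -0.1], [1.2, 0.8]
gamma, beta = [1.1, 0.9], [0.05, -0.3]
patch = [1.0, 2.0, -1.0]

reference = batch_norm(conv(weights, biases, patch), mean, var, gamma, beta)
folded_w, folded_b = fold_bn(weights, biases, mean, var, gamma, beta)
merged = conv(folded_w, folded_b, patch)
print(reference)
print(merged)
```

The two printed lists should agree up to floating-point rounding, which is why the BN layer can be dropped at inference time once its statistics are fixed.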

TimoSaemann / caffe-segnet-cudnn5