ShaoqingRen / faster_rcnn

Faster R-CNN

Error when rerunning script #12

Open varun-nagaraja opened 9 years ago

varun-nagaraja commented 9 years ago

When I run script_faster_rcnn_demo for the first time after starting MATLAB, things work fine. But if I re-run the same script after the first run, I get the following error:

fast_rcnn startup done
GPU 1: free memory 11945885696
GPU 2: free memory 813449216
Use GPU 1
[libprotobuf ERROR google/protobuf/descriptor_database.cc:57] File already exists in database: caffe.proto
[libprotobuf FATAL google/protobuf/descriptor.cc:954] CHECK failed: generated_database_->Add(encoded_file_descriptor, size):
Caught "std::exception" Exception message is:
CHECK failed: generated_database_->Add(encoded_file_descriptor, size):

KapSteR commented 9 years ago

I get a similar error. Matlab simply shuts down when re-running the matlab demo. Often a reboot is required to get it to run again.

varun-nagaraja commented 9 years ago

Yup, reboot is the only way for me to get it working again.

ShaoqingRen commented 9 years ago

@varun-nagaraja @KapSteR

I can't reproduce this bug on Windows. Ross also hasn't reported this bug on Ubuntu.

At the top of script_faster_rcnn_demo we clear the caffe mex (mexLock() is commented out), so caffe should not throw any error on the second call.

I think we should make sure that the mex is cleared on your machine as expected.

rbgirshick commented 9 years ago

I can reproduce the error on Linux. It's low priority since it just affects the demo script and not training or testing. To clarify comments in the thread: a "reboot" of the computer is not required, just a restart of MATLAB.

KapSteR commented 9 years ago

So... It seems to me that there is somehow a GPU memory leak. The GPU memory usage grows linearly with every iteration of the main loop, until MATLAB crashes.

Is it wrong to assume that GPU memory usage should stay relatively constant across forward passes after "warm-up"?

kukuruza commented 9 years ago

So is it a problem that mex doesn't clean up after itself correctly after all?

1) For me on Linux, the free GPU memory before the second run (4205486080) is about 1 MB less than before the first run (4206583808). That does look like a leak.
2) I also get a protobuf issue on the second run (Linux):

fast_rcnn startup done
GPU 1: free memory 4205486080
Use GPU 1

[libprotobuf ERROR google/protobuf/descriptor_database.cc:57] File already exists in database: caffe.proto
[libprotobuf FATAL google/protobuf/descriptor.cc:954] CHECK failed: generated_database_->Add(encoded_file_descriptor, size): 

------------------------------------------------------------------------
          std::terminate() detected at Mon Oct 12 13:07:26 2015
------------------------------------------------------------------------

Configuration:
  Crash Decoding      : Disabled
  Crash Mode          : continue (default)
  Current Graphics Driver: Unknown software 
  Current Visual      : None
  Default Encoding    : UTF-8
  GNU C Library       : 2.19 stable
  Host Name           : ip-172-31-21-65
  MATLAB Architecture : glnxa64
  MATLAB Root         : /usr/local/MATLAB/R2015a
  MATLAB Version      : 8.5.0.197613 (R2015a)
  OpenGL              : software
  Operating System    : Linux 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64
  Processor ID        : x86 Family 6 Model 45 Stepping 7, GenuineIntel
  Virtual Machine     : Java 1.7.0_60-b19 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
  Window System       : No active display

Fault Count: 1

...
Stack Trace (captured):
[  0] 0x00007f53a6b6570e    /usr/local/MATLAB/R2015a/bin/glnxa64/libmwfl.so+00988942 _ZN2fl4diag5linux6x86_6412context_base12capture_dataEv+00000030
...
[ 12] 0x00007f52bd507c12 /usr/local/MATLAB/R2015a/bin/glnxa64/libprotobuf.so.8+00433170 _ZN6google8protobuf14DescriptorPool24InternalAddGeneratedFileEPKvi+00000194
[ 13] 0x00007f52bdc6c37c /home/ubuntu/src/faster_rcnn/external/caffe/matlab/+caffe/private/caffe_.mexa64+00443260
...
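
As a quick check of the "about 1 MB" figure in point 1 of this comment, the two free-memory values can simply be subtracted. This is only arithmetic on the two numbers quoted above; it says nothing about where the memory went, and the variable names are illustrative.

```python
# Free GPU memory in bytes, copied from the comment above.
free_before_first_run = 4206583808
free_before_second_run = 4205486080

# Memory that was not returned between the two runs.
lost_bytes = free_before_first_run - free_before_second_run
lost_mib = lost_bytes / 2**20  # 1 MiB = 2**20 bytes

print(f"{lost_bytes} bytes = {lost_mib:.3f} MiB")
```

The difference comes to a little over 1 MiB, which matches the rough estimate in the comment.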

BlueCrow1991 commented 9 years ago

This bug does not just affect the demo script, but also training and testing on Ubuntu.

When I re-run 'script_faster_rcnn_VOC2007_ZF.m', it happens too.

YingjieYin commented 8 years ago

When I run script_faster_rcnn_demo, I get the following error in caffe_log:

F1028 15:47:12.852134 2204 syncedmem.cpp:51] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure

fengyuxi55 commented 8 years ago

I can reproduce this problem. When I re-run script_faster_rcnn_demo.m, MATLAB crashes:

[libprotobuf ERROR google/protobuf/descriptor_database.cc:57] File already exists in database: caffe.proto
[libprotobuf FATAL google/protobuf/descriptor.cc:1018] CHECK failed: generated_database_->Add(encoded_file_descriptor, size):
Caught "std::exception" Exception message is:
CHECK failed: generated_database_->Add(encoded_file_descriptor, size):

corganhejijun commented 8 years ago

Is this the same problem as https://github.com/BVLC/caffe/issues/1917?

roytseng-tw commented 8 years ago

So how can I solve this problem? I don't really understand. Thanks.

gjyin commented 8 years ago

How can this problem be solved? I hit the bug on Ubuntu 14.04:

[libprotobuf ERROR google/protobuf/descriptor_database.cc:57] File already exists in database: caffe.proto
[libprotobuf FATAL google/protobuf/descriptor.cc:954] CHECK failed: generated_database_->Add(encoded_file_descriptor, size):

esason commented 8 years ago

I have solved the last issue ...

THE BUG: On Ubuntu 14.04:

[libprotobuf ERROR google/protobuf/descriptor_database.cc:57] File already exists in database: caffe.proto
[libprotobuf FATAL google/protobuf/descriptor.cc:954] CHECK failed: generated_database_->Add(encoded_file_descriptor, size):

The first time I run the training or testing phase, everything works fine. But if MATLAB is still open and I try to run it again, the bug occurs.

SOLUTION: It seems to be related to the clear mex issue.

It seems to work OK. I would like to know why clear mex is used at all.

ZiangYan commented 8 years ago

I encountered the same bug and solved it by re-compiling OpenCV without the dnn module. I found that caffe, protobuf, and opencv-dnn couldn't work together. It seems to be a bug in either protobuf or OpenCV.

There are two solutions:

  1. statically link to protobuf (i.e., link to protobuf.a, NOT protobuf.so)

OR

  2. remove opencv_contrib/modules/cnn, and re-compile opencv
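
To make the reasoning behind the two options more concrete, here is a toy model in Python of what the error message suggests is happening. It assumes that the "database" in the log is a per-library registry keyed by .proto file name, and that every binary containing generated code for caffe.proto tries to add that name when it is loaded. The class and function names are hypothetical and are not protobuf's real API; this is a sketch of the idea, not of the actual implementation.

```python
# Toy model only. GeneratedRegistry and load_binary are hypothetical
# names used for illustration; they are not part of protobuf.

class DuplicateFileError(RuntimeError):
    pass


class GeneratedRegistry:
    """Maps a .proto file name to whichever binary registered it."""

    def __init__(self):
        self._files = {}

    def add(self, filename, owner):
        # A second registration under the same file name is rejected.
        if filename in self._files:
            raise DuplicateFileError(
                f"File already exists in database: {filename}")
        self._files[filename] = owner


def load_binary(registry, owner):
    """Mimics the load-time registration; returns 'ok' or the error text."""
    try:
        registry.add("caffe.proto", owner)
        return "ok"
    except DuplicateFileError as exc:
        return str(exc)


# Shared case: one registry outlives both loads (assumed to correspond
# to a shared protobuf library that stays loaded in the process).
shared = GeneratedRegistry()
shared_first = load_binary(shared, "first load")
shared_second = load_binary(shared, "second load")

# Static case: each binary brings its own private registry.
static_first = load_binary(GeneratedRegistry(), "binary A")
static_second = load_binary(GeneratedRegistry(), "binary B")

print(shared_first, "|", shared_second)
print(static_first, "|", static_second)
```

In this model the shared registry rejects the second registration, while the private registries never see a duplicate. That is the intuition behind linking protobuf statically, and also behind removing the second library that carries its own copy of caffe.proto.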

hongkaiyu2012 commented 7 years ago

Problem solved: https://github.com/ShaoqingRen/faster_rcnn/issues/112#issuecomment-273279959