Closed: neilnelson closed this issue 8 years ago
This seems a little odd in that the 22,144,025 possibly lost bytes are related to CUDA libs out of caffe::CuDNNConvolutionLayer. That is, if the GPU is not being used in a CPU run, the CUDA libs would not be expected to be used as shown.
This is probably related to #3953
Update
This page http://stackoverflow.com/questions/20593450/valgrind-and-cuda-are-reported-leaks-real notes that valgrind reports false memory leaks for CUDA.
The ending mnist output for cuda-memcheck --leak-check full is
```
========= Program hit cudaErrorCudartUnloading (error 29) due to "driver shutting down" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2f31b3]
=========     Host Frame:/usr/local/cuda-7.0/lib64/libcurand.so.7.0 [0xa0f76]
=========     Host Frame:/usr/local/cuda-7.0/lib64/libcurand.so.7.0 (curandDestroyGenerator + 0x254) [0x234f4]
=========     Host Frame:/mnt/Documents/caffe-master/.build_debug/tools/../lib/libcaffe.so.1.0.0-rc3 (_ZN5caffe5CaffeD2Ev + 0x100) [0x32d0ea]
=========     Host Frame:/mnt/Documents/caffe-master/.build_debug/tools/../lib/libcaffe.so.1.0.0-rc3 (_ZN5boost19thread_specific_ptrIN5caffe5CaffeEE11delete_dataclEPv + 0x22) [0x32fa86]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0 (_ZN5boost6detail12set_tss_dataEPKvNS_10shared_ptrINS0_20tss_cleanup_functionEEEPvb + 0x41) [0xd021]
=========     Host Frame:/mnt/Documents/caffe-master/.build_debug/tools/../lib/libcaffe.so.1.0.0-rc3 (_ZN5boost19thread_specific_ptrIN5caffe5CaffeEED1Ev + 0x33) [0x32fac9]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__cxa_finalize + 0x9a) [0x3c5ea]
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 29 errors
```
giving 0 bytes leaked.
The linked page also gives a valgrind correction for CUDA (a suppression file), with the ending mnist output
```
==27444== LEAK SUMMARY:
==27444==    definitely lost: 0 bytes in 0 blocks
==27444==    indirectly lost: 0 bytes in 0 blocks
==27444==      possibly lost: 86,935 bytes in 1,907 blocks
==27444==    still reachable: 554,312 bytes in 6,338 blocks
==27444==         suppressed: 85,794,483 bytes in 98,237 blocks
```
The 'possibly lost' is quite low here.
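For reference, the correction is essentially a suppression file for the CUDA driver and runtime allocations. A rough sketch of that workflow, using the debug-build caffe binary and solver from this thread (file names such as cuda.supp are illustrative; see the linked page for the exact recipe):

```bash
# Generate candidate suppression stanzas along with the normal leak report.
valgrind --leak-check=full --gen-suppressions=all --log-file=valgrind_raw.log \
    /home/nnelson/Documents/caffe-master/.build_debug/tools/caffe train \
    --solver=/voxforge/runs/test/solver.prototxt

# Copy the stanzas that mention libcuda/libcudnn into cuda.supp, then re-run
# with them suppressed so the summary reflects Caffe itself.
valgrind --leak-check=full --suppressions=cuda.supp \
    /home/nnelson/Documents/caffe-master/.build_debug/tools/caffe train \
    --solver=/voxforge/runs/test/solver.prototxt
```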
At this point the issue is that I can document a memory leak of around 60 megabytes per mnist run by logging the memory stats every few seconds while repeating the mnist run until the computer runs out of memory and starts to freeze. Of course you only need to repeat the mnist run a dozen or so times while logging the computer's memory to see the effect and get a memory leak estimate.
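A minimal sketch of that repeated-run test, assuming the caffe binary and solver paths shown later in this thread (adjust to your setup):

```bash
#!/bin/bash
# Repeat the mnist run and snapshot free after each one; the leak shows up as a
# steady rise in the "-/+ buffers/cache" used figure between runs.
CAFFE=/home/nnelson/Documents/caffe-master/.build_debug/tools/caffe
SOLVER=/voxforge/runs/test/solver.prototxt

free -k                          # memory before the sequence
for i in $(seq 1 12); do
  "$CAFFE" train --solver="$SOLVER" &> "/voxforge/runs/test/run_${i}.log"
  sleep 15                       # let the used figure settle between runs
  free -k                        # memory after each run
done
```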
Update
This page https://github.com/neilnelson/misc/blob/master/cuda_mnist_mem_used.png has a chart of the memory used for a single CUDA mnist run. 'free' was recorded about every 0.01 seconds throughout the run (5,000 observations over about 50 seconds). The charted memory-used value is the 'used' figure to the right of '-/+ buffers/cache:', less the amount used at each observation by the RAM disk containing the logging.
Seconds is zeroed at the first time reported in the Caffe run log. The Caffe run ends on the chart at the large drop in memory-used toward the right of the chart.
The memory-used shelf to the right after the Caffe run less the beginning memory-used is just over 60 megabytes.
The majority of memory is allocated in the first second that ends just before the second (and last) occurrence of 'Setting up ip1'. Given the size of the memory-used drop at the end of the run and that the majority of the memory allocation occurs in the first second, this argues (not certainly but as a reasonable direction) that the allocation resulting in the memory leak occurs in the first second.
Could you share the command you are using to reproduce the bug?
This is an mnist run using Caffe compiled for debug. I noticed the memory leak using the standard Caffe install without debug. caffe-master has a README.md of Mar 9 13:42.
/home/nnelson/Documents/caffe-master/.build_debug/tools/caffe train --solver=/voxforge/runs/test/solver.prototxt &> /voxforge/runs/test/cuda_speed_run5.log
/voxforge is a 1 GB RAM disk to which all the input is copied before the run and where the output is placed.
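For completeness, one way such a RAM disk can be set up (the exact command I used is not shown in this thread, so treat this as illustrative):

```bash
sudo mkdir -p /voxforge
sudo mount -t tmpfs -o size=1g tmpfs /voxforge
# Copy the mnist data, the train/test prototxts, and the solver onto it before the run.
```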
/voxforge/runs/test/solver.prototxt contains
```
# The train/test solver.prototxt net protocol buffer definition
net: "/voxforge/mnist/train_val.prototxt"
train_net: "/voxforge/caffe/train.prototxt"
test_net: "/voxforge/caffe/test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 25 training iterations.
test_interval: 25
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 10 iterations
display: 10
# The maximum number of iterations
max_iter: 200
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "/voxforge/mnist/lenet"
# solver_mode: CPU or GPU
```
I see that the pound signs have made the text large and bold.
Use ``` before and after.
And you should share your prototxt on a GitHub repo, ideally with a script to run the test.
A detailed procedure to do an easy version of a memory leak test is at https://github.com/neilnelson/caffe_memory_test.
Even if there is a memory leak in the code, memory should be reclaimed when the process exits: http://stackoverflow.com/questions/2975831/is-leaked-memory-freed-up-when-the-program-exits
So it seems you have a deeper problem; you should try to update your kernel and your drivers if possible.
I am running Ubuntu 14.04 and everything is kept up to date. I see that Xubuntu/Ubuntu 16.04 is out and I can upgrade to that.
Running the memory test at https://github.com/neilnelson/caffe_memory_test seems like an easy thing to do. Either your equipment and software will show no leak, or what I have been writing will be confirmed. The test here took just over 9 minutes.
```
Ubuntu 14.04.4 LTS
kernel 3.13.0-85-generic
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4
GeForce GT 720
Cuda compilation tools, release 7.0, V7.0.27
libatlas-base-dev 3.10.1-4
libblas-dev Version: 1.2.20110419-7
libboost1.55-all-dev
protobuf-compiler 2.5.0-9ubuntu1
python-protobuf 2.5.0-9ubuntu1
libgoogle-glog-dev 0.3.3-1
python-gflags 1.5.1-1build1
libhdf5-dev 1.8.11-5ubuntu7
libcudnn.so.4.0.4
liblmdb-dev 0.9.16-1~ubuntu14.04.1
libleveldb-dev 1.15.0-2
libsnappy-dev 1.1.0-1ubuntu1
libopencv-dev 2.4.8+dfsg1-2ubuntu1
python2.7 2.7.6-8ubuntu0.2
python3 3.4.0-0ubuntu2
python-numpy 1:1.8.2-0ubuntu0.1
python3-numpy 1:1.8.2-0ubuntu0.1
```
I tried, and I had no leak. You didn't mention the version of your NVIDIA driver.
Thank you for checking this out. To help me along, would you please post your mem_test.log from the test?
Here is what mine looks like.
```
             total       used       free     shared    buffers     cached
Mem:      16421400    3231908   13189492      19896     151992    1452364
-/+ buffers/cache:    1627552   14793848
Swap:     32767996          0   32767996
1461465313 1461465331 1461465349 1461465367 1461465386 1461465404 1461465422 1461465440
1461465458 1461465476 1461465494 1461465513 1461465531 1461465549 1461465567 1461465585
1461465603 1461465621 1461465639 1461465657 1461465676 1461465694 1461465712 1461465730
1461465748 1461465766 1461465784 1461465802 1461465820 1461465839 1461465857
             total       used       free     shared    buffers     cached
Mem:      16421400    4953212   11468188      20420     153432    1457144
-/+ buffers/cache:    3342636   13078764
Swap:     32767996          0   32767996
```
This github page provides detailed instructions to obtain the single run Caffe mnist memory usage chart, a procedure mentioned in a prior post above.
I am doing this because I think Caffe is likely the best neural network software at the moment, and I am trying to correct an issue that hung my computer in a short period during a small inquiry I was performing.
Having detailed procedures giving clear output that may be repeated by anyone is basic to community progress. I hope to see output from others.
Here is the NVIDIA driver version.
```
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov 7 21:25:42 PST 2015
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.1)
```
I read through http://stackoverflow.com/questions/2975831/is-leaked-memory-freed-up-when-the-program-exits and see that the memory should be freed, and the adjusted valgrind and cuda-memcheck above do not show any significant leaks. The tests I am providing here show that not all memory is freed.
@neilnelson,
Thanks for writing the test tools. We're also suspecting a memory leak in Caffe or one of its dependencies. I ran your test and got a similarly shaped curve for the memory usage and approx. 60 MB not freed up after the process terminates. We have several users running experiments on the same server and it only takes a few days for the unclaimed RAM to make it very difficult to run anything on the machine without a system reboot.
We're running the latest on BVLC:master (commit 923e7e8b6337f610115ae28859408bc392d13136, Wed May 25 18:11:15 2016 -0700), in GPU mode, with CUDA 7.5.
```
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  361.42  Tue Mar 22 18:10:58 PDT 2016
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.1)
```
We have another machine running a much older stack, off of NVIDIA driver 340.96 and CUDA 6.5. It doesn't look like that one is affected by the leak, but we can't tell for sure.
Any ideas on where the leak could be coming from?
Thanks
@kashefy,
I have been doing some Caffe memory usage tests these last two days and am currently installing the latest Cuda and Caffe software to avoid outdated software issues that might affect the result.
One idea that came to mind that might narrow down the search for the source of the memory leak is to start by seeing whether the `make runtest` part of the installation sequence uses up memory. I suggest stacking ten or more of those runs in a bash file, much like my first memory test using an mnist run, and using free at the beginning and end to show how much memory, if any, was used.
If the average usage per run was significant, then there should be a way to run free at the beginning of runtest and after each individual test. Since there are over 1,000 of these tests, with a little luck we may get fairly close to the problem area.
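A rough sketch of what that per-test loop might look like, enumerating the gtest cases in the test binary and recording the '-/+ buffers/cache' used figure around each one (the path and the parsing are illustrative, not tooling from my repo):

```bash
#!/bin/bash
# Run each gtest case on its own and log how much the "used" figure grew.
TESTBIN=./build/test/test.testbin

used_kb() { free -k | awk '/buffers\/cache/ {print $3}'; }

"$TESTBIN" --gtest_list_tests | sed 's/ *#.*//' | while read -r line; do
  case "$line" in
    "") ;;                                   # skip blank lines
    *.) suite="$line" ;;                     # suite line, e.g. "Im2colLayerTest/2."
    *)  before=$(used_kb)
        "$TESTBIN" --gtest_filter="${suite}${line}" > /dev/null 2>&1 < /dev/null
        sleep 15                             # let the used figure settle
        after=$(used_kb)
        echo "${suite}${line} $((after - before)) kB" ;;
  esac
done | tee per_test_mem.log
```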
I have seen that, after a test run, the `-/+ buffers/cache` figure from free tends to settle a bit over about 15 seconds. Another item: `test/get_memory_usage.php`, used in the second test, quickly ramps up to around 4 megabytes of memory use, which closely accounts for the different estimates at the end of the pages for my two tests. This 4 megabytes is fairly flat after the initial ramp-up and goes away after that PHP program ends.
@neilnelson, I think your idea to run the memory check around the individual unit tests is great to systematically isolate the leak. But as you pointed out, there are > 1,000 tests, so there'll be a bit of an effort in setting this up.
I ran some more tests; I've uploaded all the data logged in mem_used.csv to this sheet:
I don't think this has to do with LMDB. I don't have anything quantitative for this, but I've had the same memory problem using HDF5 data. The amount of memory leaked seems independent of the number of iterations.
Preliminary: I'm also seeing this with tensorflow, but not as consistently as with Caffe.
I monitored a training run with `test/get_memory_usage.php` and found that approx. 250 MB of RAM were not recovered after the training finished. I'm inclined to think this is a problem with the driver or CUDA toolkit, but I'm not sure how to go about it. Maybe monitoring memory around each Caffe unit test is the way to go.
@neilnelson,
Re: monitoring memory usage before and after Caffe unit tests:
The tests can be executed either through `cd ./build && make runtest` or `./build/test/test.testbin`. The former includes a cmake command and building the tests, so I recommend building once and then running the executable directly.
Just from watching the memory row in `htop`, I could see both leaving 70-80 MB of RAM in use after execution.
We might not have to do such an exhaustive search after all. I limited the tests to those of the Im2colLayerTest test (8 tests) with
./build/test/test.testbin --gtest_filter=*Im2colLayer*/*2*.*
Changing the 2 at the end to 1 switches the tests' TypeParam from caffe::GPUDevice to caffe::CPUDevice, so we can narrow it down to even fewer tests. But if both CPU and GPU tests are leaking, it doesn't make sense to suspect the leak is happening in CUDA or the NVIDIA driver, right?
Narrowing it down to a single test:
./build/test/test.testbin --gtest_filter=Im2colLayer/1.Setup # for CPU
and
./build/test/test.testbin --gtest_filter=Im2colLayer/2.Setup # for GPU
In `htop`, memory usage after test execution is still approx. 70 MB higher than before the test run.
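The same before/after numbers can also be captured from the shell instead of htop, for example (the filter is the one quoted above):

```bash
free -k | awk '/buffers\/cache/ {print "before:", $3, "kB used"}'
./build/test/test.testbin --gtest_filter='Im2colLayer/1.Setup' > /dev/null 2>&1
sleep 15   # give the used figure time to settle
free -k | awk '/buffers\/cache/ {print "after: ", $3, "kB used"}'
```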
@kashefy,
Thank you for your great progress on the memory leak issue. In order to remove execution of all the CUDA code, Caffe has to be recompiled with the following lines in Makefile.config:
```
# USE_CUDNN := 1
CPU_ONLY := 1
```
fix42 commented above on this issue.
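If you go that route, a typical rebuild sequence for the Make-based install is sketched below (with the CMake build you would reconfigure with -DCPU_ONLY=ON instead); adjust the job count to your machine:

```bash
make clean
make all -j"$(nproc)"
make test -j"$(nproc)"
make runtest
```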
My guess is that this program contains the test you have identified.
caffe-master/src/caffe/test/test_im2col_layer.cpp
The syntax in your post
./build/test/test.testbin --gtest_filter=Im2colLayer/2.Setup # for GPU
points toward the following code in that program. You may have a better idea.
```cpp
template <typename TypeParam>
class Im2colLayerTest : public MultiDeviceTest<TypeParam> {
  typedef typename TypeParam::Dtype Dtype;

 protected:
  Im2colLayerTest()
      : blob_bottom_(new Blob<Dtype>(2, 3, 6, 5)),
        blob_top_(new Blob<Dtype>()) {
    // fill the values
    Caffe::set_random_seed(1701);
    FillerParameter filler_param;
    GaussianFiller<Dtype> filler(filler_param);
    filler.Fill(this->blob_bottom_);
    blob_bottom_vec_.push_back(blob_bottom_);
    blob_top_vec_.push_back(blob_top_);
  }
  virtual ~Im2colLayerTest() { delete blob_bottom_; delete blob_top_; }

  Blob<Dtype>* const blob_bottom_;
  Blob<Dtype>* const blob_top_;
  vector<Blob<Dtype>*> blob_bottom_vec_;
  vector<Blob<Dtype>*> blob_top_vec_;
};

TYPED_TEST_CASE(Im2colLayerTest, TestDtypesAndDevices);

TYPED_TEST(Im2colLayerTest, TestSetup) {
  typedef typename TypeParam::Dtype Dtype;
  LayerParameter layer_param;
  ConvolutionParameter* convolution_param =
      layer_param.mutable_convolution_param();
  vector<int> bottom_shape;
  bottom_shape.push_back(2);
  bottom_shape.push_back(3);
  bottom_shape.push_back(10);
  bottom_shape.push_back(11);
  this->blob_bottom_->Reshape(bottom_shape);
  convolution_param->add_kernel_size(3);
  convolution_param->add_stride(2);
  convolution_param->add_dilation(3);
  Im2colLayer<Dtype> layer(layer_param);
  layer.SetUp(this->blob_bottom_vec_, this->blob_top_vec_);
  EXPECT_EQ(this->blob_top_->num(), 2);
  EXPECT_EQ(this->blob_top_->channels(), 27);
  EXPECT_EQ(this->blob_top_->height(), 2);
  EXPECT_EQ(this->blob_top_->width(), 3);
}
```
Given that just this test can be run on its own, the code here can be progressively truncated from the end (keep a backup copy), recompiled, and rerun until the memory leak disappears; at that point the last line or lines removed point closer to the cause.
`Im2colLayer` comes from caffe-master/src/caffe/layers/im2col_layer.cpp; also note the includes at the top of that file.
I need to make headway on another project at the moment but am looking forward to your excellent progress.
@neilnelson, could you reopen the issue, just to make it clear that the problem hasn't been resolved yet? I'm looking for a simpler test that reproduces the leak, and will keep you posted on what I find.
@kashefy, The issue is now open. I closed it by mistake.
This post suggests the leak is coming from the NVIDIA drivers: http://superuser.com/questions/1062929/freeing-kernel-memory-leaked-by-nvidia-driver. The Latest Long Lived Branch version on NVIDIA's website for Linux 64-bit is currently 361.45.11, which doesn't fix the leak. If you switch to the Latest Short Lived Branch version on NVIDIA's page http://www.nvidia.com/object/unix.html, you can get something a bit more recent. First runs show that the leak is gone. I will keep you posted if anything changes, and I'm not sure how safe it is to use a driver outside of the long lived branch. But for now: Hurray!
Thanks for the good news!
After running several iterations of the memory leak test here, the conclusion is that, given the inherent degree of error in the test, the hypothesis that the Caffe installation with the software versions given below leaks memory is not supported.
NVRM version: NVIDIA UNIX x86_64 Kernel Module 364.19 Tue Apr 19 14:44:55 PDT 2016 http://www.nvidia.com/object/unix.html
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) Ubuntu 14.04.4 LTS
Caffe 2016 May 30 18:04 http://caffe.berkeleyvision.org/install_apt.html
NVIDA CUDA 7.5 Tool Kit - cuda_7.5.18_linux.run https://developer.nvidia.com/cuda-downloads
Many thanks to @kashefy and NVIDIA for identifying a solution.
I posted this on the users group and a responder said it may be useful here.
valgrind will complete if leveldb is used instead of lmdb.
Compiled with GPU and run with GPU
```
==4315== LEAK SUMMARY:
==4315==    definitely lost: 96 bytes in 4 blocks
==4315==    indirectly lost: 0 bytes in 0 blocks
==4315==      possibly lost: 70,778,113 bytes in 20,095 blocks
==4315==    still reachable: 114,126,748 bytes in 153,749 blocks
==4315==         suppressed: 0 bytes in 0 blocks
```
Compiled with GPU and run as CPU
```
==5232== LEAK SUMMARY:
==5232==    definitely lost: 96 bytes in 4 blocks
==5232==    indirectly lost: 0 bytes in 0 blocks
==5232==      possibly lost: 70,775,503 bytes in 20,075 blocks
==5232==    still reachable: 114,116,564 bytes in 153,675 blocks
==5232==         suppressed: 0 bytes in 0 blocks
```
Compiled as CPU only
```
==12882== LEAK SUMMARY:
==12882==    definitely lost: 0 bytes in 0 blocks
==12882==    indirectly lost: 0 bytes in 0 blocks
==12882==      possibly lost: 86,909 bytes in 1,906 blocks
==12882==    still reachable: 552,984 bytes in 6,332 blocks
==12882==         suppressed: 0 bytes in 0 blocks
```
This valgrind page http://valgrind.org/docs/manual/faq.html#faq.deflost gives the following for 'possibly lost'.
"possibly lost" means your program is leaking memory, unless you're doing unusual things with pointers that could cause them to point into the middle of an allocated block; see the user manual for some possible causes.
The mnist GPU run was repeated 273 times (just under 100 minutes), at which point the computer was freezing up from lack of memory. 'cat /proc/meminfo' was used to collect memory stats every 10 seconds during the sequence. Charts of the 42 stats in /proc/meminfo are combined onto a single page at https://github.com/neilnelson/misc/blob/master/meminfo.png. (Right click, select View Image, then click on the image to expand.)
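The 10-second collection amounts to a loop like the following (a sketch; the log path is illustrative):

```bash
# Append a timestamp and the full /proc/meminfo snapshot every 10 seconds.
while true; do
  { date +%s; cat /proc/meminfo; } >> "$HOME/meminfo.log"
  sleep 10
done
```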
The interesting charts:
- MemFree declines to about zero (175372 kB) at around run 219.
- Buffers stays the same until MemFree gets to zero and then declines to zero.
- Cached is similar to Buffers.
- SwapFree declines (an increase in swap usage) when MemFree gets to zero.
Just before LEAK SUMMARY in the 'compiled with GPU and run as CPU' run instance is the following 'possibly lost' section.
```
==5232== 22,144,025 bytes in 2,207 blocks are possibly lost in loss record 2,543 of 2,544
==5232==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5232==    by 0x28E90F1D: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.352.63)
==5232==    by 0x28E4134C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.352.63)
==5232==    by 0x28E5211F: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.352.63)
==5232==    by 0x28F3FCCF: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.352.63)
==5232==    by 0x28F3FFDF: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.352.63)
==5232==    by 0xE43FD2C: ??? (in /usr/local/cuda-7.0/targets/x86_64-linux/lib/libcudnn.so.4)
==5232==    by 0xE432AAF: ??? (in /usr/local/cuda-7.0/targets/x86_64-linux/lib/libcudnn.so.4)
==5232==    by 0xE43EDB6: ??? (in /usr/local/cuda-7.0/targets/x86_64-linux/lib/libcudnn.so.4)
==5232==    by 0xE443570: ??? (in /usr/local/cuda-7.0/targets/x86_64-linux/lib/libcudnn.so.4)
==5232==    by 0xE4371DB: ??? (in /usr/local/cuda-7.0/targets/x86_64-linux/lib/libcudnn.so.4)
==5232==    by 0xE4256A1: ??? (in /usr/local/cuda-7.0/targets/x86_64-linux/lib/libcudnn.so.4)
==5232==    by 0xE458C9E: ??? (in /usr/local/cuda-7.0/targets/x86_64-linux/lib/libcudnn.so.4)
==5232==    by 0xE146C11: cudnnCreate (in /usr/local/cuda-7.0/targets/x86_64-linux/lib/libcudnn.so.4)
==5232==    by 0x51CE276: caffe::CuDNNConvolutionLayer<float>::LayerSetUp(std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&, std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&) (cudnn_conv_layer.cpp:53)
==5232==    by 0x518EC4B: caffe::Layer<float>::SetUp(std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&, std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&) (layer.hpp:71)
==5232==    by 0x51946E0: caffe::Net<float>::Init(caffe::NetParameter const&) (net.cpp:139)
==5232==    by 0x5192A76: caffe::Net<float>::Net(caffe::NetParameter const&, caffe::Net<float> const*) (net.cpp:27)
==5232==    by 0x516BB72: caffe::Solver<float>::InitTrainNet() (solver.cpp:105)
==5232==    by 0x516B395: caffe::Solver<float>::Init(caffe::SolverParameter const&) (solver.cpp:57)
```
This seems a little odd in that the 22,144,025 possibly lost bytes are related to CUDA libs out of caffe::CuDNNConvolutionLayer. That is, if the GPU is not being used in a CPU run, the CUDA libs would not be expected to be used as shown.
The 'possibly lost' figure of the 'Compiled as CPU only' LEAK SUMMARY is insignificant when compared to the figures from the Cuda compiled code.
The memory decline over the first 192 Cuda mnist runs averages to 59.8 megabytes per run. valgrind shows 67.5 megabytes 'possibly lost' for a single run.