BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Caffe memory increases with time(iterations?) #1377

Closed lavania closed 7 years ago

lavania commented 10 years ago

So I have been trying to train the ImageNet example using Caffe. I am using CUDA 6.5 with cuDNN. But when I run the trainer as shown in http://caffe.berkeleyvision.org/gathered/examples/imagenet.html (with GLOG_log_dir set), I see that the memory usage of the caffe process (seen using top) keeps increasing with time (iterations?). top shows something like:

    SIZE   RES  SHR  STATE  TIME   CPU   COMMAND
    2285G  58G  57G  run    28:57  140%  caffe

This causes the machine to slow down to a crawl (I cannot type anything in the console). Any idea what might be causing this? I see it with both LevelDB and LMDB.

Note that the GPU memory usage remains fixed.

Regards

beniz commented 9 years ago

I believe this is the mmap call from LMDB and LevelDB, which will use as much cache as possible, but I might be wrong.

amiralush commented 9 years ago

I have also been experiencing this phenomenon when using the dev branch. I'm not sure it's related to LMDB's memory consumption, since in earlier versions I didn't encounter this when using LMDB. I don't yet fully understand it. @sguada, any suggestions?

sguada commented 9 years ago

@amiralush we need to redo #1238 to simplify its complexity and fix any memory overhead. https://github.com/BVLC/caffe/pull/1238#issuecomment-59633561 Currently we are in deadline mode and cannot do it, but PRs are welcome.

sguada commented 9 years ago

@lavania the memory usage is due to LMDB, since it tries to map the whole file into memory for faster access. That shouldn't happen with LevelDB, so you can try that instead, but you will need to regenerate the LevelDB from the data.

raingo commented 9 years ago

This LMDB fork: https://github.com/raingo/lmdb-fork/tree/mdb.master solves the problem (not fully tested, and it won't be updated with upstream).

bwang0 commented 9 years ago

Try setting /proc/sys/vm/swappiness to zero. LMDB will still use lots of memory as page cache, but it will be efficient as intended, and the machine won't slow down to a crawl. In our case, the slowdown happens because the kernel swaps too aggressively, resulting in thrashing, and everything on the system waits on hard-drive IO to complete. After swappiness is set to zero, we see no more freezes or slowdowns of the whole system.

Based on this answer from the main author of LMDB, Howard Chu. http://www.openldap.org/lists/openldap-technical/201503/msg00077.html

If you don't want to change the swap behavior for the whole system, look into cgroups, which let you tune memory usage and caching behavior for individual processes. Hope this helps!

raingo commented 9 years ago

I don't think the lmdb-fork degrades efficiency. The only thing it does is notify the OS that it can release the page cache read in the last iteration. We have already used it to train many models. I guess the LMDB author's concern about MAP_PRIVATE is irrelevant in the sequential-read case.

Have you tried the swappiness=0 solution with multiple training processes on a multi-GPU machine? Our first attempt was to set swappiness, but that failed again with multiple training processes, so we ended up with the hacky solution.

Hope that helps.

immars commented 9 years ago

@sguada in my tests, memory usage does grow when using LevelDB, even during the rand_skip phase of the DataLayer.

If Caffe does not need random access, which seems to be true now, I think it would be better to use plain files instead of these databases.

IMHO it's a poor use of RAM to cache a sequential read, whether via mmap or programmatic caching, unless you have enough memory to fit the entire dataset.

In #2193 I tried training GoogLeNet for 12 hours and RAM usage did not grow above 1 GB.

woozzu commented 9 years ago

On Windows, the LMDB problem can be solved by adding the following lines at the top of the LMDBCursor::Seek method. This releases memory-mapped pages that are no longer needed after the current seek:

    // Release the pages backing the previously fetched value so the
    // process working set does not grow with every seek.
    if (op != MDB_FIRST)
        VirtualUnlock(mdb_value_.mv_data, mdb_value_.mv_size);

hanchaow commented 6 years ago

@woozzu How about the linux version of caffe?

iamcoming233 commented 5 years ago

@woozzu How about the linux version of caffe?

Maybe munlock(data.mv_data, data.mv_size)?