bonsairobo opened this issue 8 years ago:

I'm trying to write Neural Style in MXNet.jl, and I keep running out of memory when I try to make new executors (and delete the old ones). My basic strategy is to store the executor in an `exec` variable and do `exec = 0; gc()` when I want to reclaim GPU memory for that executor. This does not work as expected: I am tracking CUDA memory usage with `nvidia-smi`, and there is never a drop in memory usage after calling `gc()`. Does anyone know of a way to reclaim GPU memory? Here is my code for reference: https://github.com/bonsairobo/mxnet-neural-style/blob/master/stylenet.jl
GC is really unpredictable; I guess the generational GC is retaining some of the objects because they are still young? Maybe you can try to explicitly call the destructor, like `mx.delete!(exec.handle)`.
How do I import `mx.delete!`? It seems like a private API.
You probably cannot call it directly. How about calling `finalize(exec.handle)`?
See #84
I tried

```julia
mx.finalize(x.handle)
x = 0
gc()
```

and the GPU memory is still allocated.
MXNet has its own internal memory pool that retains memory for future arrays, because CUDA allocation is slow. So the memory goes back to the pool but is not freed to NVIDIA's runtime.
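(To make the pooling behavior concrete, here is a minimal Julia sketch of a size-bucketed pool. The names are hypothetical and host `malloc`/`free` stand in for `cudaMalloc`/`cudaFree`; this is not MXNet's actual implementation.)

```julia
# Hypothetical sketch of a size-bucketed memory pool, not MXNet's real code.
alloc_device(size) = Libc.malloc(size)   # stand-in for cudaMalloc
release_device(ptr) = Libc.free(ptr)     # stand-in for cudaFree (never called below)

const free_lists = Dict{Int,Vector{Ptr{Cvoid}}}()  # chunk size => idle chunks

function pool_alloc(size::Int)
    idle = get!(free_lists, size, Ptr{Cvoid}[])
    # Reuse an idle chunk of the same size when possible; only hit the
    # (slow) device allocator when the pool has nothing to offer.
    return isempty(idle) ? alloc_device(size) : pop!(idle)
end

function pool_free(ptr::Ptr{Cvoid}, size::Int)
    # Recycle the chunk into the pool instead of calling release_device:
    # the driver (and nvidia-smi) still reports this memory as allocated.
    push!(get!(free_lists, size, Ptr{Cvoid}[]), ptr)
end
```

This is why `nvidia-smi` shows no drop after `gc()`: the finalizer returns the chunk to MXNet's pool, not to the CUDA runtime.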
Oh, that helps my understanding! What is the policy for reusing memory in the pool? E.g., what if I finalize a chunk of memory and then ask for a larger chunk: would the older chunk be reused? Would the pool ever return memory to CUDA in order to allocate a larger contiguous chunk?
The reason I ask is that I am trying to create two executors of the same network for two different input sizes. I know I have enough memory to support either input size separately, but I cannot figure out how to allocate both of them, at mutually exclusive times, in my code.
Sorry if this is a lot of questions. I can also take a look at the mxnet engine code if it is easily comprehensible to a non-DMLC member.
There are two factors in executor memory consumption: the memory holding the network parameters, and the internal buffers for intermediate results.
If you are using two executors exclusively, there is support for memory sharing between executors, e.g., the bucketing API, which is currently supported in Python. You can bind the executor with the larger input size, and share its memory with the smaller executor in that setting.
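(As a plain-Julia illustration of why binding the larger executor first works: one buffer sized for the biggest shape can back both shapes, as long as they are used at mutually exclusive times. No MXNet calls here; the shapes are made up.)

```julia
# Allocate once for the LARGEST input; smaller work reuses a prefix of it.
const big = Vector{Float32}(undef, 512 * 512 * 3)

# View for the large input: the whole buffer.
large_view = reshape(big, 512, 512, 3)

# View for the small input: a prefix of the same memory, so no second
# allocation is needed. Only safe if the two are never live at once.
small_view = reshape(view(big, 1:256*256*3), 256, 256, 3)
```

The bucketing API applies the same idea to an executor's internal buffers.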
That's good to know about the Python memory sharing. I'm going to stick with the Julia API for now.
I cannot seem to reuse an old (no longer needed) executor's GPU memory for a new executor, even after finalizing the handles. I think a simple API to explicitly free GPU memory would be very helpful (even if less performant) in some scenarios.
For now, I am going to make all input data the same size by resizing. This may have adverse effects on the results, but they will likely be negligible.
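(A minimal sketch of that workaround, assuming Images.jl and FileIO.jl are available; `FIXED_SHAPE` and `load_fixed` are made-up names, and the shape is arbitrary.)

```julia
using Images, FileIO  # `load` reads the image, `imresize` rescales it

const FIXED_SHAPE = (256, 256)  # one input shape for the entire run

# Force every image to FIXED_SHAPE so a single executor, bound once for
# that shape, can be reused for all inputs. May distort aspect ratios.
load_fixed(path) = imresize(load(path), FIXED_SHAPE)
```

With every input at one shape, only one executor ever needs to be bound, so the pool never has to satisfy a larger request than it has already seen.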