bonsairobo opened this issue 8 years ago:

I'm trying to write Neural Style in MXNet.jl, and I keep running out of memory when I try to make new executors (and delete the old ones). My basic strategy is to store the executor in an `exec` variable and do `exec = 0; gc()` when I want to reclaim GPU memory for that executor. This does not work as expected: I am tracking CUDA memory usage with `nvidia-smi`, and there is never a drop in memory usage after calling `gc()`. Does anyone know of a way to reclaim GPU memory? Here is my code for reference: https://github.com/bonsairobo/mxnet-neural-style/blob/master/stylenet.jl
GC is really unpredictable; I guess the generational GC is retaining some of the objects because they are still young? Maybe you can try to explicitly call the destructor, like `mx.delete!(exec.handle)`.
How do I import `mx.delete!`? It seems like a private API.
You probably cannot call it directly. How about calling `finalize(exec.handle)`?
See #84
I tried

```julia
mx.finalize(x.handle)
x = 0
gc()
```

and the GPU memory is still allocated.
MXNet has its own internal memory pool that retains memory for future arrays, because CUDA allocation is slow. So the memory goes back to the pool but is not freed to NVIDIA's runtime.
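(To make the pooling behavior concrete, here is a minimal Julia sketch of a size-bucketed pool. The names are hypothetical and host `malloc`/`free` stand in for `cudaMalloc`/`cudaFree`; this is not MXNet's actual implementation.)

```julia
# Hypothetical sketch of a size-bucketed memory pool, not MXNet's real code.
alloc_device(size) = Libc.malloc(size)   # stand-in for cudaMalloc
release_device(ptr) = Libc.free(ptr)     # stand-in for cudaFree (never called below)

const free_lists = Dict{Int,Vector{Ptr{Cvoid}}}()  # chunk size => idle chunks

function pool_alloc(size::Int)
    idle = get!(free_lists, size, Ptr{Cvoid}[])
    # Reuse an idle chunk of the same size when possible; only hit the
    # (slow) device allocator when the pool has nothing to offer.
    return isempty(idle) ? alloc_device(size) : pop!(idle)
end

function pool_free(ptr::Ptr{Cvoid}, size::Int)
    # Recycle the chunk into the pool instead of calling release_device:
    # the driver (and nvidia-smi) still reports this memory as allocated.
    push!(get!(free_lists, size, Ptr{Cvoid}[]), ptr)
end
```

This is why `nvidia-smi` shows no drop after `gc()`: the finalizer returns the chunk to MXNet's pool, not to the CUDA runtime.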
Oh, that helps my understanding! What is the policy for reusing memory in the pool? E.g., what if I finalize a chunk of memory and then ask for a larger chunk: would the older chunk be reused? Would the pool ever return memory to CUDA in order to allocate a larger contiguous chunk?
The reason I ask is that I am trying to create two executors of the same network for two different input sizes. I know I have enough memory to support either input size separately, but I cannot figure out how to allocate both of them, at mutually exclusive times, in my code.
Sorry if this is a lot of questions. I can also take a look at the mxnet engine code if it is easily comprehensible to a non-DMLC member.
There are two factors in executor memory consumption: the memory holding the network parameters, and the internal buffers for intermediate results.
If you are using two executors exclusively, there is support for memory sharing between executors, e.g., the bucketing API, which is currently supported in Python. You can bind the executor with the larger input size, and share its memory with the smaller executor in that setting.
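(As a plain-Julia illustration of why binding the larger executor first works: one buffer sized for the biggest shape can back both shapes, as long as they are used at mutually exclusive times. No MXNet calls here; the shapes are made up.)

```julia
# Allocate once for the LARGEST input; smaller work reuses a prefix of it.
const big = Vector{Float32}(undef, 512 * 512 * 3)

# View for the large input: the whole buffer.
large_view = reshape(big, 512, 512, 3)

# View for the small input: a prefix of the same memory, so no second
# allocation is needed. Only safe if the two are never live at once.
small_view = reshape(view(big, 1:256*256*3), 256, 256, 3)
```

The bucketing API applies the same idea to an executor's internal buffers.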
That's good to know about the Python memory sharing. I'm going to stick with the Julia API for now.
I cannot seem to reuse an old (no longer needed) executor's GPU memory for a new executor, even after finalizing the handles. I think a simple API to explicitly free GPU memory would be very helpful (even if less performant) in some scenarios.
For now, I am going to make all input data the same size by resizing. This may have adverse effects on the results, but they will likely be negligible.
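(A minimal sketch of that workaround, assuming Images.jl and FileIO.jl are available; `FIXED_SHAPE` and `load_fixed` are made-up names, and the shape is arbitrary.)

```julia
using Images, FileIO  # `load` reads the image, `imresize` rescales it

const FIXED_SHAPE = (256, 256)  # one input shape for the entire run

# Force every image to FIXED_SHAPE so a single executor, bound once for
# that shape, can be reused for all inputs. May distort aspect ratios.
load_fixed(path) = imresize(load(path), FIXED_SHAPE)
```

With every input at one shape, only one executor ever needs to be bound, so the pool never has to satisfy a larger request than it has already seen.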