apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.78k stars 6.79k forks

Memory reservation feature #12842

Open eric-haibin-lin opened 6 years ago

eric-haibin-lin commented 6 years ago

MXNet's GPU memory consumption changes over the course of a training job. In a shared environment with limited memory (e.g. multiple people sharing the same GPUs in a research lab), the job easily hits an OOM exception. TensorFlow, by contrast, reserves x GB of GPU memory upfront and is never kicked out once the job starts.

It would be great to have an API that reserves x GB of GPU memory for MXNet.
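To make the requested behavior concrete, here is a minimal sketch of an upfront-reservation allocator: grab the whole budget when the job starts, carve later allocations out of it, and never return memory to the system while the job runs. This is a toy model of the idea, not MXNet's actual storage pool; all names are illustrative.

```python
class ReservedPool:
    """Toy model of an upfront-reservation allocator (illustrative only;
    not MXNet's real memory pool)."""

    def __init__(self, capacity_bytes):
        # Reserve the full budget at construction time ("fail early"):
        # a real implementation would cudaMalloc this block once here.
        self.capacity = capacity_bytes
        self.used = 0

    def alloc(self, nbytes):
        # Later allocations are carved out of the reservation, so the
        # process never goes back to the driver for more memory.
        if self.used + nbytes > self.capacity:
            raise MemoryError(
                f"pool exhausted: {self.used + nbytes} > {self.capacity}")
        self.used += nbytes
        return nbytes  # stand-in for a device pointer

    def free(self, nbytes):
        # Freed memory returns to the pool but stays reserved
        # to this process, so other jobs cannot claim it.
        self.used -= nbytes


pool = ReservedPool(capacity_bytes=8 * 1024**3)  # reserve 8 GB upfront
buf = pool.alloc(6 * 1024**3)                    # fits in the reservation
pool.free(6 * 1024**3)                           # back to the pool, still held
```

Once constructed, the pool either holds the full x GB for the lifetime of the job or raises immediately, which is exactly the TensorFlow-like behavior the request describes.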

eric-haibin-lin commented 6 years ago

@samskalicky

samskalicky commented 6 years ago

Interesting idea @eric-haibin-lin. Shouldn't this be handled outside of MXNet, e.g. by something like Slurm, where a scheduler outside of the job (i.e. outside of MXNet) at the OS level constrains resource usage for the process? Typically when you submit jobs in a shared environment you declare upfront how many cores and how much memory you need, and if the process uses more than its allocation, the job gets killed.

I'm not sure anything like this is commonly available for GPUs (outside of the way cloud providers like EC2 structure their hypervisors to split hardware access -- including GPUs -- between guest OSes). But something outside of MXNet, at the OS level, would be a better way to contain resources between users IMO. Let me know what you think.
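As a sketch of the scheduler-level approach described above, a Slurm batch script declares its resource needs upfront and the scheduler kills the job if it exceeds them. The values here are illustrative, and note that `--mem` caps host RAM, not GPU memory; per-GPU memory partitioning generally needs vendor-specific support.

```shell
#!/bin/bash
#SBATCH --gres=gpu:1        # request one GPU for this job
#SBATCH --mem=32G           # host-memory ceiling; exceeding it kills the job
#SBATCH --time=04:00:00     # wall-clock limit

# train.py is a placeholder for the actual training entry point
python train.py
```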

I guess the general idea of reserving memory upfront is good for performance, since it avoids individual allocations later in the run. The same idea should apply to GPU (or any other accelerator) memory too. It might also help the multi-user scenario you mention: you get the memory you need upfront (and fail early) rather than running out of memory late in training.

Thoughts?