cBio / cbio-cluster

MSKCC cBio cluster documentation

gpu queue max memory cap? #388

Closed lzamparo closed 8 years ago

lzamparo commented 8 years ago

Is there a limit to the amount of memory that you can request when submitting to the GPU queue? I did a cursory search of the user guide and found nothing advertised.

[zamparol@mskcc-ln1 submit_scripts]$ qsub submit_docker_kmer_basset_job.sh
qsub: submit error (Job exceeds queue resource limits MSG=cannot satisfy queue max mem requirement)

I've got a large data set (72GB) that I'd like to hold in memory and then process in smaller batches on a GPU. Here's the relevant snippet of my submission script:

# walltime : maximum wall clock time (hh:mm:ss)
#PBS -l walltime=24:00:00,mem=96gb
#
# join stdout and stderr
#PBS -j oe
#
# spool output immediately
#PBS -k oe
#
# specify GPU queue
#PBS -q gpu
#
#  nodes: number of nodes
#  ppn: number of processes per node
#  gpus: number of gpus per node
#  GPUs are in 'exclusive' mode by default, but 'shared' keyword sets them to shared mode.
#  docker: indicator that I want to execute on a node that can run docker. (optional for other ppl)
#  gtxtitan: indicator that I want to execute on nodes that have this particular type of GPU (optional for other ppl)
#PBS -l nodes=1:ppn=1:gpus=1:docker:gtxtitan

So, is there a specific flag I should use in my submission script that allows for large-memory gpu queue jobs? Or is there a memory limit that cannot be circumvented? Or is this a recurrence of #226?

I'd really appreciate any help with this; it's for preliminary results for a grant due this weekend.

lzamparo commented 8 years ago

I see #93 set the limit at 10GB. Does nobody else have to run jobs on data sets larger than 10GB?

tatarsky commented 8 years ago

The GPU queue has had a max memory cap of 10GB, as you note, raised from 4GB a while back:

set queue gpu resources_max.mem = 10gb

I don't really know why the queue was limited in this manner. I have no objection to raising it, so to facilitate your research while others try to recall why this was the case, I have made it 100GB.

Please retry as quickly as you can, as it's late here.
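
For reference, the change amounts to raising the same resources_max.mem setting shown above via qmgr; the commands below are illustrative rather than a transcript of what was actually run:

# illustrative Torque qmgr commands; the 100gb value matches the change described above
qmgr -c "set queue gpu resources_max.mem = 100gb"
# verify the queue definition afterwards
qmgr -c "print queue gpu"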

pgrinaway commented 8 years ago

I think you can still request a GPU in the batch queue, which is what I had done to get a lot of RAM plus a GPU.

tatarsky commented 8 years ago

I believe we added a submit filter (per another GitHub request) to block that. But feel free to try ;)

tatarsky commented 8 years ago

Basically, advise if anyone knows the reason and we can review in the morning. But @lzamparo, given your deadline, please confirm you can at least get something running.

tatarsky commented 8 years ago

Also, just so you know, #226 was a GPFS token memory exhaustion problem, not related here. I am going offline; if you do not confirm you can run now, I will assist in the morning.

tatarsky commented 8 years ago

Oh, and reviewing the submit filter, it was there to prevent running in the gpu queue without requesting a gpu, so the method @pgrinaway mentions would have worked as well. Offline now. You have two options to proceed. I will reduce the gpu queue's max memory limit in the morning if asked.
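
For those curious, a Torque submit filter is just a script that receives the job script on stdin, echoes it back on stdout, and rejects the submission by exiting nonzero. A rough sketch of the kind of check described here (not the actual filter on the cluster):

#!/bin/bash
# illustrative submit filter: reject gpu-queue jobs that do not request a gpu
job=$(cat)
if echo "$job" | grep -Eq '^#PBS[[:space:]].*-q[[:space:]]*gpu' && \
   ! echo "$job" | grep -q 'gpus='; then
    echo "gpu queue jobs must request a gpu (e.g. nodes=1:ppn=1:gpus=1)" >&2
    exit 1
fi
echo "$job"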

lzamparo commented 8 years ago

@pgrinaway: so you just submitted to -q batch but also requested a gpu with #PBS -l nodes=1:ppn=1:gpus=1:docker:gtxtitan (or something similar)? Trying that now...
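
For anyone following along, that just means swapping the queue directive while keeping the resource request from the script I posted above, roughly:

#PBS -q batch
#PBS -l walltime=24:00:00,mem=96gb
#PBS -l nodes=1:ppn=1:gpus=1:docker:gtxtitan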

jchodera commented 8 years ago

That should work!

tatarsky commented 8 years ago

Please note, per my comments above, that the gpu queue max memory remains at 100GB until folks ask me to drop it back down to 10GB.

lzamparo commented 8 years ago

Thanks @tatarsky. Cancelled the batch job and enqueued a gpu job.
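
For completeness, that was just something like the following (the job ID is a placeholder, not the real one):

qstat -u zamparol                      # confirm which jobs are still queued
qdel <batch_job_id>                    # cancel the pending batch-queue job
qsub submit_docker_kmer_basset_job.sh  # resubmit with '#PBS -q gpu' back in place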

tatarsky commented 8 years ago

I believe this to be resolved within reason. If somebody feels 100GB is too high a max for the GPU queue feel free to re-open.

akahles commented 8 years ago

Sorry for joining in so late. Just one note: I think the motivation for the memory limit was to prevent people from using the gpu queue to circumvent a full, stuffed batch queue. However, I don't know how the current system integrates the priorities of these two queues.

tatarsky commented 8 years ago

Well, gpu inherits the batch nodes. So if the batch queue is tapped out in slots and RAM, so is gpu; in other words, if you are waiting on resources in batch, you would be waiting in gpu as well, I believe. I believe gpu has a slight priority preference, but it also requires that you ask for a gpu resource.
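
If you want to compare the two queues yourself, the settings are visible with standard Torque tools from the login node, e.g.:

qstat -Qf batch    # full configuration of the batch queue (limits, priority, state)
qstat -Qf gpu      # same for the gpu queue, including resources_max.mem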

So again, I will lower the limit if desired, but my goal was to allow @lzamparo to meet his deadline.

akahles commented 8 years ago

Having the current limit is fine with me. I just wanted to provide context on what I remember to be one of the reasons the limit was put in place initially.

tatarsky commented 8 years ago

Which is great, and I appreciate it @akahles. If I don't hear anything to the contrary, we will reference your sage memory of the situation when we discover abuse of the gpu queue ;)

lzamparo commented 8 years ago

Thanks again for the quick response, @tatarsky. I should hopefully have some results by this weekend.

tatarsky commented 8 years ago

You are very welcome. Have a great weekend.