This is cluster specific. In general, it is easier to request more cores on PBS clusters than a specific amount of RAM. In Canada, for example, if we want to run a job with 5000 MB, we request two cores with 2500 MB each. This would break the Canadian cluster configs. What about jobs that require more than 4000 MB of RAM? Can they then not run on your cluster?
I understand your point, but I don't think PBS will do this. As far as I know, if you submit: qsub -l mem=5000mb,ppn=1,cpus=2 you will get one job with 2 CPUs and 5000 MB of RAM allowed for the 2 CPUs combined (no scale-up per requested CPU).
To answer your question: I will let it run on the default queue (without defining any queue) and keep the CPU scale-up working as before, but the RAM requirement has to be correct (see the comment above). I don't think this will break anything for anyone, right?
We usually run something like #PBS -l mem=2500mb,ppn=1,cpus=2
in Canada. What matters is what gets reported to HTCondor in Madison. HTCondor will see 2 CPUs and a total memory of 5000 MB and will decide whether or not to split up the slot.
OK, but at the site level (Torque), you appear to request less RAM than you actually intend to use, so you can cause memory starvation by misleading the batch system. I think it worked on those sites because PBS cannot enforce this RAM limit without cgroups (a big, long-known problem), so your jobs were never killed by the site. I don't really know how HTCondor handles resources (RAM per core), but something seems strange here, IMHO.
How would there be memory starvation if I ask PBS (Torque, MOAB, whatever) to give me 2 cores with 2500 MB each and then use one core and 4000 MB, for example? The scheduler will reserve 5000 MB and 2 cores for me on a node. PBS does not care how I use the resources, just what I asked for. I don't see how we would be starved for memory except in the case where the job goes over the advertised memory requirement, something we have no control over. I may be misleading the scheduler by requesting more CPUs than I need, but ultimately that should just hurt me rather than help me.
Our PBS implementation does keep track of RAM. Jobs get killed for using too much RAM at least once or twice a day; I know because I get an email.
Condor handles memory as blocks rather than per requested CPU:
request_memory = 5000
request_cpus = 2
will give you 5000 MB and 2 cores. I am not sure how that is strange. It makes more sense than trying to figure out the memory per core requested.
One thing I'm wondering is, could different variants of PBS see the memory request in different ways?
For instance, there are two ways of interpreting a request for 2 CPUs, 5000 MB:
(1) 5000 MB in total for the whole job, shared by the 2 CPUs;
(2) 5000 MB per CPU, i.e. 10000 MB in total.
Which one is actually correct? Maybe (2) is valid in Canada, and (1) is valid at IIHE.
The solution is probably to stop calling both of them PBS, and make a subclass for one of them.
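To make that concrete, here is a minimal sketch of what such a split might look like, assuming the submit script renders the PBS resource line in a single method; the class and method names (SubmitPBS, SubmitPBSPerCore, resource_line) are hypothetical and not taken from the actual code:
# Hypothetical sketch only; names do not come from the real submit code.
import math

class SubmitPBS(object):
    """Base behaviour: mem is the total for the whole job (interpretation 1)."""
    def resource_line(self, cpus, mem_mb):
        return '#PBS -l mem=%dmb,nodes=1:ppn=%d' % (mem_mb, cpus)

class SubmitPBSPerCore(SubmitPBS):
    """Variant for sites where memory is accounted per core (interpretation 2)."""
    def resource_line(self, cpus, mem_mb):
        per_core = int(math.ceil(float(mem_mb) / cpus))
        return '#PBS -l pmem=%dmb,nodes=1:ppn=%d' % (per_core, cpus)

print(SubmitPBS().resource_line(2, 5000))         # #PBS -l mem=5000mb,nodes=1:ppn=2
print(SubmitPBSPerCore().resource_line(2, 5000))  # #PBS -l pmem=2500mb,nodes=1:ppn=2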
My point is: I'm not sure about the "each" in "give me 2 cores with 2500 MB each". I don't have much experience with multiple cores, so I understand this as a request for the whole job and not per CPU.
@dsschult: I think there might be different implementations, or a misunderstanding on my side. I always understood it as case (1).
@briedel: Which version of Torque are you running? I'm wondering how you can enforce this limit without cgroups. We have been stuck with this issue for a long time without any solution (other than cgroups). If you have one, it would be very helpful! This is one of the reasons (among others) that we want to move to HTCondor too.
We are running both MOAB and Torque here. Not quite sure how, but that is how it goes. I guess the difference in question is mem vs. pmem, where pmem is per core and mem is the total memory.
Doing a quick test: if one sets mem == pmem*num_cpus, the requested slot will still look the same. So one thing we could do is keep dividing the total memory requested by the num_cpus requested, simply change the mem_per_core option for the different queues you can run on, and also set a max_requested_ram so it will ignore any jobs that request too much memory for that queue; a sketch of that is below.
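A rough sketch of that idea, assuming max_requested_ram is configured per queue (mem_per_core would be set per queue in the same way); the function itself is illustrative, not the actual implementation:
# Hypothetical sketch of the per-queue handling described above.
import math

def pmem_for_job(req_mem_mb, req_cpus, max_requested_ram):
    """Divide the total requested memory across the requested CPUs, or return
    None so the job is ignored if it asks for more RAM than this queue allows."""
    if req_mem_mb > max_requested_ram:
        return None
    return int(math.ceil(float(req_mem_mb) / req_cpus))

print(pmem_for_job(5000, 2, max_requested_ram=8000))   # 2500
print(pmem_for_job(12000, 2, max_requested_ram=8000))  # None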
Maybe MOAB is doing it, because PBS tries to enforce this limit by using ulimit on RSS memory, but that doesn't work on "modern" Linux kernels (http://stackoverflow.com/questions/3043709/resident-set-size-rss-limit-has-no-effect/6365534#6365534). Are you using cgroups?
My guess is yes. I am not 100% sure what they use in the background.
OK, that explains why your jobs can get killed; we never could do that. Only cgroups seems to work. I'll run some tests to see how the memory request is handled by our system and get back to you. Cheers
So, after some investigation and tests, here is what I understand:
If you ask for both mem and pmem, only the bigger one is taken. E.g. qsub -l mem=4000mb,pmem=4000mb,nodes=1:ppn=2 will give you 4000 MB per core (meaning 2 x 4000 MB for the whole job); the mem parameter is ignored.
If you ask only for mem, PBS will divide that mem across the requested cores. E.g. qsub -l mem=4000mb,nodes=1:ppn=2 will give you 2000 MB per core (4000 MB / 2 cores).
So defining both of them doesn't make sense (in our case). We should only set the pmem attribute to get the expected RAM per core.
I just realized that this pull request would break the calculation for the correct number of CPUs, which is needed to get the right number of cores to accommodate the RAM request. I agree that we should just set pmem to avoid confusion.
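For reference, the CPU scale-up in question is just this calculation (a sketch with a hypothetical function name; mem_per_core is the per-queue option discussed above):
# Request enough cores so that cores * mem_per_core covers the RAM request.
import math

def cores_for_ram(req_cpus, req_mem_mb, mem_per_core):
    return max(req_cpus, int(math.ceil(float(req_mem_mb) / mem_per_core)))

print(cores_for_ram(1, 5000, mem_per_core=2500))  # 2 cores for a 5000 MB job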
Proposed PR #25 to set only pmem.
Fixed with #25 merged
The requested memory is used for the whole job; we don't need to divide by the number of cores. BTW, this RAM limit never worked without cgroups (on our site).