WIPACrepo / pyglidein

Some python scripts to launch HTCondor glideins
MIT License

[FIX] Don't divide requested memory per core (for PBS) #24

Closed. samary closed this 8 years ago

samary commented 8 years ago

The requested memory applies to the whole job; we don't need to divide it by the number of cores. BTW, this RAM limitation never worked without cgroups (on our site).

briedel commented 8 years ago

This is cluster specific. In general, on PBS clusters it is easier to request more cores than a specific amount of RAM. In Canada, for example, if we want to run a job with 5000 MB we request two cores with 2500 MB each. This change would break the Canadian cluster configs. What about jobs that require more than 4000 MB of RAM? Can such a job then not run on your cluster?

samary commented 8 years ago

I understand your point, but I don't think PBS works that way. As far as I know, if you submit qsub -l mem=5000mb,ppn=1,cpus=2 you will get one job with 2 CPUs and 5000 MB of RAM allowed for the 2 CPUs together (no scaling up per requested CPU).

To answer your question: I would let it run in the default queue (without any queue defined) and keep the CPU scale-up working as before, but the RAM requirement has to be correct (see the comment above). I don't think this will break anything for anyone, right?

briedel commented 8 years ago

We usually run something like #PBS -l mem=2500mb,ppn=1,cpus=2 in Canada. What matters is what gets reported to HTCondor in Madison. HTCondor will see 2 CPUs and a total memory of 5000 MB, and it will decide how to split up the slot or not.
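For reference, one way to check what a glidein actually reports, assuming the htcondor Python bindings are installed (the collector hostname below is just a placeholder):

# Query the collector for what the glidein startd advertises.
import htcondor
coll = htcondor.Collector("collector.example.edu")  # placeholder pool name
ads = coll.query(htcondor.AdTypes.Startd,
                 projection=["Name", "TotalCpus", "TotalMemory"])
for ad in ads:
    # TotalMemory is in MB; a 2-core / 5000 MB glidein shows up as
    # TotalCpus = 2 and TotalMemory ~ 5000, however PBS accounted for it.
    print(ad.get("Name"), ad.get("TotalCpus"), ad.get("TotalMemory"))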

samary commented 8 years ago

Ok, but at the site level (Torque), you are seen as requesting less RAM than what you truly want to use, so you can cause memory starvation by misleading the batch system. I think it worked on those sites because PBS cannot enforce this RAM limit without cgroups (a big problem known for a long time), so your jobs were never killed by the site. I don't really know how HTCondor handles resources (RAM per core), but there is something strange here, IMHO.

briedel commented 8 years ago

How would there be memory starvation if I ask PBS (Torque, MOAB, whatever) to give me 2 cores with 2500 MB each and then use one core and 4000 MB, for example? The scheduler will reserve 5000 MB and 2 cores for me on a node. PBS does not care how I use the resources, just what I asked for. I don't see how we would be starved for memory except when a job goes over its advertised memory requirement, which is something we have no control over. I may be misleading the scheduler by requesting more CPUs than I need, but ultimately that should hurt me rather than help me.

Our PBS implementation does keep track of RAM. Jobs get killed for using too much RAM at least once or twice a day; I know because I get an email.

Condor handles memory as blocks rather than per requested CPU:

request_memory = 5000
request_cpus = 2

will give you 5000 MB and 2 cores. I am not sure how that is strange. It makes more sense than trying to figure out the memory per core requested.

dsschult commented 8 years ago

One thing I'm wondering is, could different variants of PBS see the memory request in different ways?

For instance, there are two ways of interpreting a request for 2 CPUs, 5000 MB:

  1. I want 5000 MB of memory and 2 CPUs
  2. I want 2 CPU slots, both of which have 5000 MB

Which one is actually correct? Maybe (2) is valid in Canada, and (1) is valid in IIHE.

The solution is probably to stop calling both of them PBS, and make a subclass for one of them.
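Something like this, just to sketch the idea (class and method names here are made up, not the actual pyglidein code):

import math

class PBSTotalMem(object):
    # sites where mem= counts for the whole job (interpretation 1)
    def resource_list(self, memory_mb, cpus):
        return "mem={0:d}mb,nodes=1:ppn={1:d}".format(memory_mb, cpus)

class PBSMemPerCore(PBSTotalMem):
    # sites where memory is accounted per core (interpretation 2)
    def resource_list(self, memory_mb, cpus):
        per_core = int(math.ceil(memory_mb / float(cpus)))
        return "mem={0:d}mb,nodes=1:ppn={1:d}".format(per_core, cpus)

# PBSTotalMem().resource_list(5000, 2)   -> 'mem=5000mb,nodes=1:ppn=2'
# PBSMemPerCore().resource_list(5000, 2) -> 'mem=2500mb,nodes=1:ppn=2'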

samary commented 8 years ago

My point is: I'm not sure about the "each" in "give me 2 cores with 2500 MB each". I don't have much experience with multi-core requests, so I understand this as a request for the whole job and not per CPU.

samary commented 8 years ago

@dsschult : I think there might be different implementations, or a misunderstanding on my side. I always understood it as case 1.

samary commented 8 years ago

@briedel : Which version of Torque are you running? I'm wondering how you can enforce this limit without cgroups. We have been stuck with this issue for a long time without any solution (except cgroups). If you have a solution, it would be very helpful! This is one of the reasons (among others) we want to move to HTCondor too.

briedel commented 8 years ago

We are running both MOAB and Torque here. Not quite sure how, but that is how it goes. I guess the difference in question is mem vs. pmem, where pmem is per core and mem is total memory.

I did a quick test: if one sets mem == pmem * num_cpus, the requested slot still looks the same. So one thing we could do is keep dividing the total memory requested by the requested num_cpus, simply change the mem_per_core option for the different queues you can run on, and also set a max_requested_ram so that any job requesting too much memory for that queue is ignored.
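In other words, something along these lines (just a sketch of the idea, not actual code):

import math

def mem_per_requested_core(total_mem_mb, num_cpus, max_requested_ram):
    # ignore jobs that want more RAM than this queue allows
    if total_mem_mb > max_requested_ram:
        return None
    # keep dividing the total request across the requested cores,
    # so mem == pmem * num_cpus and the slot looks the same either way
    return int(math.ceil(total_mem_mb / float(num_cpus)))

# e.g. mem_per_requested_core(5000, 2, 20000) -> 2500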

samary commented 8 years ago

Maybe MOAB is doing it, because PBS tries to enforce this limit using a ulimit on RSS memory, which doesn't work on "modern" Linux kernels (http://stackoverflow.com/questions/3043709/resident-set-size-rss-limit-has-no-effect/6365534#6365534). Are you using cgroups?

briedel commented 8 years ago

My guess is yes. I am not 100% sure what they use in the background.

samary commented 8 years ago

Ok, that explains why you can get your jobs killed; we never could. Only cgroups seems to work. I'll run some tests to see how the memory request is handled by our system and get back to you. Cheers

samary commented 8 years ago

So, after some investigations and tests, here is what I understand :

If you ask for both mem and pmem, only the bigger one is taken. E.g. qsub -l mem=4000mb,pmem=4000mb,nodes=1:ppn=2 will give you 4000 MB per core (i.e. 2 x 4000 MB for the whole job); the mem parameter is ignored.

If you ask only for mem, PBS will divide that mem across the requested cores. E.g. qsub -l mem=4000mb,nodes=1:ppn=2 will give you 2000 MB per core (4000 MB / 2 cores).

So defining both of them doesn't make sense (in our case). We should set only the pmem attribute to get the expected RAM per core.

briedel commented 8 years ago

I just realized that this pull request would break the calculation of the number of CPUs, which is needed to request enough cores to accommodate the RAM request. I agree that we should just set pmem to avoid confusion.
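Roughly, the idea would be to keep something like this (a sketch with made-up names, not the actual patch):

import math

def pbs_resource_list(request_mem_mb, mem_per_core):
    # scale the core count up so ppn * mem_per_core covers the RAM request,
    # then express the per-core memory via pmem instead of mem
    num_cpus = int(math.ceil(request_mem_mb / float(mem_per_core)))
    return "-l pmem={0:d}mb,nodes=1:ppn={1:d}".format(mem_per_core, num_cpus)

# e.g. pbs_resource_list(5000, 2500) -> '-l pmem=2500mb,nodes=1:ppn=2'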

samary commented 8 years ago

Proposed PR #25 to set only pmem.

samary commented 8 years ago

Fixed with #25 merged