ganga-devs / ganga

Ganga is an easy-to-use frontend for job definition and management
GNU General Public License v3.0

Add multicore options to backend attribute #72

Closed · egede closed this 4 years ago

egede commented 8 years ago

For all backends, the ability to request multicore processing should be added.

We discussed at the meeting whether this should be one or two attributes, to distinguish between "as many cores as possible" and "give me X cores". As a request for as many cores as possible might not give access to a full worker node anyway, I think it is best left as a single attribute.

Attribute

The attribute should be called ncore. The default value is 1. A value of -1 means as many cores as possible; a positive value requests exactly that number of cores. Any other number (0 or anything below -1) results in an error at submission time.
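
A minimal sketch of the submission-time check this implies (the helper name validate_ncore is hypothetical, not existing Ganga code):

def validate_ncore(ncore):
    # -1 means "as many cores as possible"; positive integers request
    # exactly that many. Anything else is rejected at submission time.
    if isinstance(ncore, int) and (ncore == -1 or ncore >= 1):
        return ncore
    raise ValueError('ncore must be a positive integer or -1, got %r' % ncore)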

Backend implementation

Each backend will have to transmit this request onwards as part of the submission. If the backend does not support it, any ncore number different from 1 should result in a failed submission.

Worker node implementation

On the worker node, an environment variable should be set that tells the application how many cores are accessible to it. The wrapper script should set this in an environment variable GANGA_NCORE. For specific application objects in Ganga (e.g. Gaudi), we can just start the application with the correct number of cores requested.
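
For illustration, an application-side snippet along these lines (GANGA_NCORE as proposed above; the fallback to a single core is an assumption):

import os

# Read the proposed GANGA_NCORE variable; fall back to a single core.
ncore = int(os.environ.get('GANGA_NCORE', '1'))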

drmarkwslater commented 8 years ago

@egede I would just add that, in addition to this, there should be an option for whole-node scheduling (i.e. I want an entire empty node), as this would be the third possible requirement someone might have.

egede commented 8 years ago

@drmarkwslater Why is this different from "as many cores as possible"? Is there a good use case where 8 cores on a 16-core machine is worse than 8 cores on an 8-core machine?

drmarkwslater commented 8 years ago

@egede I'm not saying that it's worse per se, just that some users may want to specify that, as they want an entire node to run on, and grid sites are certainly starting to offer this as well (though this is generally controlled through the pilots). I suppose this depends on the backend: if there is an option in Dirac for it then great; if not, there is no point in having it :)

By the way, what about wanting a minimum number of cores? e.g. I need at least 4 but I want to take over the rest of the node if I can? I guess this also comes down to the specific backend though...

egede commented 8 years ago

@drmarkwslater I think that in most cases this is in fact what will happen (i.e. if you request more than one core you will tend to get a whole node). My point is that, from a user's point of view, this seems irrelevant.

Your issue about a minimum number of cores is a good one, though. So should we have an interval instead? Specified via two attributes like ncoremin and ncoremax, or via a single attribute that takes a list or tuple with exactly two elements? Like

j = Job(backend=LSF())
j.backend.ncores = [8, -1]  # as many as possible, but minimum 8
j.backend.ncores = [1, 1]   # default value of a single core
j.backend.ncores = [2, 8]   # minimum 2, but I will not efficiently use more than 8

The individual backends might of course still not support the full syntax here; in that case they should throw an exception, with an explanation, at submission time.
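
As a sketch, a backend could validate the interval like this (the helper validate_ncores is hypothetical, not existing Ganga code):

def validate_ncores(ncores):
    # Expect [min, max]; max may be -1 for "as many as possible".
    if len(ncores) != 2:
        raise ValueError('ncores must have exactly two elements')
    nmin, nmax = ncores
    if nmin < 1:
        raise ValueError('minimum number of cores must be at least 1')
    if nmax != -1 and nmax < nmin:
        raise ValueError('maximum must be -1 or at least the minimum')
    return nmin, nmax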

po10 commented 8 years ago

A couple of things that came up today:

For the grid, there's very little use case for allowing users to submit jobs with more than 8 cores, so to avoid jobs waiting forever we should not allow submissions above 8 cores.

Also, it was suggested that we allow users to specify a range of cores, instead of a single number.

egede commented 8 years ago

@po10

For the grid, there's very little use case for allowing users to submit jobs with more than 8 cores, so to avoid jobs waiting forever we should not allow submissions above 8 cores.

This we can just implement in the Dirac backend. It will throw an exception if the maximum is above 8. For the Local backend, it might make sense to do the same if the number of cores specified is above what is actually available.
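
Something along these lines, as a sketch (the limit of 8 comes from the comment above; the function and its name are hypothetical):

import multiprocessing

def check_max_cores(backend_name, max_cores):
    # Dirac: cap requests at 8 cores, as discussed above.
    if backend_name == 'Dirac' and max_cores > 8:
        raise ValueError('Dirac backend: at most 8 cores may be requested')
    # Local: do not ask for more cores than the machine actually has.
    if backend_name == 'Local' and max_cores > multiprocessing.cpu_count():
        raise ValueError('Local backend: only %d cores available'
                         % multiprocessing.cpu_count())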

Also, it was suggested that we allow users to specify a range of cores, instead of a single number.

This fits well with the suggestion above of giving an interval.

fstagni commented 5 years ago

Hi, I am writing in this old task to ask whether progress has been made on the Ganga side regarding multiprocessor jobs (and nodes). On the DIRAC side there has been some progress in this sense over the years (basically, for what interests you, a Job().setNumberOfProcessors() call in the API, and machinery on the pilot+matching side).

alexanderrichards commented 5 years ago

Thanks for this @fstagni, we have had this on hold for a while. If the API is now this easy, it should be easy to include. In fact we have a generic catch-all option that would allow us to use it already, but I think at our meeting tomorrow we can discuss whether we want a more explicit interface.

fstagni commented 5 years ago

BTW, please also consider previous discussions in https://its.cern.ch/jira/browse/LHCBDIRAC-496

alexanderrichards commented 5 years ago

BTW, please also consider previous discussions in https://its.cern.ch/jira/browse/LHCBDIRAC-496

I seemingly don't have access to view this. Maybe @mesmith75 can take a look?

egede commented 4 years ago

Hi, (cc-ing the Ganga team)

We have recently introduced a new site for MultiProcessor jobs (we might have more than one, in fact). This is not advertised to users yet as it's still somewhat unstable, but I think it's time to test it. Would you like to give it a try?

@Jinlin: can you try to submit your jobs using the following options?

either

j.backend.diracOpts('setTag(["MultiProcessor"])')
j.backend.diracOpts('setTag(["8Processors"])')  # If your jobs want 8 processors

or

j.backend.diracOpts('setTag(["MultiProcessor"])')
j.backend.diracOpts('setTag(["WholeNode"])')  # This would "grab everything", but I would not fully advocate it

Once you do (please start with only a few jobs), let us know.

egede commented 4 years ago

Hi,

A while ago we introduced the NumberOfProcessors job option. It is enough to set it to the desired number of cores; do not use the tags (MultiProcessor, 8Processors), which will be generated automatically. These tags are considered an internal DIRAC implementation detail. This does not apply to the WholeNode tag.

Cheers, Andrei

fstagni commented 4 years ago

With some info: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r0/Interfaces/API/Job.py#L559
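
For reference, a minimal usage sketch of the linked call (requires a DIRAC installation; the keyword names are taken from the rel-v7r0 signature):

from DIRAC.Interfaces.API.Job import Job

j = Job()
# Request a fixed number of processors; alternatively pass a range via
# minNumberOfProcessors/maxNumberOfProcessors.
j.setNumberOfProcessors(numberOfProcessors=8)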

egede commented 4 years ago

With some info: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r0/Interfaces/API/Job.py#L559

Looks easy.

mesmith75 commented 4 years ago

What exactly do we want to expose and add to the schema?

The setNumberOfProcessors call in the DIRAC API involves three numbers: do we want to add them all as separate attributes to the Dirac class? I guess some kind of dict would be best, although then you don't have control over the keys.

egede commented 4 years ago

I think that we can get away with two numbers: minProcessors and maxProcessors. By default they will both take the value 1. Submission will fail if either number is below 1 or if min>max.
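
In user code this would look roughly as follows (a sketch; the attribute names follow this proposal and are not yet in the schema):

j = Job(backend=Dirac())
j.backend.minProcessors = 2  # both attributes default to 1
j.backend.maxProcessors = 8  # submission fails if either is below 1 or if min > max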

egede commented 4 years ago

@fstagni Is there any way that a multiprocessor job will know how many CPUs it has been allocated once it is running (i.e. will there be an environment variable or something similar)?

fstagni commented 4 years ago

Programmatically, the most precise way (what's used right now for the MP production jobs) is to call this function: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r0/WorkloadManagementSystem/Utilities/JobParameters.py#L157

DIRAC also sets a DIRAC_PROCESSORS environment variable per job, but here I have a question for @atsareg: is this variable correct in the case of PoolCE?

egede commented 4 years ago

Programmatically, the most precise way (what's used right now for the MP production jobs) is to call this function: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r0/WorkloadManagementSystem/Utilities/JobParameters.py#L157

Is the environment on the worker node set up such that, from a script running on the worker node, I can just do

from DIRAC.WorkloadManagementSystem.Utilities.JobParameters import getNumberOfJobProcessors

Does this have to be Python 2, or will it work in Python 3 as well?

atsareg commented 4 years ago

DIRAC_PROCESSORS in the case of PoolCE shows the number of processors that were available at the moment the user job was picked up, so it does not reflect the job's requirements.

fstagni commented 4 years ago

Is the environment on the worker node set up such that, from a script running on the worker node, I can just do

from DIRAC.WorkloadManagementSystem.Utilities.JobParameters import getNumberOfJobProcessors

Does this have to be Python 2, or will it work in Python 3 as well?

The environment should be OK, but not Python 3. We can also encapsulate this in a DIRAC script, of course. How would you like the number of processors to be exposed? If you want an environment variable, we should expose it via DIRAC (@atsareg this would need to be exposed for the various "InnerCEs" I think: InProcess, Pool, Sudo; maybe only Pool would need to be patched? To be checked).

atsareg commented 4 years ago

I think PoolCE can be relatively easily patched to expose a DIRAC_JOB_PROCESSORS environment variable with the number of processors requested by (and allocated for) the job. I will make a patch.
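
Once that patch lands, a user job could read the allocated count directly, something like this sketch (assuming the variable name above; the fallback to 1 is an assumption):

import os

# DIRAC_JOB_PROCESSORS as proposed above; fall back to a single core.
nproc = int(os.environ.get('DIRAC_JOB_PROCESSORS', '1'))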

fstagni commented 4 years ago

Hello, can I get notified when #1742 goes into production in LHCb? Thanks!