@egede I would just add that I think in addition to this, there should be an option for whole node scheduling (i.e. I want an entire empty node) as this would be the 3rd possible requirement someone might have.
@drmarkwslater Why is this different from "as many cores as possible"? Is there a good use case where 8 cores on a 16-core machine is worse than 8 cores on an 8-core machine?
@egede I'm not saying that it's worse per se, just that some users may want to specify that, as they want an entire node to run on, and grid sites are certainly starting to offer this as well (though this is generally controlled through the pilots). I suppose this depends on the backend - if there is an option in Dirac for it then great; if not, there is no point in having it :)
By the way, what about wanting a minimum number of cores? e.g. I need at least 4 but I want to take over the rest of the node if I can? I guess this also comes down to the specific backend though...
@drmarkwslater I think in fact in most cases this is what will happen (i.e. that if you request more than one core you will tend to get a whole node). My point is that from a user point of view this distinction seems irrelevant though.
Your issue about a minimum number of cores is a good one though. So should we have an interval instead? Specified via two attributes like ncoremin and ncoremax, or should we have a single attribute that takes a list or a tuple with exactly two elements? So like
j = Job(backend=LSF())
j.backend.ncores = [8, -1]  # as many as possible, but minimum 8
j.backend.ncores = [1, 1]   # default value of a single core
j.backend.ncores = [2, 8]   # minimum 2, but I will not use more than 8 efficiently
The individual backends might of course still not be able to support the full syntax here and could throw an exception, with an explanation, at submission time if they can't.
A couple of things that came up today:
For the grid, there's very little use case for allowing users to submit jobs with more than 8 cores. So, to avoid jobs waiting forever, we should not allow jobs to be submitted with more than 8 cores.
Also, it was suggested that we allow users to specify a range of cores, instead of a single number.
@po10
> For the grid, there's very little use case for allowing users to submit jobs with more than 8 cores. So, to avoid jobs waiting forever, we should not allow jobs to be submitted with more than 8 cores.
This we can just implement in the Dirac backend. It will throw an exception if the maximum is above 8. For the Local backend, it might make sense to do the same if the number of cores specified is above what is actually available.
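For illustration, a minimal sketch of such a check; the helper name and the use of ValueError are hypothetical, not actual Ganga internals:

MAX_GRID_CORES = 8  # cap suggested above to avoid jobs waiting forever

def check_grid_core_request(ncore_max):
    # Fail the submission with an explanation if the cap is exceeded.
    if ncore_max > MAX_GRID_CORES:
        raise ValueError('Dirac: requested maximum of %d cores exceeds the grid limit of %d'
                         % (ncore_max, MAX_GRID_CORES))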
> Also, it was suggested that we allow users to specify a range of cores, instead of a single number.
So this fits well with the suggestion above of giving an interval.
Hi,
I'm writing in this old task to ask whether progress has been made on the Ganga side regarding multiprocessor jobs (and nodes). On the DIRAC side there has been some progress over the years (basically, for what interests you, a Job().setNumberOfProcessors() call in the API, and machinery on the pilot and matching side).
Thanks for this @fstagni, we have had this on hold for a while. If the API is now this easy, it should be straightforward to include. In fact we have a generic catch-all option that would allow us to use it already, but I think at our meeting tomorrow we can discuss whether we want a more explicit interface.
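For illustration, a sketch of that catch-all route, following the diracOpts call style that appears later in this thread; that setNumberOfProcessors() can be injected this way is an assumption, not a confirmed interface:

j = Job(backend=Dirac())
j.backend.diracOpts('setNumberOfProcessors(8)')  # assumption: pass the DIRAC API call through the catch-all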
BTW, please also consider previous discussions in https://its.cern.ch/jira/browse/LHCBDIRAC-496
I seemingly don't have access to view this. Maybe @mesmith75 can take a look?
Hi, (cc-ing the Ganga team)
We have recently introduced a new site for MultiProcessor jobs (we might have more than one, in fact). This is not advertised to the users yet as it's still somewhat unstable, but I think it's time to give it a try. Would you like to test it?
@Jinlin: can you try to submit your jobs using the following options?
either
j.backend.diracOpts('setTag(["MultiProcessor"])')
j.backend.diracOpts('setTag(["8Processors"])')  # if your jobs want 8 processors
or
j.backend.diracOpts('setTag(["MultiProcessor"])')
j.backend.diracOpts('setTag(["WholeNode"])')  # this would "grab everything", but I would not fully advocate it
Once you do it (please start with only a few jobs), let us know.
Hi,
A while ago we introduced the NumberOfProcessors job option. It is enough to set it to the desired number of cores; do not use the tags (MultiProcessor, 8Processors), which will be generated automatically. These tags are considered an internal DIRAC implementation detail. This does not concern the WholeNode tag.
Cheers, Andrei
With some info: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r0/Interfaces/API/Job.py#L559
Looks easy.
What exactly do we want to expose and add to the schema?
The thing in the DIRAC API involves 3 numbers - do we want to add them all as separate attributes to the Dirac class? I guess some kind of dict would be best, although then you don't have control over the keys.
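For reference, a sketch of the DIRAC-side call with its three numbers; the keyword names are assumed from the linked Job.py and should be verified against the actual signature:

from DIRAC.Interfaces.API.Job import Job as DiracJob

dj = DiracJob()
# Keyword names assumed from the linked Job.py; verify before use.
dj.setNumberOfProcessors(minNumberOfProcessors=2, maxNumberOfProcessors=8)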
I think that we can get away with two numbers: minProcessors and maxProcessors. By default they will both take the value 1. Submission will fail if either number is below 1 or if min > max.
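A minimal sketch of that check; the helper is hypothetical, not actual Ganga code:

def validate_processors(minProcessors=1, maxProcessors=1):
    # Hypothetical submission-time check implementing the rules above.
    if minProcessors < 1 or maxProcessors < 1:
        raise ValueError('minProcessors and maxProcessors must both be >= 1')
    if minProcessors > maxProcessors:
        raise ValueError('minProcessors must not exceed maxProcessors')
    return minProcessors, maxProcessors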
@fstagni Is there any way that a multiprocessor job will know how many CPUs it has been allocated once it is running (i.e. will there be an environment variable or something similar)?
Programmatically, the most precise way (what's used right now for the MP production jobs) is to call this function: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r0/WorkloadManagementSystem/Utilities/JobParameters.py#L157
DIRAC also sets a DIRAC_PROCESSORS environment variable per job, but for this I have a question for @atsareg: is this variable correct in the case of PoolCE?
> Programmatically, the most precise way (what's used right now for the MP production jobs) is to call this function: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r0/WorkloadManagementSystem/Utilities/JobParameters.py#L157
Is the environment on the worker node set up such that, from a script running on the worker node, I can just do
from DIRAC.WorkloadManagementSystem.Utilities.JobParameters import getNumberOfJobProcessors
Does this have to be python2 or will it work in python3 as well?
DIRAC_PROCESSORS in the case of PoolCE shows the number of processors that were available at the moment of picking up the user job. So it does not reflect the job requirements.
> Is the environment on the worker node set up such that, from a script running on the worker node, I can just do
> from DIRAC.WorkloadManagementSystem.Utilities.JobParameters import getNumberOfJobProcessors
> Does this have to be python2 or will it work in python3 as well?
The environment should be OK, but not python3. We can also encapsulate this in a DIRAC script, of course. How would you like the number of processors to be exposed? If you want an environment variable we should expose it via DIRAC (@atsareg this would need to be exposed for the various "InnerCEs" I think - InProcess, Pool, Sudo, maybe only Pool would need to be patched? To check).
I think PoolCE can be relatively easily patched to expose a DIRAC_JOB_PROCESSORS environment variable with the number of processors requested by (and allocated for) the job. I will make a patch.
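Once such a patch is in, a job script could pick the value up along these lines (defaulting to 1 when the variable is unset is an assumption, not guaranteed DIRAC behaviour):

import os

# Read the processor count via the variable proposed above.
ncores = int(os.environ.get('DIRAC_JOB_PROCESSORS', '1'))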
Hello, can I get notified when #1742 goes into production in LHCb? Thanks!
For all backends, the ability to request multicore processing should be added.
We discussed at the meeting whether this should be one or two attributes, to distinguish between "as many cores as possible" and "give me X cores". As a request for as many as possible might anyway not give access to a full worker node, I think it is best left as a single attribute.
Attribute
The attribute should be called ncore. The default value is 1. A value of -1 means as many cores as possible; a positive value means that you request this number of cores. Any other number (0 or below -1) results in an error at submission time.
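For illustration, the proposed attribute in use (a sketch of the interface described above, not yet implemented):

j = Job(backend=LSF())
j.backend.ncore = 4   # request exactly 4 cores
j.backend.ncore = -1  # as many cores as possible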
Backend implementation
Each backend will have to transmit this request onwards as part of the submission. If the backend does not support it, any ncore number different from 1 should result in a failed submission.
Worker node implementation
On the worker node, the wrapper script should set an environment variable, GANGA_NCORE, that tells the application how many cores it has access to. For specific application objects in Ganga (i.e. Gaudi), we can just start up the application with the correct number of cores requested.
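An application script could then pick up the allocation along these lines (defaulting to 1 when the variable is unset is an assumption of this sketch):

import os

# Sketch: read the core count exported by the Ganga wrapper script.
# Defaulting to 1 when GANGA_NCORE is unset is an assumption.
ncore = int(os.environ.get('GANGA_NCORE', '1'))
print('Running on %d core(s)' % ncore)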