CC @phicharp as he did a lot of work in this part.
I first started writing specific comments, but I realized we first need some long-awaited definitions, which I would be happy to discuss and then add to https://dirac.readthedocs.io/en/latest/AdministratorGuide/Systems/WorkloadManagement/Jobs/index.html:
You also used cpu for real time seconds of cpu used? Please clarify.

It would be great if the definitions above could also be used consistently throughout the code.
Please, don't mix seconds and minutes: always use seconds.
In general, in DIRAC we should never make decisions based on the CPUTime, but always on the CPUWork (which means always using CPUTime and CPUPower together).
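As a trivial illustration of that rule (a sketch only; the variable names are illustrative, not the actual DIRAC ones):

```python
# A payload needs the same CPUWork (HS06s) everywhere; the CPUTime (s) it takes depends on the CPU.
cpuPower = 12.5                       # HS06, measured on this worker node
cpuTimeLeft = 3600                    # real CPU seconds left in the slot
cpuWorkLeft = cpuTimeLeft * cpuPower  # 45000 HS06s: this is what matching decisions should use
```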
And regarding the actually proposed solutions: I agree with everything you are proposing. On your last point:
Decrease CPUTimeLeft compared to the reality before requesting a job?
To make sure the matched job will be able to run, we could decrease a bit the CPUTimeLeft to take into account the time spent to match the job. How to decrease it? Using the number of processors used and a random number of seconds?
Normally payloads last long enough to make it possible to "ignore" the matching time. This is clearly a non-precise assumption, but normally ~correct. Maybe you just add a fixed amount of time, to make it simple?
If we are to define the terminology, we should also keep in mind that the job JDL parameter CPUTime is actually CPUWork according to the definitions above. And this is an interface parameter, so we should think twice before touching it. BDII defines maxCPUTime for a queue, which is really CPUTime because the CPUPower is defined separately (the SI00 parameter). So, if we define the terminology, we should adapt it everywhere consistently.
As for reducing the time left before sending it to the matcher, I would suggest the following. If the batch system time-left utility gets the estimate from the batch system, then use it as it is. If not, then the total initial time in the slot is reduced by a factor, let's say 20%, and used for the estimation at each JobAgent cycle. We'd better have a safety margin than kill the job because of misinterpretation of normalization factors.
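A minimal sketch of that fallback (hypothetical names, not the actual TimeLeft utility API):

```python
SAFETY_FACTOR = 0.8  # keep a ~20% margin, as suggested above

def estimateTimeLeft(batchSystemTimeLeft, initialSlotTime, elapsedInSlot):
    """Trust the batch system estimate if we have one; otherwise use a reduced initial slot time."""
    if batchSystemTimeLeft is not None:
        return batchSystemTimeLeft
    # no reliable estimate: start from 80% of the slot and decrease it at each JobAgent cycle
    return initialSlotTime * SAFETY_FACTOR - elapsedInSlot
```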
What BDII defines is quite terrible, and I would suggest that at least those parameters are renamed to the correct terminology (if we agree on the terminology itself, first of all). The CPUTime JDL parameter is misleading, but in the past we also renamed JDL parameters, very slowly (starting by duplicating them).
CPUTime in the JDL is in fact very confusing for users. They naturally tend to interpret it as real CPUTime, whatever the CPU is. So, changing its interpretation is not a large problem; it mostly affects those users that use it properly: suddenly they will start to require too much CPU and have jobs waiting forever. If we introduce a CPUWork JDL parameter, I think we still have to interpret CPUTime as the real CPU time of an average processor, e.g. 10 HEPSPEC'06.
The time estimate is rather tricky in case of the PoolComputingElement. In this case normally we have to take into account the already running jobs, estimate how much CPU they will still consume and subtract it from the timeLeft to be sent to the matcher. But for that we have to rely on the CPU requirement provided by users...
We can't estimate the CPUWork that already running jobs on PoolCEs will consume, but we can use the value that their JDLs store (in CPUTime -> CPUWork) and simply subtract it from the CPUWorkLeft at matching time. But of course, as you write, this relies on what's in the JDL (user-provided!).
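A sketch of that subtraction (assuming the JDL-declared requirement of each running payload is available in HS06s; these are not the actual PoolComputingElement attribute names):

```python
def cpuWorkLeftForMatching(cpuWorkLeftFromBatchSystem, runningJobCPUWorks):
    """Subtract the CPUWork declared in the JDLs of payloads already running on the Pool CE."""
    return cpuWorkLeftFromBatchSystem - sum(runningJobCPUWorks)

# e.g. 4000 HS06s left in the slot, two running payloads declaring 1000 and 500 HS06s
print(cpuWorkLeftForMatching(4000, [1000, 500]))  # 2500 HS06s usable for the next match
```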
(C)PUTimeLeft is the real time allowed in the queue. unit: real seconds. In case there's only one payload (so, no filling mode) then this is equivalent to WallClockTime (see next). You also used cpu for real time seconds of cpu used? Please clarify.
If I understand correctly your comment, the current definition of (C)PUTimeLeft in the code would be (C)PUWorkLeft, as it is actually in HS06s.
And I would define (C)PUTime as the user+system time in seconds. Is that really equivalent to WallClockTime? Doesn't it depend on the nature of the payload and the number of processors available (cpu-bound or I/O-bound task)?
But indeed, in SLURM, CPUTime seems equal to WallClockTime * number of processors, because the time allocation limit is defined by the WallClockTime. From what I understand, this limit can, in other batch systems, be defined in (C)PUTime, right?
Can it depend both on (C)PUTime and WallClockTime?
In my example, what is defined as cpu is actually what I would call (C)PUTime elapsed (the user+system elapsed time in seconds).
WallClockTime meaning the elapsed real time of a single payload: unit: real seconds.
Why only a single payload?
If I apply these definitions to the TimeLeft formulas above:
```
CPUWorkLeft = (CPUTimeLimit - CPUTime) * CPUPower
CPUTime = CPUTimeLimit - CPUWorkLeft / CPUPower
WallClockTime = CPUTime / numberProcs
```
Is that right?
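As a quick numeric check of these formulas (made-up numbers):

```python
cpuPower = 10         # HS06
cpuTimeLimit = 1000   # real CPU seconds allowed in the slot
cpuWorkLeft = 8000    # HS06s still to be delivered

cpuTime = cpuTimeLimit - cpuWorkLeft / cpuPower             # 200 s of CPU already consumed
assert (cpuTimeLimit - cpuTime) * cpuPower == cpuWorkLeft   # first formula, consistent with the second

numberProcs = 4
wallClockTime = cpuTime / numberProcs                       # 50 s of wallclock, if all 4 processors were busy
```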
Normally payloads last long enough to make it possible to "ignore" the matching time. This is clearly a non-precise assumption, but normally ~correct. Maybe you just add a fixed amount of time, to make it simple?
Yes, I think in most cases we can ignore the matching time; I assume it would rarely fail. Adding a fixed amount of time like 3 seconds ("long" matching time) * 25 (biggest normalization factor) would be a solution, I guess.
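In code, that would be something like (a sketch; the 3 s and the factor 25 are just the values suggested above):

```python
MATCHING_TIME = 3    # s, pessimistic duration of a matching cycle
MAX_CPU_POWER = 25   # HS06, biggest normalization factor we expect

cpuWorkLeft = 4000   # HS06s, example value coming from the TimeLeft estimation
# remove the work that could be consumed while we are busy matching
cpuWorkLeft -= MATCHING_TIME * MAX_CPU_POWER  # 75 HS06s of safety margin
```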
The time estimate is rather tricky in case of the PoolComputingElement. In this case normally we have to take into account the already running jobs, estimate how much CPU they will still consume and subtract it from the timeLeft to be sent to the matcher. But for that we have to rely on the CPU requirement provided by users.
We can't estimate the CPUWork that already running jobs on PoolCEs will consume, but we can use the value that their JDLs store (in CPUTime -> CPUWork) and simply subtract it from the CPUWorkLeft at matching time. But of course as you write this relies on what's in the JDL (users' provided!)
Indeed, it's the simplest solution, but it doesn't work in the case where CPUTime depends on WallClockTime, as the CPUTime is then constantly decreasing. In this case, the CPUWorkLeft should correspond to the WallClockTime left, i.e. CPUWorkLeft / numberProcs, because:
- the Matcher would not return a job requiring a CPUWork superior to CPUWorkLeft, but a matched job will still not be able to run until completion if its CPUWork is superior to CPUWorkLeft / numberProcs;
- we don't know how much CPUPower is used by each individual process, so we have to take into account that one of the processes can consume 99% of the CPUWork allocated.
Thus, I would go for the following solution:

```
# Compute cpuWorkLeft
cpuWorkLeftBS = compute_CPUWorkLeft()
if cpuTime depends on wallClockTime:
    cpuWorkLeft = cpuWorkLeftBS / numberProcs  # or simply wallClockTimeLeftBS
else:
    if multi-proc resource:
        # subtract the CPUWork of the job(s) submitted in the previous cycles
        cpuWorkLeft = cpuWorkLeftBS - jobCPUWork
    else:
        cpuWorkLeft = cpuWorkLeftBS

# Fetch a job according to cpuWorkLeft and numberProcsLeft
job = matcher.requestJob()
...
result = submitJob(job)
...
# Get the job's CPUWork, to decrease cpuWorkLeft in the next cycle before fetching a new job
if result['OK']:
    jobCPUWork = job.CPUWork
else:
    jobCPUWork = 0
```
But: how do we evaluate the `cpuTime depends on wallClockTime` condition?

Another point is that, in the case CPUTime doesn't depend on WallClockTime, the CPUWork allocated to a job is lost even if the job doesn't consume it.
One possible solution would lie in the usage of the WatchDog. I don't have much knowledge about it yet, but couldn't we get the CPUWork effectively consumed by a job from it? If so, the result could be written in a file that the JobAgent could read to add the value back to CPUWorkLeft.
If I understand correctly your comment, the current definition of (C)PUTimeLeft in the code would be (C)PUWorkLeft as it is actually in HS06s.

Yes. We have not been very good with naming things, and this is one of the many points where we can apply some clarity; that's why I listed the definitions above.

And I would define (C)PUTime as the user+system time in seconds. Is that really equivalent to WallClockTime? Doesn't it depend on the nature of the payload and the number of processors available (cpu-bound or I/O-bound task)?

If it's real seconds, IMHO we should not count the number of CPUs, but some batch systems may have different ideas (indeed, see SLURM).
Your TimeLeft formulas look OK to me.
There is at the same time something that I still don't grasp in your definitions and calculations for the case of multiprocessor resources and jobs, so I go for an example.

There are 4 jobs in the queue:
- jobSP: CPUWork: 1000 HS06s, NumberOfProcessors: 1
- jobMP2: CPUWork: 3000 HS06s, NumberOfProcessors: 2
- jobMP4: CPUWork: 3000 HS06s, NumberOfProcessors: 4
- jobSPMP: CPUWork: 500 HS06s, NumberOfProcessors: 1 or 2

And they are all queueing at a queue that has:
- NumberOfProcessors: 4
- CPUTimeLimit: 100 s
The pilot starts and calculates a CPUPower of 10 HS06. Since this is measured for 1 processor, we assume that the queue has a CPUPower of 40 (this is really bad, I admit, but we don't have other measures!). We calculate that:
CPUWorkLeft = 100 * 40 = 4000 HS06s
The question is: which job(s) can we match? (considering the matching time to be 0, for simplicity)
The Watchdog is not helping in this case I am afraid.
Dear all,
Jumping a bit late into the game, so it's a bit hard to catch up, but I think all the comments made so far are sensible... I am adding here some historical background and personal considerations...
IMHO, we should first take the definitions Federico recalled and fix all variable names in the code, including JDL parameters, keeping backward compatibility, if feasible (using double parameters would not harm for a while provided it is really for a while!)...
We have to deal with:
I would go as far as no longer specifying HS06, as the current DIRACBenchmark no longer really scales with the old, deprecated HS06... but please consider that BDII still uses SI2k (a.k.a. SI00) as benchmark, while HS06 was adopted by WLCG in 2007... FYI the conversion is a plain factor 250, which is indeed a rounding of a measured 239 at the time, but it made 1 kSI2K = 4 HS06, which was simpler...
B.t.w. we are not even using DB12: in order to rescale DB to match more or less HS06, I have introduced a scaling factor Operations/JobScheduling/CPUNormalisationCorrection that is currently set to 0.65 (it was set in 2016, hence leading to a DB16 unit ;-) )
The original hope was that, in addition to DB12/16, we would get generalised so-called MJF (Machine and Job Features) everywhere, which would provide an agreed benchmarking of the local CPU (in whatever unit; there was discussion about an HS14, or using DB12, or...) in machineFeatures, as well as the CPU and WallClock time available to the job in JobFeatures. This was in order to avoid everybody having to develop plugins for each and every batch system, and was a way to convey the information for non-batch WNs like VMs or containers (in which case the MJF information would be provided over http).
As more batch systems became available and sites started to use specific features of the existing batch systems, the plugins in DIRAC became essentially unmaintainable and were really unmaintained... LSF is a nightmare as is Grid Engine due to the many parameters that system managers can introduce...
Concerning CPUTime and WallClockTime, again there was never a consensus in WLCG on how to define those, although it sounds rather logical to have the limit in CPUTime < the limit in WallClockTime, since the CPU efficiency (CPUTime/WallClockTime) should always be less than 1 (caveat the bit of multithreading that is unavoidable in all applications).
The proposal (that I made centuries ago) was to have a CPU limit (in real seconds) on which we could evaluate the available CPUWork and match only jobs that require less work than that limit, and to have a WallClockTime limit much higher, just in order to avoid idle jobs keeping a slot forever, e.g. 50% larger than CPUTimeLimit. If the job behaves normally it would hit CPUTimeLimit before WallClockTimeLimit (that was not agreed by all experiments, as several had <30% efficiencies :()
In the real world, a lot of sites set CPUTimeLimit == WallClockTimeLimit, which means that under normal circumstances WallClock is hit before CPU... One way out would be to assume jobs are well-behaving and have an efficiency of ~80-90%, thus scaling down the CPUWork capacity of the slot. Anyway, due to the fuzziness of the power measurement (related to many factors), one has to keep a safety margin of the order of 20-30% to match jobs (and CPUWork requirements of jobs are also usually evaluated with a safety margin, except for MCSimulation where the number of events is derived from the capacity of the slot, but also with a 30% safety margin).
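As a worked example of that scale-down (made-up numbers): with CPUTimeLimit == WallClockTimeLimit == 48 h and a measured power of 10 HS06, assuming an 85% CPU efficiency the slot should be advertised as roughly 48 * 3600 * 10 * 0.85 ≈ 1.47 MHS06s of CPUWork rather than the nominal ~1.73 MHS06s, with a further 20-30% safety margin applied on top when matching.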
All the above stands for single processor slots. For multi-processor slots, not much difference except that the CPUTime limit should be scaled (approximately) with the number of processors, as alluded to by Alexandre... Of course WallClockTime remains what it is ;-).
Another remark on the absurd parameters CPUScalingFactor and CPUNormalisationFactor... The existence of these 2 parameters is historical: Normalisation was used, and Scaling was a second parameter introduced by the author in order to be able to play around with other ways to "normalise" time. I abused this when MJF was tentatively introduced and was using ScalingFactor for reporting the MJF power when available. When it was not available it was set == NormalisationFactor, and the trick to know whether MJF was there was that the DB12/16 power was given with a single digit while the MJF one was given with 2 digits... This is awful, but it introduced minimal code changes... All this should of course be fixed as DB12Power and MJFPower in the code and in the local CS parameters...
B.t.w. it is also an ugly feature of this utility to pass those parameters as if they were CS parameters... It works but is rather confusing, as one never really knows what is made persistent in a local .cfg file and what is only in memory... One gets used to it, but it is not very clear...
I hope this is not introducing more confusion than needed to an already extremely confusing item...
Of course there is always the ATLAS/CMS option: don't care, and if jobs fail, they fail... They always assume jobs can run on any slot, even in cases when they use filling mode (but I think they abandoned that)... B.t.w. it is not clear to me whether using filling mode is really very useful if jobs have a reasonably long duration, and due to all the problems observed with the CPUTimeLeft utility, at many sites we are not using filling mode, or just 2 jobs and "tant pis" (too bad) if they don't fit!
Note that, as Andrei pointed out, users usually have not a single clue of what to put as a requirement and they just rely on the ganga default!
I see in the code that there are already variables like DB12 and DB12Measured in the local CS, set by the dirac-wms-cpu-normalization script... Already a step forward ;-)
Thanks for your comments Philippe. A couple notes:
IMHO, we should first take the definitions Federico recalled and fix all variable names in the code, including JDL parameters, keeping backward compatibility, if feasible (using double parameters would not harm for a while provided it is really for a while!)...
Yes yes yes!
B.t.w. we are not even using DB12, as in order to rescale DB to match more or less HS06, I have introduced a scaling factor Operations/JobScheduling/CPUNormalisationCorrection that is currently set to 0.65 (was set in 2016, hence leading to DB16 unit ;-) )
This is in LHCb; for everyone else it's still DB12. Or, we could adjust it directly in the code. But, as mentioned below, DBxx was designed for a single-processor era.
B.t.w. it is not clear to me whether using filling mode is really very useful if jobs have a reasonably long duration, and due to all the problems observed with the CPUTimeLeft utility, at many sites we are not using filling mode tr just 2 jobs ans "tant pis" if they don't fit!
Even before using filling mode (which is a JobAgent option), there's a "hidden filling mode" in the Pool CE, which is what is used for multiprocessor nodes (and is tricky for many-core architectures where we "play tetris").
So, if I try to sum up what we said:
To get the right job to run on a Worker Node, we need:
Time: a time-left estimation in real seconds that we can get from the LRMS, but:
- if CPUTimeLimit == WallClockTimeLimit, WallClock is hit before CPU: we need to assume jobs are well-behaving with an efficiency of ~80-90% and scale down the CPUTime capacity of the slot (so that CPU finishes before WallClockTime);
- if CPUTimeLimit >> WallClockTimeLimit: we need to check that CPUTimeLimit/NbProcessors <= WallClockTimeLimit, and scale down the CPUTime capacity of the slot again.

Questions:
It is not clear to me whether using filling mode is really very useful if jobs have a reasonably long duration, and due to all the problems observed with the CPUTimeLeft utility
- Do we have an idea of the time wasted, that was allocated but not used because filling mode is disabled? Of the time a pilot takes to go from "submitted" to "running"?
Considered solutions:
- Use current plugins and do not use filling mode (current situation):
- Maintain the current plugin:
- Find already existing batch system plugins that would always provide an accurate value of the time left:
- Do not use any plugin and run jobs similar to the ones that already ran in a specific queue:
- Do not use any plugin and do not use filling mode:
Power: Dirac Benchmark (DBXX) aims at reproducing HS06 but:
Questions:
Considered solutions:
- Use DBXX as it is now for multi-processors (current situation):
  - if CPUPower == Power of one processor, the pilot would not fetch the largest jobs;
  - if CPUPower == Power of one processor * nbProcessors, the pilot would fetch jobs that would likely not run, and would need further changes.
- Improve DBXX to work with multi-processors:
- Do not use DBXX and run jobs similar to the ones that already ran in a specific queue:
Short term solution:

*CPU Time computation (in `TimeLeft`)*:
```
factor = 0.8  # (?)
dependsOnWallClockTime = False
cpuTimeLeft, wallClockTimeLeft = getCPUWallClockTimeLeft()
if wallClockTimeLeft * factor < cpuTimeLeft:
    # In this case, wallClockTimeLeft may influence cpuTimeLeft and has to be taken into account
    if wallClockTimeLeft * factor >= cpuTimeLeft / nbProcs:
        # Here, we use a small portion of the cpuTime to avoid fetching jobs
        # that could not run within wallClockTimeLeft
        cpuTimeLeft = cpuTimeLeft / nbProcs
    else:
        # Here, wallClockTimeLeft is too low to be ignored, we use it to fetch the jobs
        cpuTimeLeft = wallClockTimeLeft * factor
        dependsOnWallClockTime = True
cpuWorkLeft = cpuTimeLeft * cpuPower
```
Example:

- 1st case: wallClockTimeLeft = 1000000, cpuTimeLeft = 10000, nbProcs = 4

  1000000 * 0.8 < 10000 is False

  In this case, wallClockTime is so large that it can be ignored; we would use the original cpuTimeLeft value.
- 2nd case: wallClockTimeLeft = 10000, cpuTimeLeft = 1000000, nbProcs = 4

  10000 * 0.8 < 1000000 is True
  10000 * 0.8 >= 1000000 / 4 is False
  cpuTimeLeft = 10000 * 0.8

  wallClockTimeLeft is too small compared to cpuTimeLeft to be ignored; it will certainly hit 0 first and has to be taken into account.
- 3rd case: wallClockTimeLeft = 10000, cpuTimeLeft = 10000, nbProcs = 4

  10000 * 0.8 < 10000 is True
  10000 * 0.8 >= 10000 / 4 is True
  cpuTimeLeft = 10000 / 4 = 2500

  wallClockTimeLeft is similar to cpuTimeLeft, but we cannot use it as we have multiple processors: 2 jobs, requiring 1 processor each and `wallClockTimeLeft * factor` seconds, would fail (8000 + 8000 > 10000).
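For reference, here is a minimal runnable version of the logic above, with the three cases as checks (the function and argument names are illustrative, not the actual DIRAC TimeLeft API):

```python
def computeCPUTimeLeft(cpuTimeLeft, wallClockTimeLeft, nbProcs, factor=0.8):
    """Reduce cpuTimeLeft when wallClockTimeLeft is the effective limit (sketch of the proposal above)."""
    dependsOnWallClockTime = False
    if wallClockTimeLeft * factor < cpuTimeLeft:
        # wallClockTimeLeft may influence cpuTimeLeft and has to be taken into account
        if wallClockTimeLeft * factor >= cpuTimeLeft / nbProcs:
            # use a small portion of the cpuTime to avoid fetching jobs that
            # could not run within wallClockTimeLeft
            cpuTimeLeft = cpuTimeLeft / nbProcs
        else:
            # wallClockTimeLeft is too low to be ignored, use it to fetch the jobs
            cpuTimeLeft = wallClockTimeLeft * factor
            dependsOnWallClockTime = True
    return cpuTimeLeft, dependsOnWallClockTime

assert computeCPUTimeLeft(10000, 1000000, 4) == (10000, False)   # 1st case
assert computeCPUTimeLeft(1000000, 10000, 4) == (8000.0, True)   # 2nd case
assert computeCPUTimeLeft(10000, 10000, 4) == (2500.0, False)    # 3rd case
```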
*CPU Power*:
No change for the moment.
Example:
Let's use your example, @fstagni .
If we don't change anything, CPUPower computed is 10 for 1 processor, and we keep it like this.
Thus, `CPUWorkLeft = 100 * 10 = 1000`.
* jobSP: run it `(1000 == 1000)`
* jobMP2: doesn't run it `(3000 > 1000)`
* jobMP4: doesn't run it `(3000 > 1000)`, but in theory it could
* jobSPMP: run it `(500 < 1000)`
*CPU Work computation (in `JobAgent`)*:
```
cpuWorkLeft, dependsOnWallClockTime = getCPUWorkLeft()
if not dependsOnWallClockTime and nbProcessors > 1:
    cpuWorkLeft -= jobCPU
```
Any comment on the *considered solutions*? On the short term solution?
Good summary Alexandre.
Time:
Do the plugins return an incorrect time left result because of the specific features of the batch systems? Because of the output that is not easily parseable?
Because of the output that is not easy to parse
If plugins do not return the right CPUTimeLeft value, do the elastic MC simulation jobs work "by chance"?
Most of the time they work because:
It is not clear to me whether using filling mode is really very useful if jobs have a reasonably long duration, and due to all the problems observed with the CPUTimeLeft utility
For Single Processor jobs in Single Processor job slots, filling mode has limited utility. For the case of MultiProcessor job slots, it's basically a must from the day we start to "play tetris"
Do we have an idea of the time wasted, that was allocated but not used due to the filling mode disabled? Of the time a pilot take to go from "submitted" to "running"?
We have never recorded these values anywhere.
I am afraid that the only solution is maintaining the current plugins. Mostly the SLURM one as this is what's used in HPCs.
Power:
Why did we need a CPUNormalisationCorrection?
Not to scale with HS06, but to scale with the LHCb applications (namely Gauss)
Which main factors can impact the power measurement?
Please ask Andrea Valassi (CERN), as he's the LHCb liaison in the Benchmarking WG of WLCG.
Improve DBXX to work with multi-processors: Would it be hard?
No idea...
Right now we can ~reliably only stick with DB12. There are no other ready-to-use solutions on the horizon, AFAIK. But again, ask Andrea Valassi.
Short term solution
Go ahead with it.
Update
The original issue was related to the use of Slurm, which is a batch system based only on wallclock time, in multi-core environments.
We proposed a SlurmResourceUsage plugin (https://github.com/DIRACGrid/DIRAC/pull/4482, https://github.com/DIRACGrid/DIRAC/pull/4673) able to compute an accurate value of the wallclock time left in an allocation, and we use that value to compute the cpu work afterwards.
Slurm includes a CLI that is very easy to parse, which allows us to execute multiple SP or MP jobs in parallel.
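For illustration, the kind of query such a plugin relies on can be reproduced from inside an allocation (a sketch, not the actual SlurmResourceUsage code; it assumes squeue is available and SLURM_JOB_ID is set):

```python
import os
import subprocess

def slurmWallClockTimeLeft():
    """Return the wallclock seconds left in the current Slurm allocation (sketch)."""
    jobID = os.environ["SLURM_JOB_ID"]
    # %L = time left before the allocation ends, e.g. "1-02:03:04", "02:03:04" or "03:04"
    out = subprocess.check_output(["squeue", "-h", "-j", jobID, "-o", "%L"], text=True).strip()
    days, _, clock = out.partition("-") if "-" in out else ("0", "", out)
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:  # pad missing fields: "MM:SS" -> "00:MM:SS"
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds
```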
We also introduced an option to reduce the duration of the JobAgent cycles (https://github.com/DIRACGrid/Pilot/pull/106), so that the "cpu work left" value remains almost correct between the end of a cycle and the beginning of a new one (no need for further modifications in the JobAgent).
The original issue is then resolved; but we also raised many other points at the same time:
other batch systems in multi-core environments: IIUC, Slurm is gaining ground in HPC while HTCondor (https://github.com/DIRACGrid/DIRAC/pull/4972) is becoming the standard in grid environments (LSF, PBS and SGE seem to be less and less used). Slurm and HTCondor both seem to be based on wallclock time and we can easily extract the "time left" in the allocations, and thus run multiple MP and SP jobs in parallel in these environments. Therefore, I think we can keep those ResourceUsage modules for the moment, and run jobs in parallel based on the wallclock time in Slurm and HTCondor. Of course, if we need to run jobs in parallel in other batch systems such as LSF, PBS or SGE, we will probably need to build a more complex solution (which would probably correspond to the "Short term solution" that I defined above).
cpu power in multi-core environments: from what I've experienced, it seems cpu power (DB12-16) is not adapted to multi-core environments and should be revised accordingly. Indeed, in multi-core environments, we compute the cpu power of 1/n hardware threads: the result may not be representative of the cpu power of the other hardware threads in an allocation, which can lead to jobs running out of time. There is a method to run DB12 on multiple hardware threads in parallel, but it does not actually seem to run in parallel (see the sketch after this list).
cpu time/power/work definitions: I have not started to make changes.
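Regarding the second point, a very rough way to probe a whole multi-core slot is to run the single-core benchmark in N processes at once and look at the per-copy scores under full load (a sketch only; singleCoreDB12 below is a stand-in workload, not the real DB12 code):

```python
import time
from multiprocessing import Pool

def singleCoreDB12(_):
    """Stand-in for one copy of the single-core DB12 benchmark (placeholder workload)."""
    start = time.time()
    total = 0.0
    for i in range(1, 2_000_000):
        total += 1.0 / i
    return 1.0 / (time.time() - start)  # arbitrary "score": higher means faster

def wholeNodeDB12(nProcs):
    """Run nProcs copies concurrently; scores measured under full load are more representative."""
    with Pool(processes=nProcs) as pool:
        scores = pool.map(singleCoreDB12, range(nProcs))
    return sum(scores) / len(scores), scores

if __name__ == "__main__":
    print(wholeNodeDB12(4))
```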
I am closing this issue as we are going to move to DiracX (with CWL at some point). We will leverage Pydantic models, and this will "force" us to clearly define cpu time/power/work and their units.
Here is a potential issue I discovered while running pilots on a SLURM batch system, which could affect batch systems based on wallclock time (when the cpu time left depends on the real time left in seconds). Here is an example:
Variables
- Pilot: 17.3, 17.3
- SLURM Job: 600, 14400 (on SLURM, WallClockTimeLimit and CPUTimeLimit are bound to each other)
- Job: CPUTime = 18700
TimeLeft formula

To get CPUTimeLeft (hepspec06): CPUTimeLeft = timeLeft (seconds) * CPUPower
To get cpu (seconds) from CPUTimeLeft: cpu = CPUTimeLeft / CPUPower
To get wallClockTime from cpu (seconds): wallClockTime = cpu / numberCPUs

JobAgent Execution

The CPUTimeLeft value is stored in pilot.cfg, in /LocalSite/CPUTimeLeft (hepspec06). According to the table, a job having CPUTime = 18700 should pass Cycle 8 but:
Problem
Time has passed between these cycles and has not been taken into account before the job matching. Thus, a job is matched according to the CE TimeLeft (20760) while the real time left is much less than that: the value was added to the cfg around WallClockTime = 9:30.
We compute the minimum wallclock time needed to run a job having CPUTime = 18700, using the TimeLeft formula. Then, to get the wallclock time, we divide cpu by the numberCPUs on the node.

The node would need 45 seconds to execute such a job. Nevertheless, in this case, matching occurred around WallClockTime = 9:30, which gives the job only 30 seconds to complete its execution: 45 > 30. The time left is incorrect and not sufficient.
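Spelling the numbers out (taking 17.3 as the normalisation factor and assuming the node exposes 24 processors, which is consistent with the 600/14400 limits above):

```
cpu = CPUTime / CPUPower = 18700 / 17.3 ≈ 1081 s
wallClockTime = cpu / numberCPUs = 1081 / 24 ≈ 45 s
```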
The same problem may happen if the first job fetched has to be executed in a very limited time: the time left is originally computed in the step before the JobAgent is launched.
Proposed solution
Current code
- Move the CPUTimeLeft computation block just before setting it in the CE and the cfg file: the CPUTimeLeft recorded will be closer to the real value.
- Calculate and record CPUTimeLeft just before requesting a job from the Matcher, to have a more accurate measure: the Matcher will use the CPUTimeLeft computed just before, and should match a better job with respect to the real time left.
- Compute CPUTimeLeft even if fillingMode is disabled: when fillingMode is disabled, the matcher currently uses the CPUTimeLeft computed before the launch of the JobAgent.
- Decrease CPUTimeLeft compared to the reality before requesting a job? To make sure the matched job will be able to run, we could decrease a bit the CPUTimeLeft to take into account the time spent to match the job. How to decrease it? Using the number of processors used and a random number of seconds?