
JobAgent TimeLeft computation: definitions, multi-core environments, batch system based on wallclock time... #4544

Open aldbr opened 4 years ago

aldbr commented 4 years ago

Here is a potential issue I discovered while running pilots on a SLURM batch system, which could affect any batch system based on wallclock time (where the CPU time left depends on the real time left in seconds). Here is an example:

Variables

Pilot

SLURM Job

On SLURM, WallClockTimeLimit and CPUTimeLimit are bound to each other

Job

TimeLeft formula

To get CPUTimeLeft (in HS06s):

CPUTimeLeft = (cpuLimit - cpu) * normalizationFactor

To get cpu (seconds) from CPUTimeLeft:

cpu = cpuLimit - CPUTimeLeft / normalizationFactor

To get wallClockTime from cpu (seconds):

wallClockTime = cpu / numberProcs
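
Putting the three formulas together, a minimal sketch in Python (the variable names are only illustrative, not the actual TimeLeft utility code):

```python
# Minimal sketch of the conversions above; names are illustrative only.

def cpu_work_left(cpu_limit, cpu, normalization_factor):
    """CPUTimeLeft in HS06s, from the CPU seconds consumed so far."""
    return (cpu_limit - cpu) * normalization_factor

def cpu_consumed(cpu_limit, cpu_work_left, normalization_factor):
    """cpu (seconds consumed so far), recovered from CPUTimeLeft."""
    return cpu_limit - cpu_work_left / normalization_factor

def wall_clock_time(cpu, number_procs):
    """Elapsed wallclock time, when the CPU limit scales with the core count."""
    return cpu / number_procs
```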

JobAgent Execution

| Cycle | CPUTimeLeft (HS06s) | WallClockTime (h:mm) |
|-------|---------------------|----------------------|
| 0     | 187255              | 2:29                 |
| 2     | 169816              | 3:11                 |
| 3     | 145320              | 4:10                 |
| 4     | 120408              | 5:10                 |
| 5     | 95496               | 6:10                 |
| 6     | 70584               | 7:10                 |
| 7     | 45672               | 8:10                 |
| 8     | 20760               | 9:10                 |

According to the table, a job with CPUTime = 18700 should still be matched at Cycle 8, but:

Problem

Time passes between these cycles and is not taken into account before the job matching. Thus, a job is matched according to the CE TimeLeft (20760) while the real time left is much less than that: the value was added to the cfg at around WallClockTime = 9:30.

We compute the minimum wallclock time left needed to execute a job with CPUTime = 18700, using the TimeLeft formula:

cpu = CPUTime / normalizationFactor
cpu = 18700 / 17.3 
cpu = 1081

Then to get the wall clock time, we divide cpu by numberCPUs on the node:

wallClockTime = 1081 / 24
wallClockTime = 45 sec
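
The same check as a quick Python snippet (values taken from the example above):

```python
normalization_factor = 17.3  # HS06 power of the node, from the example
number_procs = 24

cpu = 18700 / normalization_factor    # ~1081 CPU seconds
wall_clock = cpu / number_procs       # ~45 real seconds
print(round(cpu), round(wall_clock))  # 1081 45
```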

The node would spend 45 seconds executing such a job. Nevertheless, in this case, matching occurred around 9:30, which only gives the job 30 seconds to complete its execution: 45 > 30. The time left is incorrect and not sufficient.

The same problem may happen if the first job fetched has to be executed in a very limited time: the time left is originally computed in the step before the JobAgent is launched.

Proposed solution

Current code

```
CPUTimeLeft computation block
# end of cycle n

# beginning of cycle n+1
if fillingMode:
  Set CE.CPUTime = CPUTimeLeft
...
Match a job
```

Move the CPUTimeLeft computation block just before setting it in the CE and the cfg file

The recorded CPUTimeLeft will be closer to the real value:

```
# beginning of cycle n
if fillingMode:
  CPUTimeLeft computation block
  Set CE.CPUTime = CPUTimeLeft
...
Match a job
```

Calculate and record CPUTimeLeft just before requesting a job from the Matcher, to have a more accurate measure.

The Matcher will use the CPUTimeLeft computed just before, and should match a job better suited to the real time left:

```
# beginning of cycle n
if fillingMode:
  CPUTimeLeft computation block
  Set CE.CPUTime = CPUTimeLeft
Match a job
```

Compute CPUTimeLeft even if fillingMode is disabled

When fillingMode is disabled, the Matcher currently uses the CPUTimeLeft computed before the launch of the JobAgent:

```
# beginning of cycle n
CPUTimeLeft computation block
Set CE.CPUTime = CPUTimeLeft
Match a job
```

Decrease CPUTimeLeft compared to reality before requesting a job?

To make sure the matched job will be able to run, we could decrease the CPUTimeLeft a bit to take into account the time spent matching the job. How should we decrease it? Using the number of processors used and a random number of seconds?

fstagni commented 4 years ago

CC @phicharp as he did a lot of work on this part.

I first started writing specific comments, but I realize we first need some long-awaited definitions, which I would be happy to discuss and then add to https://dirac.readthedocs.io/en/latest/AdministratorGuide/Systems/WorkloadManagement/Jobs/index.html:

It would be great if the definitions above could also be used consistently throughout the code.

Please, don't mix seconds and minutes: always use seconds.

In general, in DIRAC we should never make decisions based on the PUTime, but always on the PUWork (which means always using PUTime and PUPower together).

fstagni commented 4 years ago

And regarding the actually proposed solutions: I agree with everything you are proposing. On your last point:

> Decrease CPUTimeLeft compared to reality before requesting a job?
>
> To make sure the matched job will be able to run, we could decrease the CPUTimeLeft a bit to take into account the time spent matching the job. How should we decrease it? Using the number of processors used and a random number of seconds?

Normally payloads last long enough to make it possible to "ignore" the matching time. This is clearly an imprecise assumption, but normally ~correct. Maybe you just add a fixed amount of time, to make it simple?

atsareg commented 4 years ago

If we are to define the terminology, we should also keep in mind that the job JDL parameter CPUTime is actually CPUWork according to the definitions above. And this is an interface parameter, so we should think twice before touching it. BDII defines maxCPUTime for a queue, which really is CPUTime, because the CPUPower is defined separately (the SI00 parameter). So, if we define the terminology, we should adapt it everywhere consistently.

As for reducing the time left before sending it to the Matcher, I would suggest the following. If the batch system time left utility gets the estimate from the batch system, then use it as is. If not, then the total initial time in the slot is reduced by a factor, let's say 20%, and used for the estimation at each JobAgent cycle. We'd better have a safety margin than kill the job because of a misinterpretation of normalization factors.
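
A minimal sketch of that fallback logic, assuming a hypothetical `getTimeLeft()` plugin call and a 20% margin (none of these names are actual DIRAC API):

```python
SAFETY_FACTOR = 0.8  # keep a 20% margin when we have to estimate ourselves

def time_left_estimate(batch_plugin, slot_start, initial_slot_seconds, now):
    """Prefer the batch system's own estimate; otherwise derive one with a margin."""
    estimate = batch_plugin.getTimeLeft()  # hypothetical: None if unavailable
    if estimate is not None:
        return estimate
    elapsed = now - slot_start
    return initial_slot_seconds * SAFETY_FACTOR - elapsed
```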

fstagni commented 4 years ago

What BDII defines is quite terrible, and I would suggest that at least those parameters be renamed to the correct terminology (if we agree on the terminology itself, first of all). The CPUTime JDL parameter is misleading, but in the past we have, very slowly, renamed JDL parameters too (starting by duplicating them).

atsareg commented 4 years ago

CPUTime in the JDL is in fact very confusing for users. They naturally tend to interpret it as real CPUTime, whatever the CPU is. So, changing its interpretation is not a large problem, except for those users who use it properly: suddenly they would start requiring too much CPU and have jobs waiting forever. If we introduce a CPUWork JDL parameter, I think we still have to interpret CPUTime as the real CPU time of an average processor, e.g. 10 HEPSPEC'06.

atsareg commented 4 years ago

The time estimate is rather tricky in the case of the PoolComputingElement. In this case we normally have to take into account the already running jobs, estimate how much CPU they will still consume, and subtract it from the timeLeft to be sent to the Matcher. But for that we have to rely on the CPU requirement provided by users...

fstagni commented 4 years ago

We can't estimate the CPUWork that jobs already running on PoolCEs will consume, but we can use the value that their JDLs store (in CPUTime -> CPUWork) and simply subtract it from the CPUWorkLeft at matching time. But of course, as you write, this relies on what's in the JDL (user-provided!).
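
As a sketch (hedged: `running_jobs` and the JDL accessor are illustrative, not the PoolComputingElement API):

```python
def pool_ce_cpu_work_left(cpu_work_left, running_jobs):
    """Subtract the CPUWork that already-running payloads declared in their JDLs."""
    for job in running_jobs:
        # Historically named CPUTime in the JDL, but really CPUWork (user-provided!)
        cpu_work_left -= job.jdl.get("CPUTime", 0)
    return max(cpu_work_left, 0)
```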

aldbr commented 4 years ago

> (C)PUTimeLeft is the real time allowed in the queue. Unit: real seconds. In case there's only one payload (so, no filling mode) then this is equivalent to WallClockTime (see next). You also used cpu for real time seconds of cpu used? Please clarify.

If I understand correctly your comment, the current definition of (C)PUTimeLeft in the code would be (C)PUWorkLeft as it is actually in HS06s. And I would define (C)PUTime as the user+system time in seconds. Is that really equivalent to WallClockTime? Doesn't it depend on the nature of the payload and the number of processors available (cpu-bound or I/O-bound task)?

But indeed, in SLURM, the CPUTime limit seems equal to WallClockTime * number of processors, because the allocation limit is defined by the WallClockTime. From what I understand, in other batch systems this limit can be defined in (C)PUTime, right? Can it depend on both (C)PUTime and WallClockTime? In my example, what is defined as cpu is actually what I would call the (C)PUTime elapsed (the user+system elapsed time in seconds).

> WallClockTime meaning the elapsed real time of a single payload. Unit: real seconds.

Why only a single payload?

If I apply these definitions on the TimeLeft formulas above:

Is that right?

> Normally payloads last long enough to make it possible to "ignore" the matching time. This is clearly an imprecise assumption, but normally ~correct. Maybe you just add a fixed amount of time, to make it simple?

Yes, I think in most cases we can ignore the matching time; I assume it would rarely fail. Adding a fixed amount of time, like 3 seconds ("long" matching time) * 25 (biggest normalization factor), would be a solution I guess.
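
That would amount to something like the following sketch (3 seconds and 25 are just the values suggested above):

```python
MATCHING_TIME = 3  # pessimistic matching duration, in real seconds
MAX_POWER = 25     # biggest normalization factor expected in production

def apply_matching_margin(cpu_work_left):
    """Subtract a fixed amount of CPUWork before asking the Matcher for a job,
    so the matched payload still fits after the matching round-trip."""
    return cpu_work_left - MATCHING_TIME * MAX_POWER
```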

> The time estimate is rather tricky in the case of the PoolComputingElement. In this case we normally have to take into account the already running jobs, estimate how much CPU they will still consume, and subtract it from the timeLeft to be sent to the Matcher. But for that we have to rely on the CPU requirement provided by users.

> We can't estimate the CPUWork that jobs already running on PoolCEs will consume, but we can use the value that their JDLs store (in CPUTime -> CPUWork) and simply subtract it from the CPUWorkLeft at matching time. But of course, as you write, this relies on what's in the JDL (user-provided!).

Indeed, it's the simplest solution, but it doesn't work in the case where CPUTime depends on WallClockTime, since the CPUTime is constantly decreasing. In that case, the CPUWorkLeft should correspond to the WallClockTime left, or to CPUWorkLeft / numberProcs, because:

```
# Compute cpuWorkLeft:
cpuWorkLeftBS = compute_CPUWorkLeft()
if cpuTime depends on wallClockTime:
  cpuWorkLeft = cpuWorkLeftBS / numberProcs  # or simply wallClockTimeLeftBS
else:
  if multi-proc resource:
    cpuWorkLeft = cpuWorkLeftBS - jobCPUWork  # subtract the work of the job submitted last cycle
  else:
    cpuWorkLeft = cpuWorkLeftBS

# Fetch a job according to cpuWorkLeft and numberProcsLeft
job = matcher.requestJob()
...
result = submitJob(job)
...
# Record the job's CPUWork, to decrease cpuWorkLeft next cycle before fetching a new job
if result['OK']:
  jobCPUWork = job.CPUWork
else:
  jobCPUWork = 0
```

But:

Another point is that, in the case where CPUTime doesn't depend on WallClockTime, the CPUWork allocated to a job is lost even if the job doesn't consume it. One possible solution would lie in the usage of the Watchdog. I don't have much knowledge about it yet, but couldn't we get the CPUWork effectively consumed by a job from it? If so, the result could be written to a file that the JobAgent could read, to add the value back to CPUWorkLeft.
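
For illustration, the handoff could look something like this (purely hypothetical: the file name and a Watchdog hook reporting consumed work do not exist today):

```python
import json

CONSUMED_FILE = "consumedCPUWork.json"  # hypothetical handoff file

# Watchdog side: periodically record what the payload actually consumed (HS06s).
def report_consumed(job_id, consumed_work):
    with open(CONSUMED_FILE, "w") as f:
        json.dump({"jobID": job_id, "consumed": consumed_work}, f)

# JobAgent side: credit back the unused part of the job's CPUWork allocation.
def credit_unused(cpu_work_left, allocated_work):
    try:
        with open(CONSUMED_FILE) as f:
            consumed = json.load(f)["consumed"]
    except (IOError, ValueError, KeyError):
        return cpu_work_left  # no report: keep the pessimistic value
    return cpu_work_left + max(allocated_work - consumed, 0)
```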

fstagni commented 4 years ago

> If I understand correctly your comment, the current definition of (C)PUTimeLeft in the code would be (C)PUWorkLeft as it is actually in HS06s.

Yes. We have not been very good at naming things, and this is one of the many points where we can apply some clarity; that's why I listed the definitions above.

> And I would define (C)PUTime as the user+system time in seconds. Is that really equivalent to WallClockTime? Doesn't it depend on the nature of the payload and the number of processors available (cpu-bound or I/O-bound task)?

If it's real seconds, IMHO we should not count the number of CPUs, but some batch systems may have different ideas (indeed, see SLURM).

Your TimeLeft formulas look OK to me.

There is, at the same time, something that I still don't grasp in your definitions and calculations for the case of multiprocessor resources and jobs, so I'll go with an example:

There are 3 jobs in the queue:

The Watchdog is not helping in this case I am afraid.

phicharp commented 4 years ago

Dear all,

Jumping a bit late into the game, so it is a bit hard to catch up, but I think all the comments made so far are sensible... I am adding some historical background and personal considerations here...

IMHO, we should first take the definitions Federico recalled and fix all variable names in the code, including JDL parameters, keeping backward compatibility if feasible (using double parameters would do no harm for a while, provided it is really only for a while!)...

We have to deal with:

I would go as far as not specifying HS06 any longer, as the current DIRACBenchmark no longer really scales with the old, deprecated HS06... but please consider that BDII still uses SI2k (a.k.a. SI00) as a benchmark, while HS06 was adopted by WLCG in 2007... FYI the conversion is a plain factor of 250, which is indeed a rounding of the 239 measured at the time, but it made 1 kSI2K = 4 HS06, which was simpler...

B.t.w. we are not even using DB12: in order to rescale DB to match HS06 more or less, I introduced a scaling factor Operations/JobScheduling/CPUNormalisationCorrection that is currently set to 0.65 (it was set in 2016, hence leading to a DB16 unit ;-) )

The original hope was that, in addition to DB12/16, the so-called MJF (Machine and Job Features) would become generalised everywhere, providing an agreed benchmarking of the local CPU (in whatever unit; there was discussion about an HS14, or using DB12, or...) in machineFeatures, as well as the CPU and WallClock time available to the job in JobFeatures. This was in order to avoid everybody having to develop plugins for each and every batch system, and was a way to convey the information for non-batch WNs like VMs or containers (in which case the MJF information would be provided over HTTP).

As more batch systems became available and sites started to use specific features of the existing batch systems, the plugins in DIRAC became essentially unmaintainable, and were really unmaintained... LSF is a nightmare, as is Grid Engine, due to the many parameters that system managers can introduce...

Concerning CPUTime and WallClockTime, again there was never a consensus in WLCG on how to define those, although it sounds rather logical to have the limit in CPUTime < the limit in WallClockTime, since the CPU efficiency (CPUTime/WallClockTime) should always be less than 1 (caveat the bit of multithreading that is unavoidable in all applications).

The proposal (that I made centuries ago) was to have a CPU limit (in real seconds) on which we could evaluate the available CPUWork, and match only jobs that require less work than that limit; and to have a much higher WallClockTime limit, just in order to avoid idle jobs keeping a slot forever, for example 50% larger than the CPUTimeLimit. If the job behaves normally, it hits the CPUTimeLimit before the WallClockTimeLimit (this was not agreed by all experiments, as several had <30% efficiencies :()

In the real world, a lot of sites set CPUTimeLimit == WallClockTimeLimit, which means that under normal circumstances the WallClock limit is hit before the CPU one... One way out would be to assume jobs are well-behaved and have an efficiency of ~80-90%, thus scaling down the CPUWork capacity of the slot. Anyway, due to the fuzziness of the power measurement (related to many factors), one has to keep a safety margin of the order of 20-30% to match jobs (and the CPUWork requirements of jobs are also usually evaluated with a safety margin, except for MCSimulation, where the number of events is derived from the capacity of the slot, but also with a 30% safety margin).

All of the above stands for single processor slots. For multi-processor slots there is not much difference, except that the CPUTime limit should scale (approximately) with the number of processors, as alluded to by Alexandre... Of course WallClockTime remains what it is ;-).
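
As a back-of-the-envelope formula (the 85% efficiency and 25% margin are just example values within the ranges quoted above):

```python
def slot_cpu_work_capacity(wall_clock_limit, power, num_procs=1,
                           efficiency=0.85, margin=0.25):
    """Usable CPUWork of a slot where CPUTimeLimit == WallClockTimeLimit."""
    return wall_clock_limit * num_procs * efficiency * power * (1 - margin)
```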

Another remark with the absurd parameters CPUScalingFactor and CPUNormalisationFactor... The existence of these 2 parameters is historical:

  1. The names come from the fact that nobody at the time was talking about power, but about "normalisation" of time (do you normalise time to measure your electricity consumption?)
  2. IIRC Normalisation was used and Scaling was a second parameter introduced by the author in order to be able to play around with other ways to "normalise" time

I abused this when MJF was tentatively introduced: I used ScalingFactor to report the MJF power when available. When it was not available, it was set == NormalisationFactor, and the trick to know whether MJF was there was that the DB12/16 power was given with a single digit while the MJF one was given with 2 digits... This is awful, but it required minimal code changes... All this should of course be fixed as DB12Power and MJFPower in the code and in the local CS parameters...

B.t.w. it is also an ugly feature of this utility to pass those parameters as if they were CS parameters... It works, but it is rather confusing, as one never really knows what is made persistent in a local .cfg file and what is only in memory... one gets used to it, but it is not very clear...

I hope this is not introducing more confusion than needed to an already extremely confusing item...

Of course there is always the ATLAS/CMS option: don't care, and if jobs fail, they fail... They always assume jobs can run on any slot, even in cases where they use filling mode (but I think they abandoned that)... B.t.w. it is not clear to me whether using filling mode is really very useful if jobs have a reasonably long duration, and due to all the problems observed with the CPUTimeLeft utility, at many sites we are not using filling mode, or just 2 jobs and "tant pis" if they don't fit!

Note that, as Andrei pointed out, users usually do not have a single clue about what to put as a requirement, and they just rely on the Ganga default!

phicharp commented 4 years ago

I see in the code that there are already variables like DB12 and DB12Measured in the local CS, set by the dirac-wms-cpu-normalization script... Already a step forward ;-)

fstagni commented 4 years ago

Thanks for your comments Philippe. A couple of notes:

> IMHO, we should first take the definitions Federico recalled and fix all variable names in the code, including JDL parameters, keeping backward compatibility if feasible (using double parameters would do no harm for a while, provided it is really only for a while!)...

Yes yes yes!

> B.t.w. we are not even using DB12: in order to rescale DB to match HS06 more or less, I introduced a scaling factor Operations/JobScheduling/CPUNormalisationCorrection that is currently set to 0.65 (it was set in 2016, hence leading to a DB16 unit ;-) )

This is in LHCb; for everyone else it's still DB12. Or, we could adjust it directly in the code. But, as mentioned below, DBxx was conceived for a single-processor era.

> B.t.w. it is not clear to me whether using filling mode is really very useful if jobs have a reasonably long duration, and due to all the problems observed with the CPUTimeLeft utility, at many sites we are not using filling mode, or just 2 jobs and "tant pis" if they don't fit!

Even before using filling mode (which is a JobAgent option), there's a "hidden filling mode" in the Pool CE, which is what is used for multiprocessor nodes (and is tricky for many-core architectures, where we "play tetris").

aldbr commented 4 years ago

So, if I try to sum up what we said:

To get the right job to run on a Worker Node, we need:

Time: a time left estimation in real seconds that we can get from the LRMS, but:

Questions:

> It is not clear to me whether using filling mode is really very useful if jobs have a reasonably long duration, and due to all the problems observed with the CPUTimeLeft utility

  • Do we have an idea of the time wasted (allocated but not used) because filling mode is disabled? Of the time a pilot takes to go from "submitted" to "running"?

Considered solutions:

Power: the DIRAC Benchmark (DBxx) aims at reproducing HS06, but:

Questions:

Considered solutions:

Short-term solution: CPU Time computation (in TimeLeft):

```python
factor = 0.8  # safety factor, value to be tuned (?)
dependsOnWallClockTime = False
cpuTimeLeft, wallClockTimeLeft = getCPUWallClockTimeLeft()

if wallClockTimeLeft * factor < cpuTimeLeft:
    # In this case, wallClockTimeLeft may influence cpuTimeLeft and has to be taken into account
    if wallClockTimeLeft * factor >= cpuTimeLeft / nbProcs:
        # Here, we use a small portion of the cpuTime to avoid fetching jobs that could not run within wallClockTimeLeft
        cpuTimeLeft = cpuTimeLeft / nbProcs
    else:
        # Here, wallClockTimeLeft is too low to be ignored, we use it to fetch the jobs
        cpuTimeLeft = wallClockTimeLeft * factor
        dependsOnWallClockTime = True

cpuWorkLeft = cpuTimeLeft * cpuPower
```

Example:

- 1st case: wallClockTimeLeft = 1000000, cpuTimeLeft = 10000

  1000000 * 0.8 < 10000 is False

  In this case, wallClockTimeLeft is so large that it can be ignored; we use the original cpuTimeLeft value.

- 2nd case: wallClockTimeLeft = 10000, cpuTimeLeft = 1000000, nbProcs = 4

  10000 * 0.8 < 1000000 is True
  10000 * 0.8 >= 1000000 / 4 is False
  cpuTimeLeft = 10000 * 0.8 = 8000

  wallClockTimeLeft is too small compared to cpuTimeLeft to be ignored; it will certainly hit 0 first and has to be taken into account.

- 3rd case: wallClockTimeLeft = 10000, cpuTimeLeft = 10000, nbProcs = 4

  10000 * 0.8 < 10000 is True
  10000 * 0.8 >= 10000 / 4 is True
  cpuTimeLeft = 10000 / 4 = 2500

  wallClockTimeLeft is similar to cpuTimeLeft, but we cannot use it directly as we have multiple processors: 2 jobs, requiring 1 processor each and `wallClockTimeLeft * factor` CPU seconds, would fail (8000 + 8000 > 10000).
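
The three cases can be checked with a direct re-implementation of the short-term solution above (a sketch, same illustrative names):

```python
def effective_cpu_time_left(cpuTimeLeft, wallClockTimeLeft, nbProcs, factor=0.8):
    """Re-implementation of the short-term solution, for checking the 3 cases."""
    if wallClockTimeLeft * factor < cpuTimeLeft:
        if wallClockTimeLeft * factor >= cpuTimeLeft / nbProcs:
            return cpuTimeLeft / nbProcs
        return wallClockTimeLeft * factor
    return cpuTimeLeft

print(effective_cpu_time_left(10000, 1000000, 4))  # 1st case: 10000
print(effective_cpu_time_left(1000000, 10000, 4))  # 2nd case: 8000.0
print(effective_cpu_time_left(10000, 10000, 4))    # 3rd case: 2500.0
```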

*CPU Power*:

No change for the moment.

Example:

Let's use your example, @fstagni.
If we don't change anything, the CPUPower computed is 10 for 1 processor, and we keep it like this.
Thus, `CPUPowerLeft = 100 * 10 = 1000`.

* jobSP: run it `(1000 == 1000)`
* jobMP2: doesn't run it `(3000 > 1000)` 
* jobMP4: doesn't run it `(3000 > 1000)`, but in theory it could
* jobSPMP: run it `(500 < 1000)`

*CPU Work computation (in `JobAgent`)*:

```python
cpuWorkLeft, dependsOnWallClockTime = getCPUWorkLeft()
if not dependsOnWallClockTime and nbProcessors > 1:
    # In this case, we use a PoolCE: we cannot trust cpuWorkLeft as some jobs may be running at the same time
    cpuWorkLeft -= jobCPU
```

Any comment on the *considered solutions*? On the short-term solution?

fstagni commented 4 years ago

Good summary Alexandre.

Time:

> Do the plugins return an incorrect time left result because of the specific features of the batch systems? Because of the output that is not easily parseable?

Because of the output that is not easy to parse

> If plugins do not return the right CPUTimeLeft value, do the elastic MC simulation jobs work "by chance"?

Most of the time they work because:

> It is not clear to me whether using filling mode is really very useful if jobs have a reasonably long duration, and due to all the problems observed with the CPUTimeLeft utility

For Single Processor jobs in Single Processor job slots, filling mode has limited utility. For MultiProcessor job slots, it's basically a must from the day we start to "play tetris".

> Do we have an idea of the time wasted (allocated but not used) because filling mode is disabled? Of the time a pilot takes to go from "submitted" to "running"?

We have never recorded these values anywhere.

I am afraid that the only solution is maintaining the current plugins, mostly the SLURM one, as this is what's used on HPCs.

Power:

> Why did we need a CPUNormalisationCorrection?

Not to scale with HS06, but to scale with the LHCb applications (namely Gauss)

> Which main factors can impact the power measurement?

Please ask Andrea Valassi (CERN), as he's the LHCb liaison in the Benchmarking WG of WLCG.

> Improve DBxx to work with multi-processors: would it be hard?

No idea...

Right now we can only ~reliably stick with DB12. There are no other ready-to-use solutions on the horizon, AFAIK. But again, ask Andrea Valassi.

Short term solution

Go ahead with it.

aldbr commented 3 years ago

Update

The original issue was related to the use of Slurm, a batch system based only on wallclock time, in multi-core environments. We proposed a SlurmResourceUsage plugin (https://github.com/DIRACGrid/DIRAC/pull/4482, https://github.com/DIRACGrid/DIRAC/pull/4673) able to compute an accurate value of the wallclock time left in an allocation, and we use that value to compute the CPU work afterwards. Slurm includes a CLI that is very easy to parse, and allows us to execute multiple SP or MP jobs in parallel.
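
For illustration, the kind of query such a plugin can rely on (a sketch in the spirit of the plugin, not its actual code; `squeue`'s `%L` format prints the time left as `[days-]hours:minutes:seconds`):

```python
import os
import subprocess

def slurm_wallclock_left_seconds():
    """Time left in the current Slurm allocation, in seconds (sketch only;
    handling of NOT_SET/UNLIMITED outputs is omitted)."""
    job_id = os.environ["SLURM_JOB_ID"]
    out = subprocess.check_output(
        ["squeue", "--noheader", "--job", job_id, "--format=%L"]
    ).decode().strip()
    days, _, hms = out.rpartition("-")
    parts = [int(p) for p in hms.split(":")]
    while len(parts) < 3:  # pad the "minutes:seconds" and "seconds" forms
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((int(days) if days else 0) * 24 + hours) * 3600 + minutes * 60 + seconds
```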

We also introduced an option to reduce the duration of the JobAgent cycles (https://github.com/DIRACGrid/Pilot/pull/106), so that the "cpu work left" value remains almost correct between the end of one cycle and the beginning of the next (no need for further modifications in the JobAgent).

The original issue is now resolved, but we also raised many other points along the way: