dmwm / WMCore

Core workflow management components for CMS.

Revisit job resizing based on the unified logic #9872

Open amaltaro opened 4 years ago

amaltaro commented 4 years ago

Impact of the new feature: WMAgent (and Unified)

Is your feature request related to a problem? Please describe.
We would like to move the job resizing logic to a single place, leaving Unified with only one action: enabling job resize once jobs are sitting pending in Condor.

Describe the solution you'd like
Here is where the job resizing is implemented in Unified: https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/equalizor.py#L455-L492 In plain English, something like:

While the logic in WMAgent is here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L558-L570 In plain English, something like:

So, we need to come up with:

Once that is live in WMAgent, Unified needs to be modified to simply enable job resizing, without modifying anything else in the job ClassAds.
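For illustration only, here is a minimal sketch of what submit-time ClassAds for a resizable job could look like via the htcondor Python bindings. The attribute names `MinCores`, `MaxCores`, and `WMCore_ResizeJob` are hypothetical, not the attributes actually used by SimpleCondorPlugin or Unified:

```python
# Hypothetical sketch of submit-time ClassAds for a resizable job.
# The "+MinCores"/"+MaxCores"/"+WMCore_ResizeJob" names are illustrative,
# NOT the actual attributes used by SimpleCondorPlugin or Unified.
import htcondor

def resizable_submit(min_cores, max_cores, mem_per_core_mb):
    """Build an htcondor.Submit whose memory request scales with cores."""
    return htcondor.Submit({
        "executable": "job_wrapper.sh",           # placeholder payload
        "request_cpus": str(min_cores),           # start from the lower bound
        # memory follows whatever number of cores the job ends up with
        "request_memory": f"{mem_per_core_mb} * RequestCpus",
        "+MinCores": str(min_cores),              # custom ad: lower bound
        "+MaxCores": str(max_cores),              # custom ad: upper bound
        "+WMCore_ResizeJob": "True",              # illustrative enable flag
    })
```

With something like this in place, Unified's only remaining action would be flipping the single enable flag, as described above.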

Describe alternatives you've considered
Do not change anything and keep things as they are.

Additional context
This was briefly discussed in today's CompOps meeting.

FYI @sharad1126 @haozturk @todor-ivanov @khurtado @aperezca

hufnagel commented 3 years ago

Given that we will be getting more resources with whole-node pilots (mostly HPC, but they are also used at some CMS sites), we should definitely allow jobs to resize up. I am not sure about the maximum, though. A fixed value is certainly easier to implement, but the maximum you really want is workflow-type and CMSSW dependent.

drkovalskyi commented 3 years ago

I think it would be good to clarify the use cases. Unless we implement efficiency checks, we can easily end up making things look nice from a resource-utilization point of view while getting worse throughput in terms of produced events. This is especially critical for StepChain requests, which may have acceptable efficiency running with 4 cores or fewer, for example, and certainly not with 15 cores.
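As a toy illustration of this concern (Amdahl's law with an assumed serial fraction, not measured CMSSW numbers), efficiency can degrade quickly with core count:

```python
# Toy Amdahl's-law model (assumed 10% serial fraction, not measured CMSSW
# data): efficiency drops with cores, so throughput per core falls even
# as the job "fills" a bigger slot.
def cpu_efficiency(cores, serial_fraction=0.10):
    """Parallel efficiency = speedup / cores, per Amdahl's law."""
    speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)
    return speedup / cores

for n in (1, 2, 4, 8, 15):
    print(f"{n:2d} cores: efficiency {cpu_efficiency(n):.2f}")
# With a 10% serial fraction, 4 cores stay near 0.77 efficiency,
# while 15 cores drop to ~0.42: acceptable at 4, clearly not at 15.
```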

klannon commented 3 years ago

I don't see how the WM system could automatically figure out what range of cores produces acceptable efficiency on any reasonable time frame (i.e. implementable this year). Therefore, I think this range would have to be input with the request by PPD/PdmV.

dpiparo commented 3 years ago

A bit late, but I agree with @klannon. These efficiencies are measured by PdmV through the process they call "multivalidation". If the list of (number of cores, CPU efficiency) pairs were compiled by them and added to the request description, could this be of help?
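If such a list were attached to the request, it could be as small as the snippet below; the field name `CoreEfficiencies` is purely illustrative, not an existing ReqMgr2 key:

```python
# Hypothetical request field carrying multivalidation results;
# "CoreEfficiencies" is an illustrative name, not an actual ReqMgr2 key.
request_snippet = {
    "RequestName": "pdmv_RunIISummer20UL16GEN_00001",  # made-up example
    "CoreEfficiencies": [(1, 0.92), (2, 0.88), (4, 0.81), (8, 0.66)],
}

def acceptable_cores(pairs, threshold=0.70):
    """Core counts whose measured CPU efficiency clears the threshold."""
    return [cores for cores, eff in pairs if eff >= threshold]

print(acceptable_cores(request_snippet["CoreEfficiencies"]))  # [1, 2, 4]
```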

todor-ivanov commented 3 years ago

@dpiparo this is actually of great interest to us: to have more information about how the actual efficiency of the payload jobs (cmsRun processes) is measured. We are trying to prepare some metrics on our side, for the workflow management system, but in the end we also need to know how it is estimated for the actual consumer of the resources itself.

dpiparo commented 3 years ago

I cannot add him to this exchange, but the reference person would be Jordan Martins. PdmV and SI are collaborating to improve the "multivalidation" (i.e. figuring out the dependency of the efficiency on the number of cores) so that it can be treated as a regular production job (right now it is made of regular batch jobs at CERN).

haozturk commented 3 years ago

Let me add @jordan-martins to the loop.

jordan-martins commented 3 years ago

Hi guys, concerning what we do in the multivalidation job, you may check [1]. Let me try to summarize the story:

  1. we do a cmsRun like this: cmsRun -e -j XXX-RunIISummer20UL16GEN-00001_report.xml XXX-RunIISummer20UL16GEN-00001_1_cfg.py. The idea is that we use the XML to store several monitoring variables available from the CMSSW framework.
  2. as a first step, we deploy 1 validation job intended to estimate the CPU efficiency of 1 core and 2GB of memory (this memory is actually just what we request at the queue). From [1], you may see exactly which variables we use to measure the CPU efficiency.
  3. if the CPU efficiency is > 70%, we deploy 3 more validation jobs: 1 for 2 cores (4GB requested to the job), 1 for 4 cores (8GB requested to the job) and 1 for 8 cores (16GB requested to the job). We change the cores by re-running the driver and choosing the cores in a one-by-one relation to the cfg.py file, like XXX_2_cfg.py, XXX_4_cfg.py and XXX_8_cfg.py. Then, we take the highest number of cores with CPU efficiency >= 70% (a condensed sketch of this selection follows the list).
  4. while sending the dict through ReqMgr, we send the actual memory size corrected by a rule table, to avoid requesting less memory than what the job will need during the 8hr condor job.
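A condensed sketch of the selection in steps 2-3, assuming the usual CMSSW framework job report Timing metrics (TotalJobCPU, TotalJobTime); the authoritative variables and thresholds live in [1]:

```python
# Hedged sketch of the multivalidation core scan (steps 2-3 above).
# Metric names TotalJobCPU/TotalJobTime are assumed from the CMSSW
# framework job report; the authoritative logic is in [1].
import xml.etree.ElementTree as ET

def cpu_efficiency(report_xml, cores):
    """CPU efficiency = total CPU time / (wall time * cores)."""
    root = ET.parse(report_xml).getroot()
    wanted = {"TotalJobCPU", "TotalJobTime"}
    timing = {m.get("Name"): float(m.get("Value"))
              for m in root.iter("Metric") if m.get("Name") in wanted}
    return timing["TotalJobCPU"] / (timing["TotalJobTime"] * cores)

def best_core_count(reports, threshold=0.70):
    """Highest core count whose validation job clears the threshold.

    `reports` maps core count -> job report path, e.g.
    {1: "XXX_1_report.xml", 2: "XXX_2_report.xml", ...}
    """
    passing = [cores for cores, xml_path in sorted(reports.items())
               if cpu_efficiency(xml_path, cores) >= threshold]
    return max(passing) if passing else None
```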

This is also in the best interest of PdmV since, roughly speaking, we estimate that the whole MC production has ~50% overall efficiency in terms of the requested memory. With the multivalidation procedure, we try to cover the first "step" (the root ones). But once the GEN step is finished, the others get the resized parameters from Unified, provided resizing is set to true in the campaign configuration.

Please let us know if there is more we can clarify regarding this.

Thanks, Jordan

[1] https://github.com/cms-PdmV/cmsPdmV/blob/master/mcm/automatic_scripts/validation/validation_control.py#L761-L836

todor-ivanov commented 3 years ago

Hi @jordan-martins. That is actually quite useful, thanks a lot. A few more things I need to understand:

  1. while sending the dict through ReqMgr, we send the actual memory size corrected by a rule table, to avoid requesting less memory than what the job will need during the 8hr condor job.

First, what is this table? On what basis is it estimated? What I understand is that these are statically fixed correction factors which you apply on top of whatever you have already established by the process explained above. Second, you mention the 8hr condor job, but you did not mention how/if you are trying to target that wall time exactly. I know the basic job splitting is meant to happen inside WMCore, but on your side, are you doing anything in particular to estimate the eventual cmsRun run time in advance?

But once the GEN step is finished, the others get the resized parameters from Unified, provided resizing is set to true in the campaign configuration.

Are you telling me that if we want to measure the efficiency of this multivalidation algorithm itself (which to me seems quite well established), we should aggregate only over GEN jobs, because for the rest of the steps the prescribed values are already obscured by the Unified algorithm used to re-calculate them?

jordan-martins commented 3 years ago

Hi @todor-ivanov ,

you may find our current table in [1] (note that the factor is not the same everywhere: it differs from one memory region to another; check the spreadsheet equations). Basically, while doing the validation jobs, we allow the users (namely the MC contacts) a time/event window of opportunity: a +/- 25% mismatch of the time/event, mainly because of differences while running on lxplus and so on.

There is a game in the validation job that one must be alert to. If we can only do an 8hr condor job, the main players are time/event and filter efficiency. Filter efficiency is what we measure once we request some kind of Monte Carlo generation: within the fragment itself we set what we call the phase space. If the generated event is within that space, it counts as a good event; if not, we discard it and start another event generation. This is the reason why we have the table [1]. Namely, this 25% gap means we cannot give a more precise memory value, because when we estimate the number of events that would fit into the 8hr, it is an approximation (more like an expected value derived from the input variables). And, as far as I am aware, the memory usage of an 8hr condor job is a function of the number of events generated, e.g. more events means more memory. We understand that this may yet be assessed better, but it has been working quite well so far.

About the wall time... if for some reason the user makes a bad estimation of the time/event, then McM would expect a number of events that does not fit in the 8hr condor job. The user then gets a message saying that the job was terminated but not completed and that, therefore, a better time/event needs to be provided (more or less the same happens if the filter efficiency is badly input in McM).

Our number of events (more or less what you mean by the splitting) is computed in [2].
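To make the arithmetic above concrete, here is a back-of-the-envelope sketch, assuming time/event is quoted per attempted event (the real McM calculation is in [2]):

```python
# Back-of-the-envelope for how many events fit in an 8hr condor slot,
# assuming time/event is per *attempted* event; the real McM logic is in [2].
def events_in_validation_job(time_per_event, filter_eff,
                             wall_seconds=8 * 3600, margin=0.25):
    """Expected range of accepted events within the wall-time budget.

    The +/- 25% uncertainty on the user's time/event turns the estimate
    into a range rather than a single number, which is why the memory
    rule table keeps headroom.
    """
    attempts = wall_seconds / time_per_event   # events tried in 8 hours
    accepted = attempts * filter_eff           # events that pass the filter
    return accepted * (1 - margin), accepted * (1 + margin)

low, high = events_in_validation_job(time_per_event=30.0, filter_eff=0.4)
print(f"expect roughly {low:.0f}-{high:.0f} accepted events")  # ~288-480
```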

About the last comment... my point is that we control only the GEN part, the very first step. When the workflow goes in, it starts the GEN with the parameters we measured in multivalidation. Let me give you a regular example of what happens in the UL campaigns. We have the following steps, connected through flows and brought to life via chains: GEN > SIM > DIGIPREMIX > HLT > RECO > MINIAOD > NANOAOD. PdmV only controls memory and cores for the GEN step. The others, if the campaign defines resize as true (which is most of the cases), will do whatever is listed in the description of this issue. I guess the main question is how Unified can improve the management of the memory. I read a 60% number; this seems too high. Remember that once the GEN step is done, there is no more filter efficiency: whatever number of events is passed from there remains constant through the chain.

I think (?) there are some parameters that you take from us that define more or less how you do the splitting after the GEN step. From our side, they are defined in the flows; e.g. we probably send you both the time/event and the size/event, which Unified takes into account to do the splitting. If those parameters are defined wrongly, it leads to unmerged-file problems (see the sketch below).
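A hedged sketch of that dependence; the parameter names are illustrative, not actual WMCore fields, and the real splitting logic lives in WMCore:

```python
# Illustrative only: how wrong time/event or size/event estimates propagate
# into splitting and output sizing; these are not actual WMCore fields.
def split_job(target_wall_s, time_per_event, size_per_event_kb):
    """Events per job from the wall-time target, plus expected output size."""
    events = int(target_wall_s / time_per_event)
    output_kb = events * size_per_event_kb
    return events, output_kb

# A 2x underestimated time/event doubles the real wall time of each job;
# an underestimated size/event yields more, smaller unmerged files.
print(split_job(8 * 3600, time_per_event=15.0, size_per_event_kb=200))
```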

The above brings me to... naively, thinking out loud (not sure if this holds): one could think that, for a given CMSSW release, the memory of a particular step would not vary much when looked at as a normalized number. I mean the following: if one takes 10_6_20, for example, and checks the average memory increment per event for a given step (SIM, DIGIPREMIX, ...), this would be the normalization number needed to estimate the required memory without relying on any raw estimation. Of course, 11_2_X could bring different numbers and so on... but instead of asking for a resize, one could set a new variable translating the pair [release, step] into this normalized number.
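That [release, step] idea could be as simple as a lookup table; the numbers below are invented purely for illustration:

```python
# Invented numbers: a per-(release, step) memory-per-event normalization,
# as proposed above, instead of resizing blindly.
MEM_PER_EVENT_MB = {
    ("CMSSW_10_6_20", "SIM"):        0.8,   # illustrative values only
    ("CMSSW_10_6_20", "DIGIPREMIX"): 1.1,
    ("CMSSW_11_2_0",  "SIM"):        0.9,
}

def estimate_memory(release, step, events, base_mb=2000):
    """Base footprint plus a per-event increment for this release/step."""
    return base_mb + events * MEM_PER_EVENT_MB[(release, step)]

print(estimate_memory("CMSSW_10_6_20", "SIM", events=500))  # 2400.0 MB
```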

Not sure if I made it more confusing... :s Best, Jordan

[1] https://docs.google.com/spreadsheets/d/1ej_q7a-Mv2a9BZN8dQzvw5qHKEqftHJKfZXgFwL1J9M/edit#gid=940118300
[2] https://github.com/cms-PdmV/cmsPdmV/blob/master/mcm/json_layer/request.py#L2530-L2578

klannon commented 3 years ago

I'm worried this discussion is getting too convoluted. I think the minimum piece of information that needs to be communicated from PdmV is just the minimum and maximum number of cores that a request can run with while still obtaining an acceptable CPU efficiency. If I follow correctly, the efficiency over a range of cores is already checked in multivalidation, so all that remains is to decide what the threshold for "acceptable" efficiency is, and then pass the minimum and maximum number of cores in as part of the request so that the WM system can make use of them.

Regarding "extra memory" is that just some padding that gets provided to try to provide a safety margin for the job? I'm confused about what we're actually trying to accomplish with that.

amaltaro commented 3 years ago

There were questions about this topic in today's CompOps meeting, and I think we should try to converge on the WMCore + Unified implementation and add it to WMAgent, so that we can start using this feature again in the system, even if it is not yet in an optimal state.

Then in the future, we can do some R&D and add some intelligence and feedback loop to this resizeable job feature (in a different GH issue).

Please let me know what your thoughts are.

amaltaro commented 3 years ago

Defined as a low-medium priority, according to Hasan's input on the Q3 planning