dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Introduce Floating/adjustable margins for MaxPSS #9545

Open todor-ivanov opened 4 years ago

todor-ivanov commented 4 years ago

Impact of the new feature: WMAgent

Is your feature request related to a problem? Please describe. Follow-up from this JIRA ticket [1].

While having the maxPSS value set and watched by PerformanceMonitor.py [2] is a good idea in general, it may lead to resource under-utilization and a decrease in CPU efficiency for some workflows, like the one in the above-mentioned JIRA ticket. For different reasons, those workflows tend to have plenty of jobs that exceed the PSS value estimated and configured at request configuration/submission time. That value is used for the initial job scheduling, is later monitored by this script, and is used to kill jobs sharply the moment they reach exactly maxPSS. This leads to the results mentioned above and sometimes even prevents the workflows from completing at all.

Related PRs [3]

[1] https://its.cern.ch/jira/browse/CMSCOMPPR-11452
[2] https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMRuntime/Monitors/PerformanceMonitor.py
[3] https://github.com/dmwm/WMCore/pull/8204

Describe the solution you'd like: Simply increasing maxPSS by a hard-coded excess, in order to cover the memory needs of only a fraction of the jobs, would again decrease CPU efficiency, since the new value would enter the resource requests and all jobs would be negotiated for that higher PSS value from the very beginning, while most of them would never use the extra resources.
Introducing a soft, adjustable (rather than fixed) margin around the already configured value may fix the problem: jobs would still be negotiated with the configured PSS, but would only be killed by the wrapper once they reach maxPSS + n%. It would also leave room for more sophisticated methods of tuning this value during the workflow lifespan later on. This only works provided the pilot does not kill the job at a sharp PSS threshold in the same way PerformanceMonitor.py does.
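To make the proposal concrete, here is a minimal Python sketch of the difference between the current sharp cutoff and the suggested soft margin. The function names and the `margin_fraction` knob are purely hypothetical and are not part of PerformanceMonitor.py; they only illustrate the kill condition.

```python
# Illustrative sketch only -- not the actual PerformanceMonitor.py logic.
# The `margin_fraction` knob is a hypothetical parameter for the proposed
# soft margin (kill only above maxPSS + n%).

def should_kill_sharp(current_pss_mb, max_pss_mb):
    """Current behaviour: kill as soon as the job's PSS reaches maxPSS."""
    return current_pss_mb >= max_pss_mb


def should_kill_soft(current_pss_mb, max_pss_mb, margin_fraction=0.10):
    """Proposed behaviour: the job is still scheduled with the configured
    maxPSS, but only killed once it exceeds maxPSS by the extra margin."""
    return current_pss_mb >= max_pss_mb * (1.0 + margin_fraction)


if __name__ == "__main__":
    max_pss = 16000      # configured maxPSS, in MB
    observed = 17000     # PSS currently reported for the job, in MB
    print(should_kill_sharp(observed, max_pss))       # True: killed today
    print(should_kill_soft(observed, max_pss, 0.10))  # False: survives within a 10% margin
```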

Describe alternatives you've considered: A static increase of the maxPSS value, searching for a good compromise between the two (negative) factors that affect CPU efficiency - the initially higher resource request vs. the CPU time lost to killed jobs.

Additional context: N/A

klannon commented 4 years ago

I just discussed this briefly with @todor-ivanov. I don't think we should implement this change without some careful thinking. Allowing a running job to exceed the request's maxPSS is dangerous: if too many jobs do this, the hosts running them could run out of memory and we could end up with thrashing machines that need to be rebooted across the globe. A much safer approach is described in [1]. Note that some of the authors of [1] (@btovar, @dthain) are working on CMS WM R&D, so if there is interest in working towards this, we could investigate making a plan to try something like it.

[1] https://ieeexplore.ieee.org/document/8066333, http://ccl.cse.nd.edu/research/papers/Tovar-job-sizing-TPDS2017.pdf

todor-ivanov commented 4 years ago

Hi @klannon, the work in this paper is simply awesome. I have always looked at the problem from the same perspective and have always thought that such optimizations should be based on solid research and a well-defined model. While writing this issue I was trying to define the simple use case we have here, while keeping it open for work in exactly the direction described in this paper. I was deliberately refraining from making any concrete suggestions, so thanks for posting this work here. I honestly think we should follow this path.

amaltaro commented 4 years ago

I replied to the JIRA ticket, but let me make a short comment here too. First, note that job matchmaking is based on the Memory requirement of the job; MaxPSS is simply the metric used by the performance watchdog.

Allowing jobs to go above the request memory (by some magic fraction) isn't a good choice IMO. The best would be to adapt job requirements during the lifetime of a workflow. If we are being hit pretty badly in terms of resource usage, then the best would be to abort the workflow and ask for a resubmission (with the proper Memory requirements).
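Purely as an illustration of the "adapt requirements during the workflow lifetime" idea (and not an existing WMCore interface), one could imagine recomputing the memory request from the peak PSS already observed in completed jobs, e.g. from a high percentile of the distribution rather than its tail:

```python
# Hypothetical sketch: derive an updated memory request from the peak PSS
# values reported by already-completed jobs of the same workflow.  Neither
# the function nor the percentile/headroom choices exist in WMCore; they
# only illustrate adapting the request while the workflow is running.

def updated_memory_request_mb(peak_pss_samples_mb, percentile=0.95, headroom_mb=500):
    """Return a memory request (MB) that covers `percentile` of the observed
    peak PSS values, plus a small safety headroom."""
    if not peak_pss_samples_mb:
        raise ValueError("no completed jobs to learn from yet")
    ordered = sorted(peak_pss_samples_mb)
    idx = int(percentile * (len(ordered) - 1))
    return ordered[idx] + headroom_mb


if __name__ == "__main__":
    samples = [2100, 2250, 2300, 2400, 5200]   # MB; one tail job at ~5 GB
    print(updated_memory_request_mb(samples))  # 2900: sized for the bulk, not the tail
```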

todor-ivanov commented 4 years ago

Putting this comment here just to relate the following two issues, since they attack one and the same problem [1]. Thanks to @amaltaro for pointing that out to me today. FYI @klannon

[1] https://github.com/dmwm/WMCore/issues/8622 https://github.com/dmwm/WMCore/issues/8646

amaltaro commented 4 years ago

@todor-ivanov can you please update this issue with the outcome of the meeting? We discussed it yesterday, but I'm certainly not the best person to write it down

aperezca commented 4 years ago

Alan, there are minutes for the meeting at https://indico.cern.ch/event/892172/, and a document that James started is being discussed and commented on in SI (but I'll let James share it more widely when ready).

In (my) summary, the discussion is not over: we still need to fully evaluate how much memory we are using compared to what we are requesting (jobs and pilots, see also the slides from Marco), and then understand how flexible we can be.

From the technical perspective, we continued the discussion with the condor developers in the following hour. We discussed that we could make RequestMemory an expression (it already is for resizable jobs), so that the memory depends on the number of retries, for example: memory = base memory + delta * n_retries, if retries < n, or something like that. The base memory should be the average peak memory, rather than the tails of the distribution, as it mostly is now.
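For concreteness, here is a small Python sketch of the retry-dependent request described above (memory = base memory + delta * n_retries while the retry count stays below some cap). All numbers are placeholders, and in practice this would live in the HTCondor RequestMemory expression rather than in Python:

```python
# Illustrative only: the retry-dependent memory request discussed above,
# memory = base + delta * n_retries (while retries < n), written as plain
# Python instead of the RequestMemory expression it would really be.
# All values below are placeholders, not agreed-upon defaults.

def request_memory_mb(base_mb, delta_mb, n_retries, max_bumps=3):
    """Grow the memory request with each retry, up to a fixed number of bumps."""
    return base_mb + delta_mb * min(n_retries, max_bumps)


if __name__ == "__main__":
    base, delta = 2000, 500  # MB: average peak memory, plus a per-retry increment
    for retry in range(5):
        print(retry, request_memory_mb(base, delta, retry))
    # 0 -> 2000, 1 -> 2500, 2 -> 3000, 3 -> 3500, 4 -> 3500 (capped)
```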