dmwm / WMCore

Core workflow management components for CMS.

Jobs escape MaxWallTimeMins and get evicted once pilots reach their natural end of life #11524

Open · mmascher opened this issue 1 year ago

mmascher commented 1 year ago

Impact of the bug This has been noticed in the CPU efficiency Mattermost channel. The workflow pdmvserv_Run2022G_JetMET_19Jan2023_230119_090615_3522 had 30% of its CPU time wasted due to jobs that were evicted (MaxWallTimeMins was 8 hours, but the average runtime was 20 hours).

Describe the bug Jobs run longer than their declared MaxWallTimeMins and get killed if they are scheduled on a pilot that does not have enough time left, causing CPU inefficiencies.

Expected behavior WMAgent should catch a job exceeding its declared walltime and kill it, so that a clear error can be reported and workflows can be killed. Otherwise nobody notices this, since evicted jobs get transparently rescheduled.

amaltaro commented 1 year ago

@mmascher thanks for creating this issue.

On top of what has been (is being) discussed in Mattermost, I wanted to clarify that WMAgent defines the resource requirements according to the workflow description, which comes from McM. So we can roughly say that each WMAgent_SubTaskName would have similar resource requirements.

I say "similar" here, because in the end it all depends on how many events make it to a job, affecting the estimated wallclock time (and disk requirements). Cores and memory should remain untouched though, unless someone is doing qedits.

Maybe the best short term commitment that we could have is to:

* spot when a job is evicted because the pilot needs to shutdown, if possible
* evaluate what was the remote wallclock time
* automatically condor resubmit that job with a new wallclock time requirement (making a copy of the original one)
* if needed, apply a multiplying factor (and a ceiling of 48h)
* and this would create a job with a more realistic wallclock time.

In the long run, we need to have some sort of feedback loop mechanism to update workflow requirements on the fly.

mmascher commented 1 year ago

The issue is that when a job gets evicted because the pilot has to shut down, there is a condor restart. Meaning, from the WMAgent perspective, the same condor job id goes idle/running/idle again. So you can't adjust the job's runtime on the fly.

mmascher commented 1 year ago

Thinking a bit more about this.

> @mmascher thanks for creating this issue.
>
> On top of what has been (is being) discussed in Mattermost, I wanted to clarify that WMAgent defines the resource requirements according to the workflow description, which comes from McM. So we can roughly say that each WMAgent_SubTaskName would have similar resource requirements.
>
> I say "similar" here, because in the end it all depends on how many events make it to a job, affecting the estimated wallclock time (and disk requirements). Cores and memory should remain untouched though, unless someone is doing qedits.
>
> Maybe the best short term commitment that we could have is to:
>
> * spot when a job is evicted because the pilot needs to shutdown, if possible

It is possible to do this if you check and fail a job before it gets to the end of the pilot. Something like: `(MaxWallTimeMinsRun*60 < (time() - EnteredCurrentStatus)) && ((time()+1800) > GLIDEIN_ToDie)`. Basically, you remove the job if it is going over the walltime AND the pilot does not have much time left.

If you prefer, you can do the equivalent using the WMAgent watchdog or whatever. The job has to shut itself down (possibly through periodicRemove, like CRAB does) and not trigger an eviction. Evictions are bad, bad, bad.
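For illustration, here is a minimal Python sketch of the same check (not existing WMAgent or CRAB code; the attribute names and the 1800-second margin simply mirror the expression above). The job fails itself only when it is both over its requested walltime and sitting on a pilot that is about to die:

```python
import time

# Illustrative sketch of the self-removal condition above; not WMAgent code.
# MaxWallTimeMinsRun and EnteredCurrentStatus come from the job ad,
# GLIDEIN_ToDie is advertised by the pilot.
PILOT_MARGIN_SECS = 1800  # "the pilot does not have much time left"

def should_self_remove(max_walltime_mins_run, entered_current_status,
                       glidein_to_die, now=None):
    """Return True when the job is over its walltime AND the pilot is about to die."""
    now = time.time() if now is None else now
    over_walltime = max_walltime_mins_run * 60 < (now - entered_current_status)
    pilot_ending = (now + PILOT_MARGIN_SECS) > glidein_to_die
    return over_walltime and pilot_ending
```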

> * evaluate what was the remote wallclock time
> * automatically condor resubmit that job with a new wallclock time requirement (making a copy of the original one)
> * if needed, apply a multiplying factor (and a ceiling of 48h)
> * and this would create a job with a more realistic wallclock time.

I like all of this. Once the job has "failed itself", you resubmit it with more walltime.
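As a rough illustration of the resubmission step in the quoted list, a hypothetical helper could look like the sketch below. Only the 48h ceiling comes from the list above; the 1.5 multiplying factor is an assumption for illustration:

```python
# Hypothetical helper, not existing WMAgent code: compute the walltime to
# request when resubmitting a job that went over its original estimate.
CEILING_MINS = 48 * 60  # ceiling of 48h from the list above
FACTOR = 1.5            # multiplying factor; the exact value is an assumption

def bumped_walltime_mins(observed_remote_wallclock_mins):
    """New MaxWallTimeMins for the resubmitted copy of the job."""
    return min(int(observed_remote_wallclock_mins * FACTOR), CEILING_MINS)

# Example: a job evicted after ~20h of remote wallclock would be resubmitted
# asking for min(1200 * 1.5, 2880) = 1800 minutes (30h).
```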

> In the long run, we need to have some sort of feedback loop mechanism to update workflow requirements on the fly.

Agree, we can do this in a second step. Any "automatic on the fly" walltime tuning (based on jobs that recently finished from the same task) is prone to error and can still benefit from the above mechanism.

amaltaro commented 1 year ago

> So you can't adjust the job's runtime on the fly.

@mmascher Marco, given that the job eviction happens within glideinWMS, would it be possible to add/modify a condor classad when NumShadowStarts is incremented?

If we were to do it from the agent side, such jobs would have to go through the failure process (the job getting out of condor, being processed by a few components, finding out that the requirements need to be modified, and being resubmitted to condor).

mmascher commented 1 year ago

My objective is to extract data from Condor and properly process it with WMAgent. I hope that with proper error reporting, we can tune walltime estimations, kill workflows when estimations are far off, and improve job efficiency. Am I being too optimistic?

Looking at vocms0254, the situation is as follows (true indicates jobs that went over the walltime; jobs are grouped by NumShadowStarts: 1, 2, and more than 2):

[mmascher@vocms0254 public]$ condor_history -const 'NumShadowStarts==1' -af 'INT(RemoteWallClockTime/60)>MaxWallTimeMins'| sort | uniq -c
  50145 false
   8841 true
[mmascher@vocms0254 public]$ condor_history -const 'NumShadowStarts==2' -af 'INT(RemoteWallClockTime/60)>MaxWallTimeMins'| sort | uniq -c
   3112 false
   2332 true
[mmascher@vocms0254 public]$ condor_history -const 'NumShadowStarts>2' -af 'INT(RemoteWallClockTime/60)>MaxWallTimeMins'| sort | uniq -c
    560 false
    594 true

Increasing the walltime after the first failure only saves the 1,000 jobs with NumShadowStarts>2, assuming all the evictions are due to walltime and the jobs succeed on the second try.

However, about 15,000 jobs have inaccurate walltime estimates, so we need to improve our estimation methods. I hope that job failures and requestor-tuned walltime parameters will help achieve this.

Of course, if we fail too much, we risk having jobs request too much walltime (e.g., 36 hours), which is not ideal. And we still want to allow the occasional nasty job to exceed the walltime since it may have peculiar inputs or slow sites. That's why I favor not killing jobs as soon as they exceed the estimated walltime.

belforte commented 1 year ago

let me offer my poor wisdom:

  1. have a MaxWallTime as a sanity check against things getting stuck in reads or lost in CPU loops, especially harmful if the job starts on a fresh pilot and can run astray for 2+ days if not killed (and gets reported as 5066x = bad guy)
  2. have an EstimatedWallTime to schedule within the lifetime of the pilot
  3. we have talked for eons about how to do 2. I think the best strategy is indeed to try based on e.g. pilot time >= EstimatedWallTime*1.5 (or even EstimatedWallTime/2 if we want to make sure we fill all pilot tails and go for max CPU use)
  4. as Marco indicated, if the payload is still there when the pilot is about to terminate, remove it with a reason and exit code
  5. when that code/reason is detected, resubmit with a longer EstimatedWallTime
  6. it would be great if SI could do 3+4+5 internally and we (CRAB+WMA) only do 1. and 2.
  7. as a start, to keep it simple, we can keep indicating MaxWallTime in the submission and start by assuming Estimated = Max/3. And Alan should not make Max too short intentionally

In the end, the only problem which I see in this context is to avoid running stuff bound to die when we could have done useful work instead (e.g. sneak in some lower prio analysis job or some lower prio production which has a shorter time to run).
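A minimal sketch of how points 2, 3 and 7 above could fit together, assuming the pilot's remaining lifetime can be derived from something like GLIDEIN_ToDie; the function names and the exact factors are illustrative only, not an existing WMAgent or glideinWMS API:

```python
import time

# Illustrative only: a matchmaking rule based on points 2, 3 and 7 above.
SAFETY_FACTOR = 1.5  # point 3: pilot lifetime >= EstimatedWallTime * 1.5

def estimated_walltime_mins(max_walltime_mins):
    """Point 7: to start with, assume Estimated = Max / 3."""
    return max_walltime_mins / 3.0

def pilot_can_host(glidein_to_die, max_walltime_mins, now=None):
    """Points 2/3: only schedule the payload if the pilot outlives the estimate."""
    now = time.time() if now is None else now
    remaining_secs = glidein_to_die - now
    needed_secs = estimated_walltime_mins(max_walltime_mins) * 60 * SAFETY_FACTOR
    return remaining_secs >= needed_secs
```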

khurtado commented 1 year ago

@mmascher @amaltaro @todor-ivanov In order to check the impact of this on the system, I made a plot with the number of starts per production job.

It seems 97% are done in one try, and the other 3% are mostly done in a second try. Jobs with NumJobStarts>=3 are negligible (rounded to 0% by Grafana) overall in the last year, according to the plot below:

https://monit-grafana.cern.ch/d/ifXAfjLVk/production-jobs-exit-code-monitoring?orgId=11&from=1663790229795&to=1695326229795&viewPanel=102

Considering that, how important (in terms of priority) do you think this is to address?

One idea that came to my mind, if GlideinWMS accepts dynamic values for MaxWallTimeMins, is doing something like:

MaxWallTimeMins = ifThenElse(JobDuration =!= undefined && JobDuration > MaxWallTimeMins,
                             MaxWallTimeMins * 3/2,
                             MaxWallTimeMins)

So that the job dynamically increases the MaxWallTimeMins value after the first start attempt, if the job duration was greater than the maximum allowed walltime. JobDuration requires +WantIOProxy to be added to the job, if I recall correctly, and is populated only after a job was evicted (or finished, but then it would be in the condor history and not in the queue anymore).
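For clarity, the ClassAd expression above behaves roughly like the Python mimic below (illustrative only; the 3/2 factor and the self-referencing MaxWallTimeMins come from the snippet, and JobDuration being undefined before the first eviction is the assumption stated above):

```python
# Python mimic of the ifThenElse() bump above, purely for illustration.
def bumped_max_walltime_mins(job_duration_mins, max_walltime_mins):
    """Bump the request by 3/2 only if the first run already exceeded it."""
    if job_duration_mins is not None and job_duration_mins > max_walltime_mins:
        return max_walltime_mins * 3 // 2
    return max_walltime_mins

# Example: a job that ran 10h against an 8h request would ask for 12h next time.
assert bumped_max_walltime_mins(600, 480) == 720
```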