**Open** — mmascher opened this issue 1 year ago
@mmascher thanks for creating this issue.
On top of what has been (is being) discussed in Mattermost, I wanted to clarify that WMAgent defines the resource requirements according to the workflow description, which comes from McM. So we can roughly say that each WMAgent_SubTaskName would have similar resource requirements.
I say "similar" here because, in the end, it all depends on how many events make it into a job, which affects the estimated wallclock time (and disk requirements). Cores and memory should remain untouched though, unless someone edits them with condor_qedit.
Maybe the best short term commitment that we could have is to:

* spot when a job is evicted because the pilot needs to shut down, if possible
* evaluate what the remote wallclock time was
* automatically resubmit that job in condor with a new required wallclock time (making a copy of the original one)
* if needed, apply a multiplying factor (with a ceiling of 48h)
* this would then create a job with a more realistic wallclock time
In the long run, we need to have some sort of feedback loop mechanism to update workflow requirements on the fly.
The issue is that when a job gets evicted because the pilot has to shut down, you get a condor restart, meaning that from the WMAgent perspective the same condor job id goes idle/running/idle again. So you can't adjust a job's runtime on the fly.
Thinking a bit more about this.
> @mmascher thanks for creating this issue.
>
> On top of what has been (is being) discussed in Mattermost, I wanted to clarify that WMAgent defines the resource requirements according to the workflow description, which comes from McM. So we can roughly say that each WMAgent_SubTaskName would have similar resource requirements.
>
> I say "similar" here because, in the end, it all depends on how many events make it into a job, which affects the estimated wallclock time (and disk requirements). Cores and memory should remain untouched though, unless someone edits them with condor_qedit.
>
> Maybe the best short term commitment that we could have is to:
> * spot when a job is evicted because the pilot needs to shut down, if possible
It is possible to do this if you check and fail a job before it gets to the end of the pilot. Something like:
```
(MaxWallTimeMinsRun*60 < (time() - EnteredCurrentStatus)) && ((time()+1800) > GLIDEIN_ToDie)
```
Basically you remove the job if it is going over the walltime AND the pilot does not have much time left.
If you prefer, you can do the equivalent using the WMAgent watchdog or similar. The job has to shut itself down (possibly through PeriodicRemove, as CRAB does) and not trigger an eviction. Evictions are bad bad bad.
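As an illustrative sketch (not a tested recipe), the check above could live in the submit description as a periodic remove expression. Whether `GLIDEIN_ToDie` is visible from the job ad (e.g. copied in under a `MATCH_` prefix) depends on the pool configuration, so that attribute name is an assumption here:

```
# Hypothetical submit-file fragment: the job removes itself once it has
# exceeded its requested walltime AND the pilot has <30 min left to live.
# MATCH_GLIDEIN_ToDie assumes the machine ad value is copied into the job ad.
periodic_remove = (MaxWallTimeMinsRun*60 < (time() - EnteredCurrentStatus)) && \
                  ((time() + 1800) > MATCH_GLIDEIN_ToDie)
```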
> * evaluate what the remote wallclock time was
> * automatically resubmit that job in condor with a new required wallclock time (making a copy of the original one)
> * if needed, apply a multiplying factor (with a ceiling of 48h)
> * this would then create a job with a more realistic wallclock time
I like all of this. Once the job has "failed itself", you resubmit it with more walltime.
> In the long run, we need to have some sort of feedback loop mechanism to update workflow requirements on the fly.
Agreed, we can do this in a second step. Any "automatic on-the-fly" walltime tuning (based on recently finished jobs from the same task) is prone to error and can still benefit from the above mechanism.
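As a rough Python sketch of the resubmission step described above (the function name and the 1.5x factor are illustrative assumptions, not existing WMAgent code):

```python
# Hypothetical helper: given the observed remote wallclock time (seconds)
# of an evicted attempt, compute a more realistic MaxWallTimeMins for the
# resubmitted copy: apply a multiplying factor and cap at the 48h ceiling.
CEILING_MINS = 48 * 60  # 48-hour ceiling, as proposed above

def new_walltime_mins(observed_wallclock_secs: int, factor: float = 1.5) -> int:
    estimate = int(observed_wallclock_secs / 60 * factor)
    return min(estimate, CEILING_MINS)

# A job that ran 20h before eviction gets resubmitted asking for 30h:
print(new_walltime_mins(20 * 3600))  # -> 1800
```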
> So you can't adjust a job's runtime on the fly.
@mmascher Marco, given that the job eviction happens within glideinWMS, would it be possible to add/modify a condor classad when NumShadowStarts is incremented?
If we were to do it from the agent side, such jobs would have to go through the failure process (the job getting out of condor, being processed by a few components, finding out that the requirements need to be modified, being resubmitted to condor).
My objective is to extract data from Condor and properly process it with WMAgent. I hope that with proper error reporting, we can tune walltime estimations, kill workflows when estimations are far off, and improve job efficiency. Am I being too optimistic?
Looking at vocms0254, the situation is as follows (true indicates jobs that went over the walltime; jobs are grouped by NumShadowStarts equal to 1, 2, and greater):
```
[mmascher@vocms0254 public]$ condor_history -const 'NumShadowStarts==1' -af 'INT(RemoteWallClockTime/60)>MaxWallTimeMins' | sort | uniq -c
  50145 false
   8841 true
[mmascher@vocms0254 public]$ condor_history -const 'NumShadowStarts==2' -af 'INT(RemoteWallClockTime/60)>MaxWallTimeMins' | sort | uniq -c
   3112 false
   2332 true
[mmascher@vocms0254 public]$ condor_history -const 'NumShadowStarts>2' -af 'INT(RemoteWallClockTime/60)>MaxWallTimeMins' | sort | uniq -c
    560 false
    594 true
```
Increasing the walltime after the first failure only saves the 1,000 jobs with NumShadowStarts>2, assuming all the evictions are due to walltime and the jobs succeed on the second try.
However, the walltime estimates are inaccurate for around 15,000 jobs, so we need to improve our estimation methods. I hope that job failures and requestor-tuned walltime parameters will help achieve this.
Of course, if we fail too much, we risk having jobs request too much walltime (e.g., 36 hours), which is not ideal. And we still want to allow the occasional nasty job to exceed the walltime since it may have peculiar inputs or slow sites. That's why I favor not killing jobs as soon as they exceed the estimated walltime.
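The bucketed condor_history counts above can be reproduced offline; here is a minimal Python sketch over hypothetical job records (no condor bindings involved, record fields mirror the classad names):

```python
# Count, per NumShadowStarts bucket (1, 2, >2), how many jobs exceeded
# their requested walltime. RemoteWallClockTime is in seconds and
# MaxWallTimeMins in minutes, mirroring the condor_history queries above.
from collections import Counter

def bucket(num_shadow_starts: int) -> str:
    if num_shadow_starts == 1:
        return "1"
    return "2" if num_shadow_starts == 2 else ">2"

def over_walltime_counts(jobs):
    counts = Counter()
    for job in jobs:
        over = int(job["RemoteWallClockTime"] / 60) > job["MaxWallTimeMins"]
        counts[(bucket(job["NumShadowStarts"]), over)] += 1
    return counts

# Toy records: an 8h request, with runtimes of 2h, 20h and 20h respectively.
jobs = [
    {"NumShadowStarts": 1, "RemoteWallClockTime": 2 * 3600, "MaxWallTimeMins": 480},
    {"NumShadowStarts": 2, "RemoteWallClockTime": 20 * 3600, "MaxWallTimeMins": 480},
    {"NumShadowStarts": 3, "RemoteWallClockTime": 20 * 3600, "MaxWallTimeMins": 480},
]
print(over_walltime_counts(jobs))
```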
Let me offer my poor wisdom:
In the end, the only problem I see in this context is avoiding running work that is bound to die when we could have done something useful instead (e.g. sneaking in a lower-priority analysis job or some lower-priority production with a shorter time to run).
@mmascher @amaltaro @todor-ivanov In order to check the impact of this in the system, I made a plot with the number of starts per production job.
It seems 97% finish in one try, and the remaining 3% mostly finish on a second try. NumJobStarts>=3 is negligible (rounded to 0% by Grafana) overall in the last year, according to the plot below:
Considering that, how important (in terms of priority) would you think this is to address?
One idea that came to my mind, if GlideinWMS accepts dynamic values for MaxWallTimeMins, is doing something like:

```
MaxWallTimeMins = ifThenElse(JobDuration =!= undefined && JobDuration/60 > MaxWallTimeMins,
                             MaxWallTimeMins * 3/2,
                             MaxWallTimeMins)
```

So that the job dynamically increases the MaxWallTimeMins value after the first job start try, if the job duration was greater than the maximum allowed walltime (note that JobDuration is in seconds, hence the division by 60, and that ifThenElse needs an explicit else branch). JobDuration requires +WantIOProxy to be added to the job, if I recall correctly, and is populated only after a job was evicted (or finished, but then it would be in the condor history and not in the queue anymore).
**Impact of the bug**

This has been noticed in the CPU efficiency Mattermost channel. The workflow `pdmvserv_Run2022G_JetMET_19Jan2023_230119_090615_3522` had 30% of its CPU time wasted due to jobs that were evicted (MaxWallTimeMins was 8 hours but the average runtime was 20 hours).

**Describe the bug**

Jobs run longer than their declared `MaxWallTimeMins` and get killed if they are scheduled on a pilot that does not have enough time left, causing CPU inefficiencies.

**Expected behavior**

WMAgent should catch a job exceeding the declared walltime and kill it, so that a clear error can be reported and workflows can be killed. Otherwise nobody notices this, since evicted jobs get transparently rescheduled.