Increase job priority based on the number of retry

amaltaro commented 5 years ago

Analysing the current situation of the production agents, the number of agents and jobs in the system, and the fact that jobs are mostly starving in the old draining agents (as designed, since the pool obeys job priority), it's clear we need to come up with a mechanism such that jobs don't starve to death (after many retries).

The change I have in mind is very simple: bump the job priority according to the job retry count, something like

jobPriority = jobPriority + (10k * jobRetry)

and the pros are:

we can hopefully drain agents in <= 3 weeks
Ops don't need to ACDC tens/hundreds of workflows (exit code: 71305 - pending for too long)
the collaboration doesn't need to wait for > 20 days to know that many jobs failed and ACDC is needed

The only downside is, of course, the so-called priority inversion (the request priority won't change, only the priority of jobs running beyond their first attempt).

CompOps/PdmV feedback is important and will decide whether we implement it or not.

An open question, though, is whether we apply these priority changes to: a) every single agent, regardless of its status; or b) only agents being drained?
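To make the proposal concrete, here is a minimal sketch of the bump and of the priority inversion it can cause. It is not WMCore code: the `Job` class, `effective_priority`, the `only_when_draining` flag (standing in for option b) and the example priorities are all hypothetical, and "10k" is read as 10000.

```python
from dataclasses import dataclass

RETRY_BUMP = 10_000  # the "10k" from the formula above


@dataclass
class Job:
    # Hypothetical minimal job record, not the WMCore job object
    name: str
    request_priority: int  # priority inherited from the request/workflow
    retry_count: int = 0


def effective_priority(job, agent_draining=False, only_when_draining=False):
    """Return the request priority plus RETRY_BUMP for each retry.

    only_when_draining=True models option (b): bump only on draining agents.
    """
    if only_when_draining and not agent_draining:
        return job.request_priority
    return job.request_priority + RETRY_BUMP * job.retry_count


# Made-up priorities, chosen only to show the inversion
jobs = [
    Job("high_prio_first_attempt", request_priority=100_000, retry_count=0),
    Job("low_prio_second_retry", request_priority=85_000, retry_count=2),
]

# Option (a): bump everywhere. The retried low-priority job now ranks first.
ranked = sorted(jobs, key=effective_priority, reverse=True)
print([job.name for job in ranked])
```

With option (b) and an agent that is not draining, the same call returns the unmodified request priority, so the original ordering is preserved there.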

vlimant commented 5 years ago

is this from https://its.cern.ch/jira/browse/CMSCOMPPR-5709 ?

vlimant commented 5 years ago

FYI: Unified has a mechanism for increasing the priority in a similar manner

https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/completor.py#L233

and this was disabled

https://github.com/CMSCompOps/WmAgentScripts/blob/master/unifiedConfiguration.json#L147

because this led to a lot of low-priority work getting promoted. What you propose is close to the same and might meet the same fate.

amaltaro commented 5 years ago

Thanks for the feedback, Jean-Roch. As I mentioned over Slack, the impact of this change is supposed to be much smaller, because it does not change the workflow priority itself, only the priority of the subset of jobs with retry > 0.

vlimant commented 5 years ago

Is that a blind retry>0 priority increase? Which means we are going to run failing jobs first? If this is for 71305, those jobs are low priority, and by the time they end up on the console, time will have passed (3 retries with a 5-day time-out); by increasing the priority, you'd short-circuit this natural long waiting time for low-priority things that are not finding a slot to run.

If the issue is with draining the agent, it's a totally different question, and I'd rather have a mechanism that early-fails the jobs back to the workpool or into the ACDC server. I am not satisfied with running lower-priority work before higher-priority work to circumvent the issue that work cannot be transferred to other agents; I thought this was something we'd be tackling with containerized software for the agent.

amaltaro commented 5 years ago

> Is that a blind retry>0 priority increase? Which means we are going to run failing jobs first?

If you're asking whether it depends on the exit code, no, it doesn't. If there is a job retry, then there is an increase in job priority.

> If the issue is with draining the agent, it's a totally different question, and I'd rather have a mechanism that early-fails the jobs back to the workpool or into the ACDC server.

Yes, this was my main motivation to create this issue/proposal. I just tried to address it from an angle that doesn't cause Ops overhead.

If we decide that an agent in drain doesn't retry any jobs at all, I'm happy with that solution as well. However, it would increase the burden on the Ops team.
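As a rough illustration of that alternative, the sketch below shows a retry decision in which a draining agent never retries and hands failed jobs straight to ACDC. Everything here is hypothetical: `Action`, `next_action` and `MAX_RETRIES` are illustrative names rather than agent APIs, and the limit of 3 retries is only an assumption taken from the "3 retries" mentioned earlier in this thread.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FAIL_TO_ACDC = "fail_to_acdc"


MAX_RETRIES = 3  # assumption, based on the "3 retries" mentioned above


def next_action(retry_count, agent_draining, max_retries=MAX_RETRIES):
    """Decide what to do with a job that has just failed.

    A draining agent never retries: the job is failed immediately so it
    can be recovered elsewhere instead of waiting in the draining agent.
    """
    if agent_draining:
        return Action.FAIL_TO_ACDC
    if retry_count < max_retries:
        return Action.RETRY
    return Action.FAIL_TO_ACDC


print(next_action(retry_count=0, agent_draining=True))
print(next_action(retry_count=1, agent_draining=False))
```

This keeps job priorities untouched, at the cost of more recovery work landing on Ops, which is the trade-off described above.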

FYI @bbockelm in case you have any thoughts on this subject as well.

aperezca commented 5 years ago

Hi Alan,

My worry about your proposal is that it contributes to increasing the "diversity of jobs" (measured as autoclusters, resource request lists, etc.) that schedds and negotiators have to handle. Creating more priority levels will diversify the load even further, breaking up the job clusters and therefore increasing the burden on the matchmaking phase. On how to solve the draining agents problem: can't we return work to the work queue, or shift load between agents/schedds? That would perhaps be a more effective approach.
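A small sketch of the concern, under the assumption (taken from this comment, not verified against the pool configuration) that jobs with different priorities end up in different clusters. The function name and the example priorities are hypothetical; the code only counts how many distinct priority values exist with and without the retry bump.

```python
RETRY_BUMP = 10_000  # the "10k" from the proposed formula


def distinct_priority_levels(jobs, bump_on_retry):
    """Count distinct priority values among (request_priority, retry_count) pairs."""
    levels = set()
    for request_priority, retry_count in jobs:
        bump = RETRY_BUMP * retry_count if bump_on_retry else 0
        levels.add(request_priority + bump)
    return len(levels)


# Three made-up request priorities, each with jobs at retry 0 through 3
jobs = [(prio, retry) for prio in (85_000, 90_000, 110_000) for retry in range(4)]

print(distinct_priority_levels(jobs, bump_on_retry=False))
print(distinct_priority_levels(jobs, bump_on_retry=True))
```

In this toy case the number of priority levels grows from three to ten once the bump is applied, which is the kind of fragmentation being described.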

Antonio.

amaltaro commented 4 years ago

We should evaluate what's proposed by @vlimant on this Ops PR: https://github.com/CMSCompOps/WmAgentScripts/pull/453

and if it's sound, then get this implemented directly in the agent.

dmwm / WMCore

Increase job priority based on the number of retry #9244