dmwm / WMCore

Core workflow management components for CMS.

WMAgent fails jobs that stay idle too long, causes problems on some HPC #10447

Open hufnagel opened 3 years ago

hufnagel commented 3 years ago

Impact of the bug: Jobs in HPC workflows are never run and are (eventually permanently) failed by the agent because they stay idle for too long.

Describe the bug: Not sure it's really a bug, more like a feature that has undesired consequences on some HPCs. Basically, the agent kills jobs that stay pending for too long and resubmits them. This seems to count against the retry limit. If it happens too often, the job is permanently failed. Some HPC resources (T3_US_ANL, for instance) have very slow resource provisioning, which means the time between job submission and job start can be a few weeks.

How to reproduce it: Assign a workflow to a single site that has very slow resource provisioning.

Expected behavior: Since there is nothing we can do about the speed of the resource provisioning, we need a way to account for it in the agent rather than have it assume that things are broken and permanently fail the jobs.

amaltaro commented 3 years ago

@hufnagel Dirk, this automatic job removal is configurable at the JobStatusLite level. However, note that it applies globally to the agent, not to a specific site and/or workflow.

The configuration change is:

config.JobStatusLite.stateTimeouts = {'Running': 169200, 'Pending': 432000, 'Error': 300}

where Pending maps to jobs idle in the condor queue. The values are in seconds, and the current setting means that jobs pending for more than 5 days will be removed by JobStatusLite.
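
For reference, the conversion to days (just a quick Python sketch; the 30-day figure at the end is only an example of a longer timeout, not an agreed value):

    # Quick sanity check on what the stateTimeouts values mean in days
    # (illustration only, not part of WMCore itself)
    SECONDS_PER_DAY = 24 * 60 * 60        # 86400
    print(169200 / SECONDS_PER_DAY)       # Running timeout: ~2 days
    print(432000 / SECONDS_PER_DAY)       # Pending timeout: 5 days
    print(30 * SECONDS_PER_DAY)           # a 30-day Pending timeout would be 2592000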

So, we can increase that value in the HEPCloud agent and restart JobStatusLite. Please let us know what the value should be and we can do that.

hufnagel commented 3 years ago

Can you set it to 30 days? Do you know which agent to change?

klannon commented 3 years ago

To try to capture some of the Slack discussion: At the moment, this is seen as something requiring a quick fix (i.e. change an agent to have a really long time out and then use that agent just to submit to the relevant HPC site). However, longer term, we will have a collection of resources that might have different latencies, and we'll need a way to tune the policy (e.g. how long to wait before giving up) on a site-by-site basis. If that's not a fair summary, then please @hufnagel or @amaltaro, add additional comments to correct or clarify!

Let me point out one additional wrinkle: I don't really think that this is something we want to configure at either the agent or the site level. It's also a property of the workflow. What I mean is that for one workflow it might make sense to wait patiently in a queue for a month or more, while for another workflow it might make more sense to give up after a few days and retry somewhere else. So, to me, this seems more like a "matchmaking" problem: workflows need to express their tolerance for long in-queue wait times, resources need to advertise their typical queue wait times (e.g. historical 90%-confidence-level high and low values), and then the SI layer needs to try to match jobs from these workflows to the appropriate resources. I'm worried that any other approach (e.g. WMAgent setting timeout values based on some database of site properties) would become operationally unsustainable. In fact, the more I think about it, there might be other parameters we want to consider, like not submitting jobs from workflows that are more than some fraction done (e.g. 90%) to resources where the jobs could wait a month to run.

I wonder what the right layer(s) to implement this would be. For example, would it be enough for the WM system to populate a set of job attributes and let the SI layer handle the matching and, when appropriate, the restarting of jobs? How is it currently implemented? Does the agent actively kill the job, or is a timeout value passed on to SI?
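
Just to make that concrete, here is a rough Python sketch of what "the WM system populates job attributes and the SI layer does the matching" could look like. All of the attribute names below are made up for illustration; they are not existing WMCore or ClassAd attributes:

    # Hypothetical attributes only -- nothing here exists in WMCore or the
    # submission infrastructure today.
    job_ad = {
        "MaxAcceptablePendingTime": 30 * 24 * 3600,  # this workflow tolerates up to 30 days in the queue
        "WorkflowCompletionFraction": 0.92,          # could be used to keep tail jobs away from slow sites
    }

    resource_ad = {
        "ExpectedQueueWaitP90": 21 * 24 * 3600,      # site advertises its historical 90th-percentile wait
    }

    def acceptable_match(job, resource):
        # SI-side policy sketch: match only if the advertised wait fits the
        # job's tolerance (a real implementation would live in ClassAd
        # expressions, not Python).
        return resource["ExpectedQueueWaitP90"] <= job["MaxAcceptablePendingTime"]

    print(acceptable_match(job_ad, resource_ad))     # True for this example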

hufnagel commented 3 years ago

I think the agent actively kills the job. The SI/HTCondor layer is happy with jobs that are pending forever if they don't find any resource they can match.
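
Schematically, the behaviour described in this thread boils down to something like the check below. This is only a sketch of the idea, not the actual JobStatusLite code:

    import time

    # Illustration only -- not the real JobStatusLite implementation.
    stateTimeouts = {"Running": 169200, "Pending": 432000, "Error": 300}

    def should_remove(job_state, state_entry_time, now=None):
        # Remove the job if it has sat in its current state longer than the
        # configured timeout; each removal counts against the retry limit.
        now = now if now is not None else time.time()
        timeout = stateTimeouts.get(job_state)
        return timeout is not None and (now - state_entry_time) > timeout

    # A job that has been Pending for 10 days exceeds the 5-day default:
    print(should_remove("Pending", time.time() - 10 * 86400))   # True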