Open josephjclark opened 11 months ago
I am about to implement a "quick fix" for the worker, where the engine has a timeout for the whole workflow, and the runtime itself does not timeout.
Attempt timeouts are probably only interesting to a) Lightning (to track lost attempts) and b) to kubernetes, to know how long to allow for a graceful shutdown when a pod is closed down. In other words, external services.
I think the run, which is the unit of work, is the thing you want to timeout out. If a job takes 4 minutes to complete, the next downstream job may only have a minute left to run. Why should it be affected by the earlier job being slow?
Here's what I think we need with this:
All timeouts should be set in minutes (not ms)
All timeouts should be prefixed (ie, jobTimeout
) so that it's always clear what we mean.
This needs to be agreed on the Lightning side.
I've opened this up on Lightning, which I think is a better way to handle lost (but has nothing to do with anything in kit) https://github.com/OpenFn/Lightning/issues/1410
An update! So Lightning is also interested in an attempt duration for billing purposes. We may wish to kill long-running attempts (not runs, because the attempt is the unit of value and the split of runs is arbitrary) at some billing tiers.
This is a strategic timeout, rather than an error handling one.
So in addition to the above, the worker or engine WILL have an attempt timeout as a distinct option. I prefer this to live in the Worker, which means the engine needs to expose a cancel function. Which it does anyway, so that all makes sense.
Raised #500 and made the attempt timeout a different issue.
This is now mostly done.
The actual runtime needs to recognise a runTimeout
and a stepTimeout
.
But both are super low priority. The CLI doesn't really need either, and lightning manages the runTimeout
through the engine.
A big input to this:
All this applies to run timeouts, not step timeouts (which don't exist right now but we need to use clear nomenclature
This probably means removing DEFAULT_TIMEOUT_MS
in the runtime
What is a timeout?
The runtime supports a timeout, which is actually a timeout for each job. If the timeout is set at 1 minute, then each job has 1 minute to run. Not an unreasonable design, biased by the initial runtime implementation which just executed a single expression, not a whole workflow.
Lightning will want to send a timeout to the Worker for an attempt. Is this a timeout for the whole attempt, or per run?
I don't know whether we should support AttemptTimeout and JobTimeout as two distinct options, or whether we should only support an attempt timeout down in the runtime, or what.
This needs a bit of wider thinking.
Design Overview