Run Timeout design/reconciliation

josephjclark commented 11 months ago

What is a timeout?

The runtime supports a timeout, which is actually a timeout for each job. If the timeout is set at 1 minute, then each job has 1 minute to run. Not an unreasonable design, biased by the initial runtime implementation which just executed a single expression, not a whole workflow.

Lightning will want to send a timeout to the Worker for an attempt. Is this a timeout for the whole attempt, or per run?

I don't know whether we should support AttemptTimeout and JobTimeout as two distinct options, or whether we should only support an attempt timeout down in the runtime, or what.

This needs a bit of wider thinking.

Design Overview

Run timeout is tracked in runtime AND engine
Attempt timeout is tracked in worker

josephjclark commented 11 months ago

I am about to implement a "quick fix" for the worker, where the engine has a timeout for the whole workflow, and the runtime itself does not timeout.

josephjclark commented 10 months ago

Attempt timeouts are probably only interesting to a) Lightning (to track lost attempts) and b) to kubernetes, to know how long to allow for a graceful shutdown when a pod is closed down. In other words, external services.

I think the run, which is the unit of work, is the thing you want to timeout out. If a job takes 4 minutes to complete, the next downstream job may only have a minute left to run. Why should it be affected by the earlier job being slow?

Here's what I think we need with this:

The runtime continues to have a timeout per job/run.
The worker should double-buffer the same timeout per run. This should be managed from the main thread (with maybe a 10% grace period so that the runtime fails first), and allows us to trap things like infinite loops, or the thread crashing silently (although I don't think that can happen)
The worker accepts a global run-timeout flag. Attempts can override this via an option on the attempt object.
The engine or worker has no timeout for the whole attempt/workflow

All timeouts should be set in minutes (not ms)

All timeouts should be prefixed (ie, jobTimeout) so that it's always clear what we mean.

This needs to be agreed on the Lightning side.

josephjclark commented 10 months ago

I've opened this up on Lightning, which I think is a better way to handle lost (but has nothing to do with anything in kit) https://github.com/OpenFn/Lightning/issues/1410

josephjclark commented 10 months ago

An update! So Lightning is also interested in an attempt duration for billing purposes. We may wish to kill long-running attempts (not runs, because the attempt is the unit of value and the split of runs is arbitrary) at some billing tiers.

This is a strategic timeout, rather than an error handling one.

So in addition to the above, the worker or engine WILL have an attempt timeout as a distinct option. I prefer this to live in the Worker, which means the engine needs to expose a cancel function. Which it does anyway, so that all makes sense.

josephjclark commented 10 months ago

Raised #500 and made the attempt timeout a different issue.

josephjclark commented 8 months ago

This is now mostly done.

The actual runtime needs to recognise a runTimeout and a stepTimeout.

But both are super low priority. The CLI doesn't really need either, and lightning manages the runTimeout through the engine.

josephjclark commented 4 months ago

A big input to this:

The runtime should NEVER set a default timeout
If a plan does NOT have a timeout, try to load a default one from the run options objects
If there's no timeout, it's fine! We just run forever
The CLI and Worker should both set a default timeout on the run options, but of course a plan can override this.

All this applies to run timeouts, not step timeouts (which don't exist right now but we need to use clear nomenclature

This probably means removing DEFAULT_TIMEOUT_MS in the runtime

OpenFn / kit

Run Timeout design/reconciliation #414

Design Overview