OpenFn / kit

The bits & pieces that make OpenFn work. (diagrammer, cli, compiler, runtime, runtime manager, logger, etc.)
10 stars 9 forks source link

Run Timeout design/reconciliation #414

Open josephjclark opened 11 months ago

josephjclark commented 11 months ago

What is a timeout?

The runtime supports a timeout, which is actually a timeout for each job. If the timeout is set at 1 minute, then each job has 1 minute to run. Not an unreasonable design, biased by the initial runtime implementation which just executed a single expression, not a whole workflow.

Lightning will want to send a timeout to the Worker for an attempt. Is this a timeout for the whole attempt, or per run?

I don't know whether we should support AttemptTimeout and JobTimeout as two distinct options, or whether we should only support an attempt timeout down in the runtime, or what.

This needs a bit of wider thinking.

Design Overview

josephjclark commented 11 months ago

I am about to implement a "quick fix" for the worker, where the engine has a timeout for the whole workflow, and the runtime itself does not timeout.

josephjclark commented 10 months ago

Attempt timeouts are probably only interesting to a) Lightning (to track lost attempts) and b) to kubernetes, to know how long to allow for a graceful shutdown when a pod is closed down. In other words, external services.

I think the run, which is the unit of work, is the thing you want to timeout out. If a job takes 4 minutes to complete, the next downstream job may only have a minute left to run. Why should it be affected by the earlier job being slow?

Here's what I think we need with this:

All timeouts should be set in minutes (not ms)

All timeouts should be prefixed (ie, jobTimeout) so that it's always clear what we mean.

This needs to be agreed on the Lightning side.

josephjclark commented 10 months ago

I've opened this up on Lightning, which I think is a better way to handle lost (but has nothing to do with anything in kit) https://github.com/OpenFn/Lightning/issues/1410

josephjclark commented 10 months ago

An update! So Lightning is also interested in an attempt duration for billing purposes. We may wish to kill long-running attempts (not runs, because the attempt is the unit of value and the split of runs is arbitrary) at some billing tiers.

This is a strategic timeout, rather than an error handling one.

So in addition to the above, the worker or engine WILL have an attempt timeout as a distinct option. I prefer this to live in the Worker, which means the engine needs to expose a cancel function. Which it does anyway, so that all makes sense.

josephjclark commented 10 months ago

Raised #500 and made the attempt timeout a different issue.

josephjclark commented 8 months ago

This is now mostly done.

The actual runtime needs to recognise a runTimeout and a stepTimeout.

But both are super low priority. The CLI doesn't really need either, and lightning manages the runTimeout through the engine.

josephjclark commented 4 months ago

A big input to this:

All this applies to run timeouts, not step timeouts (which don't exist right now but we need to use clear nomenclature

This probably means removing DEFAULT_TIMEOUT_MS in the runtime