bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Document different timeout types #4083

Open wdbaruni opened 2 weeks ago

wdbaruni commented 2 weeks ago

ExecutionTimeout is the maximum time a single execution may take; an execution is failed if it takes longer. When an execution fails on one node, even because of ExecutionTimeout, we can retry on another node. ExecutionTimeout is reset when an execution is rescheduled on another node.

TotalTimeout covers the time end to end, starting from when the job was submitted. It includes all executions, retries, and the time the job spent being scheduled.
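To make the difference between the two clocks concrete, here is a minimal sketch. It is an illustrative model only, not Bacalhau's implementation: the `Timeouts` class and `execution_deadline` function are hypothetical names, and times are plain seconds. It assumes an execution is failed at whichever limit is reached first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timeouts:
    # Hypothetical container; field names are illustrative, not Bacalhau's API.
    execution: Optional[float] = None  # seconds allowed per execution attempt
    total: Optional[float] = None      # seconds allowed since job submission


def execution_deadline(
    t: Timeouts, submitted_at: float, execution_started_at: float
) -> Optional[float]:
    """Return the earliest time at which this execution would be failed.

    ExecutionTimeout is measured from the start of the current execution,
    so it resets on every reschedule. TotalTimeout is measured from job
    submission, so it keeps counting across retries and scheduling time.
    Returns None when neither timeout is defined.
    """
    candidates = []
    if t.execution is not None:
        candidates.append(execution_started_at + t.execution)
    if t.total is not None:
        candidates.append(submitted_at + t.total)
    return min(candidates) if candidates else None
```

With `Timeouts(execution=60, total=300)` and a job submitted at time 0, a first execution starting at 10 is bounded by ExecutionTimeout, while a retry starting at 280 is bounded by TotalTimeout instead.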

If a user defines only ExecutionTimeout, queueing is not enabled, and we fail only executions due to timeouts, not the job itself. We fail the job once it has exhausted all of its retries. Users mainly define ExecutionTimeout when they want to preserve resources and avoid allocating compute resources to a job for longer than it should need.

If a user defines only QueueTimeout, queueing is enabled and a job/execution can run indefinitely on a compute node until it completes; Bacalhau won't interrupt it.

If a user defines only TotalTimeout, queueing is disabled. The execution and the job are marked as failed after the timeout, and there is no room to retry on another node.

I expect a combination of QueueTimeout and TotalTimeout to make the most sense, with TotalTimeout higher than QueueTimeout. ExecutionTimeout would only make sense for power users who want more control.
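The single-timeout cases above can be summarized in a small sketch. This is a hypothetical helper, not part of Bacalhau: `describe_timeouts` and its result keys are made up for illustration. The behavior for combinations of timeouts is an assumption extrapolated from the single-timeout cases, and the check that TotalTimeout exceeds QueueTimeout reflects the expectation stated above rather than a confirmed validation rule.

```python
from typing import Dict, Optional


def describe_timeouts(
    execution: Optional[float] = None,
    queue: Optional[float] = None,
    total: Optional[float] = None,
) -> Dict[str, bool]:
    """Summarize the expected behavior for a given set of timeouts (seconds).

    Hypothetical helper for documentation purposes only.
    """
    # Assumed rule: TotalTimeout needs to be higher than QueueTimeout.
    if queue is not None and total is not None and total <= queue:
        raise ValueError("TotalTimeout should be higher than QueueTimeout")
    return {
        # Queueing is enabled only when QueueTimeout is defined.
        "queueing_enabled": queue is not None,
        # A running execution is interrupted by either execution-level
        # or end-to-end timeouts; with neither it may run indefinitely.
        "execution_interrupted_by_timeout": execution is not None
        or total is not None,
        # Only TotalTimeout fails the job itself on expiry.
        "job_failed_by_total_timeout": total is not None,
    }
```

For example, `describe_timeouts(queue=600)` reports queueing enabled with no interruption of running executions, while `describe_timeouts(queue=600, total=300)` raises a `ValueError`.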

Reference: