Today we retry a failed execution only once on each node, and the job fails only when there are no more nodes to retry on. On smaller networks, we might be failing too early, even for transient issues that could benefit from a retry after a backoff. On larger networks, we might be better off failing earlier rather than retrying on every individual node.
This feature is about giving users more control over how they define job rescheduling policies, independent of the nodes in the network. Users should be able to define:
Attempts: Limits the number of rescheduling attempts that can occur in an interval.
Interval: The duration over which the limit on rescheduling attempts applies.
BaseBackoff: A minimum duration to wait between rescheduling attempts. The backoff strategy determines how much subsequent rescheduling attempts are delayed by.
BackoffStrategy: Determines how the backoff progressively changes on subsequent rescheduling attempts. Valid values are "exponential" and "constant".
MaxBackoff: An upper bound on the backoff.
Unlimited: Allows infinite rescheduling attempts. Only allowed when a backoff is set between rescheduling attempts.
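To make the interaction between these fields concrete, here is a minimal Python sketch of how such a policy might compute delays. Everything in it is an assumption for illustration: the `ReschedulePolicy` class, its snake_case field names, the default values, and the choice of doubling for the "exponential" strategy are hypothetical and are not part of the proposal or of Bacalhau's API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ReschedulePolicy:
    """Hypothetical container for the fields listed above (illustrative only)."""

    attempts: int = 3                               # Attempts
    interval: timedelta = timedelta(minutes=10)     # Interval
    base_backoff: timedelta = timedelta(seconds=5)  # BaseBackoff
    backoff_strategy: str = "exponential"           # BackoffStrategy
    max_backoff: timedelta = timedelta(minutes=1)   # MaxBackoff
    unlimited: bool = False                         # Unlimited

    def __post_init__(self) -> None:
        if self.backoff_strategy not in ("exponential", "constant"):
            raise ValueError(f"unknown backoff strategy: {self.backoff_strategy!r}")
        # Unlimited is only allowed when a backoff is set between attempts.
        if self.unlimited and self.base_backoff <= timedelta(0):
            raise ValueError("unlimited rescheduling requires a positive backoff")

    def delay(self, attempt: int) -> timedelta:
        """Return the wait before rescheduling attempt number `attempt` (1-based)."""
        if attempt < 1:
            raise ValueError("attempt must be >= 1")
        base = self.base_backoff.total_seconds()
        if self.backoff_strategy == "constant":
            seconds = base
        else:
            # Assumption: "exponential" means doubling on every attempt.
            # The exponent is capped so very large attempt numbers stay finite.
            seconds = base * 2 ** min(attempt - 1, 62)
        # MaxBackoff is an upper bound on the delay.
        return timedelta(seconds=min(seconds, self.max_backoff.total_seconds()))

    def may_reschedule(self, attempts_in_interval: int) -> bool:
        """Check the Attempts limit; the caller counts attempts within Interval."""
        return self.unlimited or attempts_in_interval < self.attempts


policy = ReschedulePolicy()
print([policy.delay(n).total_seconds() for n in range(1, 7)])
# → [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```

In this sketch the delay grows from the base backoff until it reaches the upper bound and then stays there, while the attempt limit is checked separately against a count that the caller tracks over the interval.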
Note that this is different than https://github.com/bacalhau-project/bacalhau/issues/3987, as this feature is about how we handle and retry failures, whereas https://github.com/bacalhau-project/bacalhau/issues/3987 is about retrying if no nodes were found.