bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Job rescheduling policies #4015

Open wdbaruni opened 1 month ago

wdbaruni commented 1 month ago

Today we retry failed executions only once on each node, and fail only when there are no more nodes to retry on. On smaller networks, we might fail too early, even for transient issues that could benefit from a retry after a backoff. On larger networks, we might be better off failing earlier rather than retrying on every individual node.

This feature is about giving users more control over job rescheduling policies, independent of how many nodes are in the network. Users should be able to define:

Note that this is different from https://github.com/bacalhau-project/bacalhau/issues/3987: this feature is about how we handle and retry failures, whereas https://github.com/bacalhau-project/bacalhau/issues/3987 is about retrying when no nodes are found.
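
To make the idea of a node-independent rescheduling policy more concrete, here is a minimal sketch in Python. It is not Bacalhau's API or implementation: the `ReschedulingPolicy` class, its field names, and the exponential backoff scheme are all assumptions chosen for illustration. The point is only that the retry budget and the delay between attempts are set by the user rather than derived from the number of nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReschedulingPolicy:
    """Hypothetical policy; names and defaults are illustrative only."""

    max_attempts: int = 3        # total executions allowed, regardless of node count
    base_delay_s: float = 1.0    # wait before the first retry
    backoff_factor: float = 2.0  # multiplier applied after each further failure
    max_delay_s: float = 60.0    # upper bound on any single wait

    def next_delay(self, failures: int) -> Optional[float]:
        """Return seconds to wait before the next attempt, or None to give up.

        `failures` is the number of executions that have failed so far.
        """
        if failures <= 0:
            return 0.0  # nothing has failed yet, so run immediately
        if failures >= self.max_attempts:
            return None  # retry budget exhausted: fail the job
        delay = self.base_delay_s * self.backoff_factor ** (failures - 1)
        return min(delay, self.max_delay_s)
```

Under this sketch, a two-node network with `max_attempts=4` would still get four attempts with growing delays between them, while a network with hundreds of nodes would stop after the same four attempts instead of trying every node.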

### Tasks
- [ ] #3988
- [ ] #3989 
- [ ] #3990 
- [ ] #3994