Currently, Jobson only attempts a job once, then gives up with "success", "failure", or "aborted". However, some (internal) workflows would benefit from an attempts API that re-runs jobs automatically when they fail. The support would need to:
Rerun a job n times (e.g. "Have at least 3 attempts, one directly after another"): This is useful for flaky jobs that sometimes fail because (e.g.) a server was down intermittently
Rerun a job when manually prompted (e.g. "Have n attempts, then wait for a client to explicitly tell you to attempt again"): This is useful for jobs that rely on 3rd-party services that can be down for longer periods of time (e.g. Hadoop clusters with maintenance windows). A rough sketch of the first (automatic) mode is shown below.
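To make the automatic-retry mode concrete, here's a minimal sketch of an execution loop that retries only on failure and keeps a per-attempt history. None of these names (AttemptRunner, Outcome, runWithAttempts, maxAttempts) exist in Jobson today; they're purely illustrative of the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch only: not Jobson code, just the retry behaviour described above.
public class AttemptRunner {

    public enum Outcome { SUCCESS, FAILURE, ABORTED }

    /**
     * Runs a job up to maxAttempts times, stopping early on the first
     * non-failure outcome, and records every attempt so clients could
     * later inspect the attempt-by-attempt history.
     */
    public static List<Outcome> runWithAttempts(Supplier<Outcome> runJobOnce, int maxAttempts) {
        final List<Outcome> attempts = new ArrayList<>();
        for (int i = 0; i < maxAttempts; i++) {
            final Outcome outcome = runJobOnce.get();
            attempts.add(outcome);
            if (outcome != Outcome.FAILURE) {
                break;  // only failures are retried automatically
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Simulated flaky job: fails twice, then succeeds on the third attempt.
        final int[] calls = {0};
        final List<Outcome> history =
                runWithAttempts(() -> ++calls[0] < 3 ? Outcome.FAILURE : Outcome.SUCCESS, 3);
        System.out.println(history);  // [FAILURE, FAILURE, SUCCESS]
    }
}
```

The manual mode would be the same loop, except that after the automatic attempts are exhausted the job parks in a "failed but re-attemptable" state until a client explicitly asks for another attempt.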
Ideally, this feature can be integrated into the existing API with no breaking changes. Maybe not, though, because downstream users are going to start seeing jobs whose outputs change (e.g. they request stdout from attempt #1, which is different from the stdout from attempt #2). I'd need to ensure there are no caching or immutability expectations in downstream clients.
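One non-breaking shape for this would be to keep the existing per-job output endpoints resolving to the latest attempt and add attempt-scoped paths alongside them. The sketch below is purely illustrative: the base URL, the job ID, and the /attempts/... path are assumptions, not part of Jobson's current API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client interaction: the /attempts/... path does not exist in
// Jobson; it is only one possible way to expose per-attempt outputs.
public class AttemptsClientSketch {
    public static void main(String[] args) throws Exception {
        final HttpClient client = HttpClient.newHttpClient();
        final String base = "http://localhost:8080/v1/jobs/example-job-id";  // assumed host/ID

        // Existing behaviour: clients keep reading the job's stdout as before,
        // which would now resolve to the *latest* attempt's stdout.
        final HttpRequest latestStdout =
                HttpRequest.newBuilder(URI.create(base + "/stdout")).GET().build();

        // Additive extension: clients that care about a specific attempt
        // could ask for it explicitly (path is hypothetical).
        final HttpRequest secondAttemptStdout =
                HttpRequest.newBuilder(URI.create(base + "/attempts/2/stdout")).GET().build();

        for (HttpRequest req : new HttpRequest[]{latestStdout, secondAttemptStdout}) {
            final HttpResponse<String> resp =
                    client.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(req.uri() + " -> HTTP " + resp.statusCode());
        }
    }
}
```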
This feature would reduce the number of resubmissions made in prod (e.g. when a cluster is down) and enable developers/end-users to just rerun something under the existing job ID (rather than having to create a whole new job).