TheRacetrack / racetrack

An opinionated framework for deploying, managing, and serving application workloads
https://theracetrack.github.io/racetrack/
Apache License 2.0

Retry a job call when a job gets restarted #424

Closed iszulcdeepsense closed 3 months ago

iszulcdeepsense commented 4 months ago

Occasionally, a job's pod terminates unexpectedly because of an out-of-memory (OOM) kill, or restarts because of an upgrade or for other reasons. This can interrupt in-flight job calls, particularly long-running ones, so the user receives an unexpected error response. We should consider implementing a mechanism that retries the request after the pod restarts.
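
To make the idea concrete, here is a rough sketch of what such a retry could look like. It is not Racetrack code: `call_with_retry`, `send`, `max_attempts` and `base_delay` are hypothetical names, and it assumes a pod restart surfaces to the caller as a `ConnectionError`. It also assumes the job call is safe to repeat, which may not hold for every job.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send`, retrying when the connection drops (assumed sign of a pod restart)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError:
            # Out of attempts: let the caller see the original error.
            if attempt == max_attempts:
                raise
            # Exponential backoff gives the restarted pod time to become ready.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A caller would wrap the actual request in a zero-argument function and pass it as `send`. Errors other than `ConnectionError` are deliberately not retried, so a job that fails on its own still reports the failure immediately.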

iszulcdeepsense commented 4 months ago

Loose thoughts: