Occasionally, a job's pod might terminate unexpectedly due to an out-of-memory kill, or it might restart because of an upgrade or other reasons.
This can disrupt ongoing job calls, particularly long-running ones, resulting in unexpected error responses for the user.
We should consider implementing a mechanism to retry the request after a pod restarts.
However, we should be cautious about naively repeating failed requests,
as this could lead to an endless loop of requests with no way to stop it.
Automatic retries are risky, and there are many things that can get restarted, including: Kubernetes nodes, ingress, Racetrack's Pub, and Job pods.
Consider a message queue like RabbitMQ or Kafka to keep tasks truly persistent and to survive a restart.
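As a rough sketch of that idea, a job call could be published as a persistent message to a durable queue, so it survives a broker or pod restart. This example uses the pika client for RabbitMQ; the queue name and message shape are assumptions for illustration, not an agreed design:

```python
import json
import pika

# Broker address and queue name are hypothetical placeholders.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='rabbitmq'))
channel = connection.channel()
# A durable queue survives broker restarts.
channel.queue_declare(queue='job_calls', durable=True)

def enqueue_job_call(job_name: str, payload: dict) -> None:
    """Publish a job call as a persistent message so it outlives restarts."""
    channel.basic_publish(
        exchange='',
        routing_key='job_calls',
        body=json.dumps({'job': job_name, 'payload': payload}),
        properties=pika.BasicProperties(delivery_mode=2),  # 2 = persistent (written to disk)
    )
```

A consumer would then acknowledge a message only after the call completes, so an unacknowledged task gets redelivered after a crash.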
How can we distinguish whether a restart was expected or not, i.e. whether it was a scheduled upgrade or the pod died because of an unexpected failure?
Is it possible to get the reason for the pod's restart from Kubernetes?
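At least partially, yes: a pod's container status exposes the last termination reason (e.g. OOMKilled) and the restart count. A minimal sketch with the official kubernetes Python client, assuming the pod name and namespace are known (the racetrack namespace here is a guess):

```python
from kubernetes import client, config

def get_restart_info(pod_name: str, namespace: str = 'racetrack'):
    """Read the last termination reason and restart count of a pod's containers."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    pod = client.CoreV1Api().read_namespaced_pod(pod_name, namespace)
    for status in pod.status.container_statuses or []:
        terminated = status.last_state.terminated
        if terminated is not None:
            # reason is e.g. 'OOMKilled', 'Error' or 'Completed'
            return terminated.reason, status.restart_count
    return None, 0
```

Note that this only covers in-place container restarts; a scheduled upgrade usually replaces the pod entirely, so the new pod carries no termination history.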
Having a maximum number of retry attempts sounds reasonable. During a scheduled upgrade, things are supposed to restart at most once. For instance, if a pod restarted 3 times in a row, we could abort the job call and report an error, breaking the infinite loop.
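A minimal sketch of such a bounded retry with exponential backoff (the attempt limit, delays, and endpoint are illustrative assumptions):

```python
import time
import requests

MAX_ATTEMPTS = 3  # assumed limit: abort after this many failures in a row

def call_job_with_retry(url: str, payload: dict) -> requests.Response:
    """Retry a job call after transient failures, aborting after MAX_ATTEMPTS."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = requests.post(url, json=payload, timeout=60)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            last_error = e
            if attempt < MAX_ATTEMPTS:
                time.sleep(2 ** attempt)  # backoff: 2s, 4s before the next attempt
    raise RuntimeError(f'job call failed after {MAX_ATTEMPTS} attempts') from last_error
```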