Occasionally, a job's pod might terminate unexpectedly due to an out-of-memory kill, or it might restart because of an upgrade or other reasons.
This can disrupt ongoing job calls, particularly long-running ones, resulting in unexpected error responses for the user.
We should consider implementing a mechanism to retry the request after a pod restarts.
However, we should be cautious about naively repeating failed requests,
as this could lead to an endless loop of requests with no way to stop it.
Automatic retries are risky, and there are many things that can get restarted, including: Kubernetes nodes, ingress, Racetrack's Pub, and Job pods.
Consider a message queue like RabbitMQ or Kafka to keep tasks truly persistent and to survive a restart.
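As a rough sketch of that idea, a job call could be published as a persistent message to a durable queue, so it survives a broker or pod restart. This example uses the pika client for RabbitMQ; the queue name and message shape are assumptions for illustration, not an agreed design:

```python
import json
import pika

# Broker address and queue name are hypothetical placeholders.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='rabbitmq'))
channel = connection.channel()
# A durable queue survives broker restarts.
channel.queue_declare(queue='job_calls', durable=True)

def enqueue_job_call(job_name: str, payload: dict) -> None:
    """Publish a job call as a persistent message so it outlives restarts."""
    channel.basic_publish(
        exchange='',
        routing_key='job_calls',
        body=json.dumps({'job': job_name, 'payload': payload}),
        properties=pika.BasicProperties(delivery_mode=2),  # 2 = persistent (written to disk)
    )
```

A consumer would then acknowledge a message only after the call completes, so an unacknowledged task gets redelivered after a crash.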
How can we distinguish whether a restart was expected or not, i.e. whether it was a scheduled upgrade or the pod died because of an unexpected failure?
Is it possible to get the reason for the pod's restart from Kubernetes?
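At least partially, yes: a pod's container status exposes the last termination reason (e.g. OOMKilled) and the restart count. A minimal sketch with the official kubernetes Python client, assuming the pod name and namespace are known (the racetrack namespace here is a guess):

```python
from kubernetes import client, config

def get_restart_info(pod_name: str, namespace: str = 'racetrack'):
    """Read the last termination reason and restart count of a pod's containers."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    pod = client.CoreV1Api().read_namespaced_pod(pod_name, namespace)
    for status in pod.status.container_statuses or []:
        terminated = status.last_state.terminated
        if terminated is not None:
            # reason is e.g. 'OOMKilled', 'Error' or 'Completed'
            return terminated.reason, status.restart_count
    return None, 0
```

Note that this only covers in-place container restarts; a scheduled upgrade usually replaces the pod entirely, so the new pod carries no termination history.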
Having a maximum number of retry attempts sounds reasonable. During a scheduled upgrade, things are supposed to restart at most once. For instance, if a pod restarted 3 times in a row, we could abort the job call and report an error, breaking the infinite loop.
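A minimal sketch of such a bounded retry with exponential backoff (the attempt limit, delays, and endpoint are illustrative assumptions):

```python
import time
import requests

MAX_ATTEMPTS = 3  # assumed limit: abort after this many failures in a row

def call_job_with_retry(url: str, payload: dict) -> requests.Response:
    """Retry a job call after transient failures, aborting after MAX_ATTEMPTS."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = requests.post(url, json=payload, timeout=60)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            last_error = e
            if attempt < MAX_ATTEMPTS:
                time.sleep(2 ** attempt)  # backoff: 2s, 4s before the next attempt
    raise RuntimeError(f'job call failed after {MAX_ATTEMPTS} attempts') from last_error
```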