TheRacetrack / racetrack

An opinionated framework for deploying, managing, and serving application workloads
https://theracetrack.github.io/racetrack/
Apache License 2.0

Clean up the messy job after deployment fails #436

Closed iszulcdeepsense closed 2 months ago

iszulcdeepsense commented 3 months ago

If a failure happens while deploying a job to Kubernetes, Racetrack might end up in a transient state where the job exists in the database and some resources have been created in Kubernetes, but the pod is not running properly due to an initialization error (whether it's the model's fault or the cluster's). Either way, Racetrack should clean up these leftover Kubernetes resources if a model fails to start for any reason.
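To illustrate the cleanup behavior I have in mind, here is a minimal sketch. The `Deployer` interface and its method names (`create_resources`, `wait_until_ready`, `delete_resources`) are hypothetical and made up for this example; they are not Racetrack's actual internals.

```python
import logging
from typing import Protocol

logger = logging.getLogger(__name__)


class Deployer(Protocol):
    """Hypothetical interface, assumed for illustration only."""

    def create_resources(self, job_name: str) -> None: ...
    def wait_until_ready(self, job_name: str) -> None: ...
    def delete_resources(self, job_name: str) -> None: ...


def deploy_with_cleanup(deployer: Deployer, job_name: str) -> None:
    """Deploy a job; if anything fails, remove whatever was created."""
    try:
        deployer.create_resources(job_name)
        deployer.wait_until_ready(job_name)
    except Exception:
        # Best-effort cleanup: a failure here must not hide the original error.
        try:
            deployer.delete_resources(job_name)
        except Exception:
            logger.exception("cleanup failed for job %s", job_name)
        raise  # re-raise the original deployment error
```

The original error is re-raised after cleanup, so the caller still sees why the deployment failed.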

However, initializing some heavy jobs and creating Kubernetes pods for them may take a long time (for instance, just pulling the image might take around 8 minutes). With that in mind, the hard timeout should be increased to at least 15 minutes. The Racetrack client should also be able to extend this timeout if needed.
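A rough sketch of how such a timeout could work: a readiness wait with a 15-minute default that the caller can override. The function name, its parameters, and the `is_ready` callback are assumptions for illustration, not existing Racetrack code.

```python
import time
from typing import Callable, Optional

# Proposed default from this issue: 15 minutes.
DEFAULT_TIMEOUT_SECONDS = 15 * 60


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_seconds: Optional[float] = None,
    poll_interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll is_ready() until it returns True or the timeout elapses.

    timeout_seconds is the (hypothetical) client-supplied override;
    None means the default applies.
    """
    timeout = DEFAULT_TIMEOUT_SECONDS if timeout_seconds is None else timeout_seconds
    deadline = clock() + timeout
    while True:
        if is_ready():
            return
        if clock() >= deadline:
            raise TimeoutError(f"job not ready after {timeout:.0f} seconds")
        sleep(poll_interval)
```

A client deploying an especially heavy job could then pass, say, `timeout_seconds=30 * 60`, while everything else keeps the default.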