TheRacetrack / racetrack

An opinionated framework for deploying, managing, and serving application workloads
https://theracetrack.github.io/racetrack/
Apache License 2.0

Informative error about crash during Job initialization #469

Open iszulcdeepsense opened 4 weeks ago

iszulcdeepsense commented 4 weeks ago

(issue reported by Rasmus)

When a Job does heavy computation at startup (the initialization step), it may crash, most often because of an out-of-memory kill. Eventually Kubernetes kills it, producing the following logs (which is expected):

[2024-06-03 11:01:52] INFO  GET /ready 500
...
[2024-06-03 10:02:21] INFO  uvicorn.error: Shutting down
job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w
[2024-06-03 10:02:21] INFO  uvicorn.error: Waiting for application shutdown.
job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w
[2024-06-03 10:02:21] INFO  uvicorn.error: Application shutdown complete.
job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w
[2024-06-03 10:02:21] INFO  uvicorn.error: Finished server process [1]
job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w
[2024-06-03 10:02:21] INFO  received signal 15, shutting down...

"received signal 15" refers to SIGTERM, a graceful termination request sent by Kubernetes.
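For reference, a small sketch of how a signal number from such a log line can be translated into its name with Python's standard `signal` module. The helper name `describe_signal` is made up for illustration and is not part of Racetrack; the numbers shown are the usual Linux values.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name of a signal number (hypothetical helper)."""
    try:
        # signal.Signals is an enum of the signals known on this platform
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to any known signal
        return f"unknown signal {number}"


print(describe_signal(15))
```

On Linux this prints `SIGTERM`, matching the log line above.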

While the Job is restarting, the Lifecycle server keeps checking whether the Job is ready, which ends with a misleading error shown to the user:

[2024-06-03 10:23:53] ERROR deployment error: verifying deployed job: Request GET http://job-*-v-0-0-1-artifactory-.ikp-rt.svc:7000/ready failed: <urlopen error [Errno 111] Connection refused

or

[2024-06-03 10:21:53] ERROR deployment error: verifying deployed job: Request to http://job-*-v-0-0-1-artifactory.ikp-rt.svc:7000/ready failed: [Errno 104] Connection reset by peer

Let's consider whether we can do better here, at least by producing an informative error in such cases.
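One possible direction, as a rough sketch only: translate the low-level connection errors seen above (Errno 111 and Errno 104) into a hint that the Job probably crashed or was restarted during initialization. The names `describe_readiness_failure` and `check_ready` are hypothetical and do not reflect the actual Lifecycle code; the sketch assumes the readiness check is done with `urllib`, as the `urlopen error` in the log suggests.

```python
import errno
import urllib.error
import urllib.request


def describe_readiness_failure(exc: BaseException) -> str:
    """Map a low-level connection error to a hint for the user (hypothetical helper)."""
    # urllib wraps socket errors in URLError; the underlying error is in .reason
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    code = getattr(reason, "errno", None)
    if code == errno.ECONNREFUSED:
        return (
            "Job is not accepting connections (connection refused). "
            "It may have crashed or been restarted during initialization, "
            "for example due to an out-of-memory kill. Check the Job logs."
        )
    if code == errno.ECONNRESET:
        return (
            "Connection to the Job was reset. "
            "The Job process may have been terminated while starting up. "
            "Check the Job logs."
        )
    # Anything else: keep the original error text
    return f"Job readiness check failed: {exc}"


def check_ready(url: str, timeout: float = 5.0) -> None:
    """Call the readiness endpoint and raise an error with an informative message."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                raise RuntimeError(f"Job is not ready: HTTP {response.status}")
    except (urllib.error.URLError, ConnectionError) as exc:
        raise RuntimeError(describe_readiness_failure(exc)) from exc
```

A further step, not shown here, could be to ask Kubernetes for the pod's restart count or last termination reason and include that in the message.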