Informative error about crash during Job initialization

(issue reported by Rasmus)

When a Job does heavy computation at startup (its initialization step), it may crash, most often because of an out-of-memory kill. Kubernetes then terminates it, producing logs like the following (this part is expected):

[2024-06-03 11:01:52] INFO GET /ready 500
...
[2024-06-03 10:02:21] INFO uvicorn.error: Shutting down
job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w
[2024-06-03 10:02:21] INFO uvicorn.error: Waiting for application shutdown.
job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w
[2024-06-03 10:02:21] INFO uvicorn.error: Application shutdown complete.
job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w
[2024-06-03 10:02:21] INFO uvicorn.error: Finished server process [1]
job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w
[2024-06-03 10:02:21] INFO received signal 15, shutting down...

"received signal 15" refers to SIGTERM, a graceful termination request sent by Kubernetes.
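As a quick illustration of the signal number mentioned above, the sketch below maps a numeric signal to its symbolic name using Python's standard library. The helper name describe_signal is hypothetical and is not part of Racetrack.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a signal number (hypothetical helper)."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return f"unknown signal {signum}"


print(describe_signal(15))
```

Logging the symbolic name alongside the number could make a shutdown message like the one above easier to interpret.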

While the Job is restarting, the Lifecycle server keeps checking whether it is ready, which ends with a misleading error shown to the user:

[2024-06-03 10:23:53] ERROR deployment error: verifying deployed job: Request GET http://job-*-v-0-0-1-artifactory-.ikp-rt.svc:7000/ready failed: <urlopen error [Errno 111] Connection refused

[2024-06-03 10:21:53] ERROR deployment error: verifying deployed job: Request to http://job-*-v-0-0-1-artifactory.ikp-rt.svc:7000/ready failed: [Errno 104] Connection reset by peer

Let's consider whether we can do better, at least by producing an informative error in such cases.
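One possible direction, sketched below, is to translate the low-level connection errors seen above (Errno 111 and Errno 104) into a message that hints at a crash during initialization. This is only a minimal sketch and not the actual Lifecycle code: the function names explain_readiness_failure and check_ready, the message wording, and the timeout value are all assumptions made for illustration.

```python
import errno
import urllib.error
import urllib.request


def explain_readiness_failure(exc: BaseException) -> str:
    """Turn a low-level connection error into a more informative message.

    Hypothetical helper: not part of Racetrack.
    """
    # urllib wraps socket errors in URLError and keeps the original in .reason;
    # errors raised while reading the response may arrive unwrapped.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    code = getattr(reason, "errno", None)
    if code in (errno.ECONNREFUSED, errno.ECONNRESET):
        return (
            "Job is not accepting connections. It may have crashed or been "
            "restarted during initialization (for example, after running out "
            "of memory). Check the Job logs for details."
        )
    return f"Job readiness check failed: {exc}"


def check_ready(url: str, timeout: float = 5.0) -> None:
    """Call the Job's readiness endpoint and raise a descriptive error on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                raise RuntimeError(f"Job is not ready: HTTP {response.status}")
    except (urllib.error.URLError, OSError) as exc:
        raise RuntimeError(explain_readiness_failure(exc)) from exc
```

A refused or reset connection is only a hint, not proof, of a crash, so the message stays tentative and points the user to the Job logs. A more reliable variant would presumably ask Kubernetes for the pod's last termination reason, which is outside the scope of this sketch.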

TheRacetrack / racetrack

Informative error about crash during Job initialization #469