When a job does heavy computation during startup (the initialization step), that work may cause it to crash, most commonly due to an out-of-memory (OOM) kill. The job then gets killed by Kubernetes, producing logs like these (which is normal):
(from pod job-artifactory-test-v-0-0-1-artifactorytest-5b695bf7fc-tf77w)

```
[2024-06-03 11:01:52] INFO  GET /ready 500
...
[2024-06-03 10:02:21] INFO  uvicorn.error: Shutting down
[2024-06-03 10:02:21] INFO  uvicorn.error: Waiting for application shutdown.
[2024-06-03 10:02:21] INFO  uvicorn.error: Application shutdown complete.
[2024-06-03 10:02:21] INFO  uvicorn.error: Finished server process [1]
[2024-06-03 10:02:21] INFO  received signal 15, shutting down...
```
`received signal 15` means SIGTERM, the graceful termination request that Kubernetes sends before forcefully killing a pod.
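For illustration, this is roughly what a graceful SIGTERM handler looks like in Python; uvicorn installs an equivalent handler internally, so this is a sketch of the mechanism, not the job's actual code:

```python
import signal
import sys

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first and waits terminationGracePeriodSeconds
    # (30s by default) before escalating to SIGKILL.
    print(f"received signal {signum}, shutting down...")
    # ... finish in-flight requests, close connections, flush logs ...
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```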
While the job is restarting, the Lifecycle server still checks whether the Job is ready, which ends with a misleading error shown to the user.
(issue reported by Rasmus)
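For context, the `GET /ready 500` lines come from the readiness probe. A minimal sketch of such an endpoint, assuming a FastAPI app (which matches the uvicorn logs; the names here are illustrative, not the job's actual code):

```python
from fastapi import FastAPI, Response

app = FastAPI()
initialized = False  # set to True once the heavy initialization step finishes

@app.get("/ready")
def ready(response: Response):
    # Until initialization completes, report not-ready; Kubernetes (and the
    # Lifecycle server) keep probing and see 500s like in the logs above.
    if not initialized:
        response.status_code = 500
        return {"status": "initializing"}
    return {"status": "ok"}
```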
Let's think about whether we can do better here, at least by producing an informative error in such a case.
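One possible direction (a sketch, not a concrete proposal): before reporting a bare readiness failure, the Lifecycle server could inspect the pod's last container state through the Kubernetes API and surface the real cause. This assumes the official `kubernetes` Python client; `describe_failure` is a hypothetical helper:

```python
from kubernetes import client, config

def describe_failure(namespace: str, pod_name: str) -> str:
    config.load_incluster_config()  # assumes we run inside the cluster
    pod = client.CoreV1Api().read_namespaced_pod(pod_name, namespace)
    for cs in pod.status.container_statuses or []:
        last = cs.last_state.terminated
        if last is not None:
            # reason is "OOMKilled" when the container hit its memory limit
            return (f"container '{cs.name}' was terminated: "
                    f"reason={last.reason}, exit code={last.exit_code}")
    return "job is not ready yet (pod is restarting)"
```

With something like this, instead of a generic readiness error the user could see e.g. `container 'artifactorytest' was terminated: reason=OOMKilled, exit code=137`, which points directly at the memory problem.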