xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
Allow SIGTERM error code to be returned from XPK #162
When XPK catches SIGTERM for graceful migration, we hold on to the exit code and return 0. In the SIGTERM case, however, we should always propagate the error code (143 / SIGTERM) so that, if the jobset has --max-restarts set > 0, the job can restart properly.
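The intended behavior can be illustrated with a minimal Python sketch. It is not XPK's actual implementation: `TerminationFlag`, `resolve_exit_code`, and `run_with_sigterm_propagation` are hypothetical names chosen for illustration, and the sketch assumes the wrapper itself receives the signal. The point is only that once a SIGTERM has been observed, the wrapper reports 143 (128 + 15) instead of the workload's own exit code.

```python
import signal

# Conventional shell exit status for a process terminated by SIGTERM: 128 + 15.
SIGTERM_EXIT_CODE = 128 + int(signal.SIGTERM)


class TerminationFlag:
    """Records whether SIGTERM was delivered (hypothetical helper)."""

    def __init__(self):
        self.received = False

    def handle(self, signum, frame):
        # Only note the signal; the workload keeps running so it can shut down gracefully.
        self.received = True


def resolve_exit_code(workload_code, sigterm_received):
    """Pick the exit code to report once the workload has finished."""
    if sigterm_received:
        # Always surface the SIGTERM, even if the workload exited cleanly.
        return SIGTERM_EXIT_CODE
    return workload_code


def run_with_sigterm_propagation(run_workload):
    """Run a callable that returns an exit code, propagating SIGTERM as 143."""
    flag = TerminationFlag()
    previous = signal.signal(signal.SIGTERM, flag.handle)
    try:
        code = run_workload()
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)
    return resolve_exit_code(code, flag.received)
```

A caller would pass the result to `sys.exit(...)`, so that a non-zero status reaches the jobset and counts as a failure eligible for restart.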
Fixes / Features
Allows the jobset to restart when XPK receives a SIGTERM.
Testing / Documentation
Verified that XPK caught a SIGTERM properly and that the jobset restarted.
[ y ] Tests pass
[ y ] Appropriate changes to documentation are included in the PR