AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

Allow SIGTERM error code to be returned from XPK #162

Closed. Obliviour closed this 3 months ago

Obliviour commented 3 months ago

When XPK catches SIGTERM for graceful migration, it currently holds on to the exit code and returns 0. In the SIGTERM case, however, we should always propagate the error code (143, i.e. 128 + 15 for SIGTERM) so that a jobset with --max-restarts set greater than 0 can properly restart.
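For illustration only, here is a minimal Python sketch of the intended behavior. It is not the actual XPK change: `SigtermTracker`, `install_handler`, and `resolve_exit_code` are hypothetical names, and the sketch assumes a wrapper process that records whether SIGTERM arrived and then picks its own exit code.

```python
import signal

# Conventional shell exit code for a process terminated by SIGTERM: 128 + 15.
SIGTERM_EXIT_CODE = 128 + int(signal.SIGTERM)


class SigtermTracker:
    """Hypothetical helper that remembers whether SIGTERM was received."""

    def __init__(self) -> None:
        self.received = False

    def handle(self, signum, frame) -> None:
        # Only record the signal here; cleanup happens in the main flow.
        self.received = True


def install_handler(tracker: SigtermTracker) -> None:
    """Register the tracker as the SIGTERM handler (main thread only)."""
    signal.signal(signal.SIGTERM, tracker.handle)


def resolve_exit_code(workload_exit_code: int, sigterm_received: bool) -> int:
    """Pick the exit code the wrapper should return.

    If SIGTERM was caught, always report 143 instead of the held exit code,
    so a restart policy sees a failure. Otherwise pass the workload's own
    exit code through unchanged.
    """
    if sigterm_received:
        return SIGTERM_EXIT_CODE
    return workload_exit_code
```

In such a wrapper, the last step would be something like `sys.exit(resolve_exit_code(code, tracker.received))`, so a graceful shutdown still surfaces as a non-zero exit.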

Fixes / Features

Testing / Documentation

Verified that XPK catches SIGTERM properly and that the jobset restarted.